Unstructured Data

  • Comments posted to this topic are about the item Unstructured Data

  • There is a great distinction that should be made between the terms "unstructured" and "non-discrete".  In the example of a video or audio file, most folks call that unstructured which, in my humble opinion, is totally incorrect.  For example, an MP3 file is actually highly structured which is what allows it to be processed by numerous applications to replay and/or edit the file.  However, data represented within the file (such as words spoken within the file) are not discrete and cannot be easily identified.  It is for that reason that many folks refer to it as unstructured.  A truly unstructured file could be a binary blob with an unknown or unrecognized structure.

    Even in the database community there is a large amount of non-discrete data.  The healthcare industry is awash in such data stored in audio dictation notes, free-form text fields, etc.  It is structured in the sense that it can be parsed, but non-discrete in that the data is not readily identifiable or consistent for a particular purpose.  That is the huge challenge that faces us.  For most healthcare providers, as long as they can read free-form text they can glean information from it, but it's not so easy for software to turn free-form text into discrete data for more efficient and effective processing.

    The editorial raised a valid point that data structured even within a database table may not be discrete enough for a given application.  All structure is determined for a given purpose and such structure may not be suitable for any other purpose.  For example, an MP3 file structure is useless for text processing.  Similarly, even data organized in an RDBMS for a particular purpose may be in fact useless for another purpose.  Structure is quite relevant to the purpose at hand and in fact could limit data application for other purposes.  Then again, these are the challenges that keep folks like us gainfully employed and on our toes! 🙂
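    To make the "parseable but non-discrete" point concrete, here is a minimal sketch of pulling one discrete value out of free-form notes. The note wording, the `BP` abbreviation, and the function name are all invented for illustration; real clinical text is far messier, and this is not a recommended extraction method.

```python
import re
from typing import Optional, Tuple

# Assumed convention: a reading is written as "BP 120/80" or "bp: 120 / 80".
BP_PATTERN = re.compile(r"\bBP\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE)


def extract_blood_pressure(note: str) -> Optional[Tuple[int, int]]:
    """Return (systolic, diastolic) if the note matches the assumed pattern."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical notes: a human can read all three, the pattern only two.
notes = [
    "Patient reports mild headache. BP 120/80, pulse regular.",
    "bp: 135 / 90 on arrival; advised follow-up.",
    "Blood pressure was one-thirty over eighty-five.",
]
readings = [extract_blood_pressure(n) for n in notes]
for note, reading in zip(notes, readings):
    print(reading, "<-", note)
```

    The third note carries the same information for a human reader, but the pattern misses it, which is the gap between readable text and discrete data.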

  • A good amount of my JSON data is highly structured in the sense that I do necessarily know where the data is and how to separate it. Thus, the file format, in most cases, has little to do with whether the data is structured or non-structured. It's mostly about the methodology by which that data was constructed and its consistency.

    Unfortunately, there are times when you as a human can see how the data can be found and how to separate it, but the tools you are using cannot. Therefore, to quote Steve, you run into data where the machine doesn't necessarily know where the data is, or how to separate it. But can you consider it unstructured because you don't have a tool that can do it for you?

    Feeding off Aaron here, it really depends on the application or the end use. While it's true that a Word or PDF document may be unstructured to you, if the documents contain grammatically correct English, then you do have some guidance on how to separate the data through punctuation. Just with this post and this thread, every sentence likely begins with a capitalized word and ends with a period. To a data scientist who is analyzing language and wants to feed each sentence into an algorithm, this thread is structured data.
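    As a rough sketch of that idea, the snippet below splits prose into sentences using nothing but punctuation and capitalization. The function name and the sample text are made up for illustration, and the rule is deliberately naive: abbreviations such as "Dr. Smith" would be split in the wrong place.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split English prose into sentences.

    Assumes a sentence ends with '.', '!' or '?', is followed by
    whitespace, and that the next sentence starts with a capital letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


# Sample text written for this example.
post = (
    "It really depends on the end use. A Word document may look "
    "unstructured to you! Is it unstructured to a linguist?"
)
sentences = split_sentences(post)
for sentence in sentences:
    print(sentence)
```

    Each element of `sentences` can then be handed to whatever per-sentence analysis is needed, which is the sense in which grammatical text carries usable structure.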

  • Thanks for the editorial! I think it's always helpful to revisit working definitions of terms. As far as unstructured data, I have always taken that as a shorthand description for non-relational or non-normalized data. 

    Whether it is accepted as that shorthand, I don't know. If it confuses more than it helps, maybe someone has to more clearly define what they mean by "unstructured" and go from there. If they mean a shorthand description for non-relational or non-normalized data then maybe switch to those terms. If they mean something else, figure out what that something else is and switch to those terms for the remainder of the discussion.

    Just my two cents (not adjusted for inflation lol).
    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • How about "unnormalized" instead of unstructured? As many have said there's really no such thing as unstructured data.

    Take Word files, for instance. Word files are *very* structured; it's just a hierarchical structure, not a relational one. MP3s, JPGs, even AutoCAD files are all structured.

    Given the proper API you can find whatever you want in any given file, be it Word, XML, or whatever.

    Whether that API is suited to efficient keyword searches is another question altogether... :hehe:

  • It's definitely a verbiage issue.

    I say Word or an MP3 is unstructured as I can't discern the information readily. There is a format to the file, which means I can render it with an application, but the data isn't structured. Data being the items I extract to produce information. I get what Aaron, Roger, and xsevensinzx are saying. It's a question of whether you relate structured or unstructured to the format or the information. I'd say the latter.

    I think XML/JSON are semi-structured in that I know where data is, but I don't necessarily know where all the data is, or even if those elements exist. They can be ragged, or have extensions. A structured format would ensure that all rows/nodes/elements are of a consistent structure. I could always query the "AlbumReleaseDate" in an Album relational table, but in JSON/XML, that node/element might not exist for a particular item, so I have to account for it existing or not. Or I'd just ignore the "AlbumReReleaseDate" in a node because I'm not aware that it was added to the document.
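    A small sketch of what "accounting for it existing or not" can look like. The album documents and the `Title` key are hypothetical; only the "AlbumReleaseDate" and "AlbumReReleaseDate" names come from the post above.

```python
import json

# Hypothetical, ragged album documents: one lacks a release date,
# another carries an extension key the reader was never told about.
raw = """
[
  {"Title": "First Album", "AlbumReleaseDate": "1999-05-01"},
  {"Title": "Second Album"},
  {"Title": "Third Album", "AlbumReleaseDate": "2005-09-12",
   "AlbumReReleaseDate": "2015-09-12"}
]
"""

# Keys this consumer knows about (an assumption made for the example).
EXPECTED_KEYS = {"Title", "AlbumReleaseDate"}

albums = json.loads(raw)

# dict.get returns None instead of raising KeyError when the key is absent.
release_dates = [album.get("AlbumReleaseDate") for album in albums]

# Keys present in the data that the consumer would otherwise silently ignore.
unexpected = sorted({key for album in albums for key in album} - EXPECTED_KEYS)

print(release_dates)
print(unexpected)
```

    In a relational table the column would exist for every row; here the consumer has to decide what a missing key means and whether unknown keys should be reported or ignored.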

  • With the idea of coming up with a "better term" for such data, I've always kept it simple... "Other Junk".   When I need to be more politically correct, I call it "OPS data", the clean version of which stands for "Other People's Stuff".  I'm pretty sure you can guess what the DBA version of that would be. :exclamation:

    At least it has the latest shiny buzzword of "OPS" in it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • Jeff Moden - Thursday, October 26, 2017 9:30 AM

    With the idea of coming up with a "better term" for such data, I've always kept it simple... "Other Junk".   When I need to be more politically correct, I call it "OPS data", the clean version of which stands for "Other People's Stuff".  I'm pretty sure you can guess what the DBA version of that would be. :exclamation:

    Leave it to you Jeff to look at things from that perspective! 😛

    I also suppose that it matters what information you're interested in as to whether or not the data is discrete.  For an MP3 file, if you're after audio characteristics then the information is structured.  If you're after the text representation of the spoken word in an MP3 file, it's unstructured, just as free-form text is non-discrete per its content.  In other words, it's all relative!

  • >>> Is “unstructured data” a bad term?
    >>> You might disagree, but give me a better term to describe where the information is stored in a data format.

    ...to intelligently answer such types of questions, one may greatly benefit from first more precisely defining what “a bad term” means (in any given context).

    There may be contexts in which such “unstructured data” terminology may be intelligible and even aid understanding; and others in which the term may serve to:
    i) actively inhibit arguably “better” ways of understanding how more relationally sound products may function and
    ii) promote or enforce arguably somewhat “backward” ways of understanding how current DBMS products may or may not be reasonable approximations of relational or other data model based DBMSs.

    Given that, something similar to either of the following terms may potentially be superior to the term “unstructured data” in various technical contexts:

    • Operationally unclosed <datum or data>
    • Operationally incomplete <datum or data>

    Hope this helps
    SM

  • An email or MP3 may have a header and other structured meta-data, but the contents typically follow no formal EDI structure. Even the contents may not conform to proper grammatical structure. A lot of IT talent and resources are wasted trying to data mine meaningful information from Twitter and Facebook.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Friday, October 27, 2017 6:53 AM

    An email or MP3 may have a header and other structured meta-data, but the contents typically follow no formal EDI structure. Even the contents may not conform to proper grammatical structure. A lot of IT talent and resources are wasted trying to data mine meaningful information from Twitter and Facebook.

    I wouldn't say wasted if it's relevant. You're also comparing apples to oranges here. Facebook and Twitter give marketers the ability to gain insights into another form of word of mouth. While you may think it's a waste for data guys to spend time trying to tap into that data, others may not because of its relevance.

    That would be apples. The oranges would be someone sending you data that is not conformed and not structured, where your IT talent and resources are wasting time trying to extract, transform, and load that data due to someone else's inability to comply with a proper means of data delivery.
