Unstructured data sources into a structured DW

  • I'm starting to see more data sources from JSON-based APIs (mostly 3rd party). JSON must make the app developers' lives easier, but it makes my life as a BI dev hell.

    We have to apply a model but the issue is we don't know the JSON keys until we see the data. I have no primary key in the source data.

    I'm interested in learning how best to solve this problem from the DW perspective.

    We can create a model that works today but may be broken tomorrow.

    The bigger problem is that in the BI domain we're quite far from the App domain, hence it's almost impossible to get any sort of data contract.

    I'm wondering how other DW folks have handled this?

    Best

    Lee

  • leehbi (7/4/2016)


    We can create a model that works today but may be broken tomorrow.

    The bigger problem is that in the BI domain we're quite far from the App domain, hence it's almost impossible to get any sort of data contract.

    I'm wondering how other DW folks have handled this?

    I assume you mean other than putting on your favorite postal uniform and paying them a personal visit to show them what it means to have consistent and reliable data sources?

    It doesn't matter if it's JSON, XML, CSV, TSV, Fixed Field, or what. This is always going to be a problem if the source of the data is provided by people that simply don't get it. If they can't be compelled to provide consistent and reliable data, then it's all on you to make sense of it and to try to automate as much of the correction for lunacy as possible and to detect and report on spurious shortcomings.

    The only way that can be done is for someone to write some code on your side to do some conversions. Unless you're using SQL Server 2016, which has some decent JSON shredding abilities, having someone write a JSON shredder to create the equivalent data as one or more flattened TSV (Tab Separated Values) files will likely be the most effective route. If they can make the equivalent of an EAV (Entity, Attribute, Value) file that you can easily import, then it should be fairly easy to "map" the data to the appropriate column using either Dynamic CROSSTABs or PIVOTs.
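
    For illustration, here's a rough sketch of that kind of shredder in Python 3 (the sample payload and the output file name are hypothetical): it walks an arbitrary JSON document and writes flattened Entity/Attribute/Value rows to a TSV file that can then be bulk loaded and pivoted on the SQL Server side.

    import csv
    import json

    def flatten(obj, path=""):
        """Recursively yield (attribute_path, value) pairs from one JSON document."""
        if isinstance(obj, dict):
            for key, val in obj.items():
                yield from flatten(val, path + "." + key if path else key)
        elif isinstance(obj, list):
            for i, val in enumerate(obj):
                yield from flatten(val, "{0}[{1}]".format(path, i))
        else:
            yield path, obj

    # Hypothetical sample payload standing in for one API pull (a JSON array of objects).
    payload = json.loads('[{"id": 1, "cust": {"name": "Acme", "tags": ["a", "b"]}}]')

    with open("shredded.tsv", "w", newline="") as out:
        tsv = csv.writer(out, delimiter="\t")
        tsv.writerow(["entity", "attribute", "value"])
        for entity_id, document in enumerate(payload):
            for attribute, value in flatten(document):
                tsv.writerow([entity_id, attribute, value])

    Because every key becomes a row rather than a column, new or renamed attributes show up as extra rows instead of breaking a fixed column list; the dynamic CROSSTAB/PIVOT on the SQL Server side then decides which attributes become warehouse columns.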

    The best thing to do, though, is to have a real "come to Jesus" meeting with the people that rely on the data and have THEM compel the data-vomit spewers to at least be consistent. If they value the data, they will lend a hand in making that happen. You don't need to be the only one that should put on a postal uniform. DO BE VERY WELL PREPARED to cough up a design that you want them to follow along with the reasons why that design should be followed to a "T". I strongly recommend against any tag based or hierarchical format such as JSON and XML. It's just not worth the aggravation to them, you, or the pipe that the data has to travel across.

    Heh... in case you couldn't tell, I don't tolerate such stupidity lightly (not you... the people providing the data). Data is the key to the company's success. It's about time the rest of the folks involved get seriously religious about data transmissions and adopt the same policy. There also needs to be a policy where no unapproved changes will occur. Everyone needs time to adapt their code, tables, and reports to changes. They wouldn't like or tolerate it if the shoe were on the other foot.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff. I've already fed this back to the data owners. I'm going down the route of parsing the JSON into SQL and un-pivoting.

  • You bet and thanks for the feedback. Sorry I don't know of a better answer. These types of problems are really tough. Heh... how can you find the white car when they've painted it red? 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/4/2016)


    leehbi (7/4/2016)


    We can create a model that works today but may be broken tomorrow.

    The bigger problem is that in the BI domain we're quite far from the App domain, hence it's almost impossible to get any sort of data contract.

    I'm wondering how other DW folks have handled this?

    I assume you mean other than putting on your favorite postal uniform and paying them a personal visit to show them what it means to have consistent and reliable data sources?

    It doesn't matter if it's JSON, XML, CSV, TSV, Fixed Field, or what. This is always going to be a problem if the source of the data is provided by people that simply don't get it. If they can't be compelled to provide consistent and reliable data, then it's all on you to make sense of it and to try to automate as much of the correction for lunacy as possible and to detect and report on spurious shortcomings.

    The only way that can be done is for someone to write some code on your side to do some conversions. Unless you're using SQL Server 2016, which has some decent JSON shredding abilities, having someone write a JSON shredder to create the equivalent data as one or more flattened TSV (Tab Separated Values) files will likely be the most effective route. If they can make the equivalent of an EAV (Entity, Attribute, Value) file that you can easily import, then it should be fairly easy to "map" the data to the appropriate column using either Dynamic CROSSTABs or PIVOTs.

    The best thing to do, though, is to have a real "come to Jesus" meeting with the people that rely on the data and have THEM compel the data-vomit spewers to at least be consistent. If they value the data, they will lend a hand in making that happen. You don't need to be the only one that should put on a postal uniform. DO BE VERY WELL PREPARED to cough up a design that you want them to follow along with the reasons why that design should be followed to a "T". I strongly recommend against any tag based or hierarchical format such as JSON and XML. It's just not worth the aggravation to them, you, or the pipe that the data has to travel across.

    Heh... in case you couldn't tell, I don't tolerate such stupidity lightly (not you... the people providing the data). Data is the key to the company's success. It's about time the rest of the folks involved get seriously religious about data transmissions and adopt the same policy. There also needs to be a policy where no unapproved changes will occur. Everyone needs time to adapt their code, tables, and reports to changes. They wouldn't like or tolerate it if the shoe were on the other foot.

    Preach it, brother!!!

  • leehbi (7/4/2016)


    I'm starting to see more data sources from JSON-based APIs (mostly 3rd party). JSON must make the app developers' lives easier, but it makes my life as a BI dev hell.

    We have to apply a model but the issue is we don't know the JSON keys until we see the data. I have no primary key in the source data.

    I'm interested in learning how best to solve this problem from the DW perspective.

    We can create a model that works today but may be broken tomorrow.

    The bigger problem is that in the BI domain we're quite far from the App domain, hence it's almost impossible to get any sort of data contract.

    I'm wondering how other DW folks have handled this?

    Best

    Lee

    So, they just toss JSON objects over the fence and ask you to "warehouse" them? With no requirements about data elements or use cases for the end users, feel free to just store it all in a folder on the staging server, at least for now.

    I mean, what does a bakery do with an airplane propeller when it gets dumped on the loading dock with no specific instructions?

    For what it's worth, MongoDB is designed to warehouse JSON objects, and it supports integration with SQL Server. If the app developers need to persist JSON objects long term for some reason, then maybe convince them to go that route and just leave it out of the relational data warehouse.
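
    As a rough sketch of that route (assuming Python 3 with the pymongo driver; the connection string, database, collection, and sample batch are all hypothetical), the raw documents could be staged in MongoDB as-is and shaped later:

    import json
    from pymongo import MongoClient   # assumes the pymongo package is installed

    # Hypothetical connection, database, and collection names.
    client = MongoClient("mongodb://localhost:27017/")
    staging = client["dw_staging"]["raw_api_documents"]

    # Hypothetical batch of documents pulled from the third-party API.
    batch = json.loads('[{"session": "a1b2", "cart": ["widget"]}, {"session": "c3d4", "cart": []}]')

    result = staging.insert_many(batch)   # each JSON object is stored as-is; MongoDB adds an _id
    print("staged", len(result.inserted_ids), "documents")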

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • There are a few articles on JSON here, including some that pre-date SQL 2016 - http://www.sqlservercentral.com/articles/SQLCLR/74160/

  • leehbi (7/4/2016)


    I'm starting to see more data sources from JSON-based APIs (mostly 3rd party). JSON must make the app developers' lives easier, but it makes my life as a BI dev hell.

    We have to apply a model but the issue is we don't know the JSON keys until we see the data. I have no primary key in the source data.

    I'm interested in learning how best to solve this problem from the DW perspective.

    We can create a model that works today but may be broken tomorrow.

    The bigger problem is that in the BI domain we're quite far from the App domain, hence it's almost impossible to get any sort of data contract.

    I'm wondering how other DW folks have handled this?

    Best

    Lee

    I handle this with Python scripts that are called by SSIS or stored procedures during the ETL process. It's straightforward, though it does require careful planning depending on the size of the data and the requirements.

    Python has several packages (libraries) that read JSON, write CSV, and transform data in memory on the fly. For example, I have a Python script that reads JSON and writes it to a CSV flat file, which is then absorbed by SSIS.

    In practice, this is how easy it is to load JSON in Python. It's done with one line of code. In this example, I'm loading JSON into the result variable from an HTTP response returned by an API that only outputs JSON.

    result = json.load(urllib2.urlopen(request))   # Python 2; requires import json, urllib2 and a prepared urllib2.Request in 'request'

    Once you have the JSON data loaded, you can use the csv module's csv.writer and iterate over the JSON data with a simple for loop to export JSON rows as CSV rows. It's easy to do, and the script can stand alone or be embedded inside your SSIS packages.

    For example, this is how easy it is to print the JSON data to screen. Instead of printing it, you can write it to CSV using the csv module in Python:

    for section in result["values"]:
        for row in section:
            print row

    Pretty simple to print the JSON data you read. Writing it to CSV takes a little more code, but it isn't complex: instead of printing each row, append it to an empty list and feed that list to the csv module so it can be written to a CSV file.
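
    For example, here's a minimal sketch of that step in Python 3 (the sample result and the output file name are assumptions, mirroring the snippet above):

    import csv

    # Hypothetical sample standing in for the parsed API response ("result" above).
    result = {"values": [[["2016-07-04", "widgets", 12],
                          ["2016-07-04", "gadgets", 7]]]}

    rows = []
    for section in result["values"]:      # same nesting as the print example above
        for row in section:
            rows.append(row)              # each JSON row becomes one CSV row

    with open("output.csv", "w", newline="") as out:
        csv.writer(out).writerows(rows)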

    Extra Credit

    This is a pretty straightforward way to load JSON and export it to CSV. Once you have the data, you can also clean and conform it with another module called Pandas. It lets you approach transforming data in much the same fashion as you would with SQL code: you can JOIN, MERGE, UNION, SELECT, UPDATE, and DELETE the data, all in memory. You can take the JSON data and completely transform it, shifting that part of the ETL away from SSIS or SQL Server.
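
    As a minimal sketch of that idea, assuming Python 3 with a reasonably recent Pandas (the sample documents, the customers table, and the column names are all hypothetical):

    import pandas as pd

    # Hypothetical sample data standing in for shredded JSON and a dimension table.
    documents = [
        {"order_id": 1, "customer": {"id": 10, "region": "EU"}, "amount": 25.0},
        {"order_id": 2, "customer": {"id": 11, "region": "US"}, "amount": 40.0},
    ]
    customers = pd.DataFrame({"customer.id": [10, 11], "name": ["Acme", "Globex"]})

    orders = pd.json_normalize(documents)                # flatten nested JSON into a tabular frame

    # SQL-style work in memory: JOIN, then GROUP BY with SUM.
    enriched = orders.merge(customers, on="customer.id", how="left")
    summary = enriched.groupby("customer.region")["amount"].sum().reset_index()

    summary.to_csv("order_summary.csv", index=False)     # hand the result to SSIS or BULK INSERT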

    But be careful: with larger data sets it can be troublesome, and splitting the data into manageable pieces is the best approach.

  • Although you can technically store unstructured data in SQL Server, the question is: Why would you?

    If you're ultimately trying to import facts and dimensions (as properly defined by the business) from JSON or XML into a relational or star-schema modeled database, the thing you have to understand is that the schema of your source data is going to be very fluid. The app developers are going to be giving you constant headaches with this. Personally, I would investigate using MongoDB as the primary persistence layer for JSON objects. You can also leverage 3rd-party SSIS data flow tasks for performing ETL from MongoDB into SQL Server as needed.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (7/6/2016)


    Although you can technically store unstructured data in SQL Server, the question is: Why would you?

    If you're ultimately trying to import facts and dimensions (as properly defined by the business) from JSON or XML into a relational or star-schema modeled database, the thing you have to understand is that the schema of your source data is going to be very fluid. The app developers are going to be giving you constant headaches with this. Personally, I would investigate using MongoDB as the primary persistence layer for JSON objects. You can also leverage 3rd-party SSIS data flow tasks for performing ETL from MongoDB into SQL Server as needed.

    Where are you getting that it's unstructured? JSON is highly structured. It's just able to twist, bend, and mold itself around the data, altering its structure to accept change. That doesn't mean it isn't structured; its structure is just not consistent. Unstructured data is things like Word documents, email messages, and so forth, where the structure follows no rules. JSON follows rules. You as the DBA just don't have all the tools, or maybe the easiest approach, to deal with those rules, because they allow something static to become dynamic (i.e., the schema shifts).

    The thing is, in most data warehouse scenarios, data will change. Our job is to grab hold of that change, ingest it, transform it, and store it in a way that fits a static model that does not always change. In cases like those, suggesting that you spin up an entirely new data engine that is unfamiliar to the business is a bad idea, especially if the data is only 1 MB every week for the next 2 years.

    There are plenty of tools that can deal with data that is structured--that follows a certain set of rules--but can shift. SQL Server has a few, and I listed Python as another. Ideally, the best approach, outside of interpreting the JSON, is just getting everyone consistent with the chosen data formats they know and love--not a complex new service that has to be managed, or even supported by a new hire who is familiar with it.

  • My thoughts: if it's a set structure that needs to be queried often, then deserialize it into relational form and store it properly. SQL Server works better with traditional row/column storage at volume.

    If it's rarely queried, then I think unstructured storage (JSON/XML/etc) inside a SQL varchar/varbinary is fine.

    If it's low volume, do what's easiest. It won't matter enough.

  • Steve Jones - SSC Editor (7/6/2016)


    If it's low volume, do what's easiest. It won't matter enough.

    Heh... unless the low volume data also happens to be the most critical.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • xsevensinzx (7/6/2016)


    Eric M Russell (7/6/2016)


    Although you can technically store unstructured data in SQL Server, the question is: Why would you?

    If you're ultimately trying to import facts and dimensions (as properly defined by the business) from JSON or XML into a relational or star-schema modeled database, the thing you have to understand is that the schema of your source data is going to be very fluid. The app developers are going to be giving you constant headaches with this. Personally, I would investigate using MongoDB as the primary persistence layer for JSON objects. You can also leverage 3rd-party SSIS data flow tasks for performing ETL from MongoDB into SQL Server as needed.

    Where are you getting that it's unstructured? JSON is highly structured. It's just able to twist, bend, and mold itself around the data, altering its structure to accept change. That doesn't mean it isn't structured; its structure is just not consistent. Unstructured data is things like Word documents, email messages, and so forth, where the structure follows no rules. JSON follows rules. You as the DBA just don't have all the tools, or maybe the easiest approach, to deal with those rules, because they allow something static to become dynamic (i.e., the schema shifts).

    The thing is, in most data warehouse scenarios, data will change. Our job is to grab hold of that change, ingest it, transform it, and store it in a way that fits a static model that does not always change. In cases like those, suggesting that you spin up an entirely new data engine that is unfamiliar to the business is a bad idea, especially if the data is only 1 MB every week for the next 2 years.

    There are plenty of tools that can deal with data that is structured--that follows a certain set of rules--but can shift. SQL Server has a few, and I listed Python as another. Ideally, the best approach, outside of interpreting the JSON, is just getting everyone consistent with the chosen data formats they know and love--not a complex new service that has to be managed, or even supported by a new hire who is familiar with it.

    Ha! I'm getting "unstructured" from the title of the original post, and the OP also said he has no specifications for the data. At this point he's "trying to make life easy for the developers". If what the end users need is a temporary persistence layer for their objects, something like a shopping cart for an eCommerce application, then that data doesn't belong in the data warehouse. Maybe these files can add some semantic value to the data warehouse, or maybe not.

    It also depends on whether the end users need to retrieve the data in its original JSON format and how transient the data is. It could be that the best solution is simply to stage the documents to a network folder as .json files with the session ID as the file name. That's very efficient in terms of I/O, it's scalable, and it's flexible.
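
    A minimal sketch of that staging pattern in Python 3 (the network share, session IDs, and payloads are all hypothetical):

    import json
    from pathlib import Path

    # Hypothetical staging share and incoming payloads keyed by session ID.
    staging = Path(r"\\fileserver\dw_staging\json")
    staging.mkdir(parents=True, exist_ok=True)

    incoming = {"a1b2c3": {"cart": ["widget", "gadget"]},
                "d4e5f6": {"cart": ["sprocket"]}}

    for session_id, payload in incoming.items():
        (staging / (session_id + ".json")).write_text(json.dumps(payload))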

    Let the business define where they want to go with all this before proceeding. As an analogy, if you're the manager of a bakery and corporate HQ drops an airplane propeller on the loading dock with no specific instructions, do you drag it inside the kitchen, or do you just leave it on the loading dock pending further information?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • JSON is really easy to shred in PowerShell (above version 2). That being said, if the data source isn't consistent, you'll always have problems.

  • Eric M Russell (7/6/2016)


    xsevensinzx (7/6/2016)


    Eric M Russell (7/6/2016)


    Although you can technically store unstructured data in SQL Server, the question is: Why would you?

    If you're ultimately trying to import facts and dimensions (as properly defined by the business) from JSON or XML into a relational or star-schema modeled database, the thing you have to understand is that the schema of your source data is going to be very fluid. The app developers are going to be giving you constant headaches with this. Personally, I would investigate using MongoDB as the primary persistence layer for JSON objects. You can also leverage 3rd-party SSIS data flow tasks for performing ETL from MongoDB into SQL Server as needed.

    Where are you getting that it's unstructured? JSON is highly structured. It's just able to twist, bend, and mold itself around the data, altering its structure to accept change. That doesn't mean it isn't structured; its structure is just not consistent. Unstructured data is things like Word documents, email messages, and so forth, where the structure follows no rules. JSON follows rules. You as the DBA just don't have all the tools, or maybe the easiest approach, to deal with those rules, because they allow something static to become dynamic (i.e., the schema shifts).

    The thing is, in most data warehouse scenarios, data will change. Our job is to grab hold of that change, ingest it, transform it, and store it in a way that fits a static model that does not always change. In cases like those, suggesting that you spin up an entirely new data engine that is unfamiliar to the business is a bad idea, especially if the data is only 1 MB every week for the next 2 years.

    There are plenty of tools that can deal with data that is structured--that follows a certain set of rules--but can shift. SQL Server has a few, and I listed Python as another. Ideally, the best approach, outside of interpreting the JSON, is just getting everyone consistent with the chosen data formats they know and love--not a complex new service that has to be managed, or even supported by a new hire who is familiar with it.

    Ha! I'm getting "unstructured" from the title of the original post, and the OP also said he has no specifications for the data. At this point he's "trying to make life easy for the developers". If what the end users need is a temporary persistence layer for their objects, something like a shopping cart for an eCommerce application, then that data doesn't belong in the data warehouse. Maybe these files can add some semantic value to the data warehouse, or maybe not.

    It also depends on whether the end users need to retrieve the data in its original JSON format and how transient the data is. It could be that the best solution is simply to stage the documents to a network folder as .json files with the session ID as the file name. That's very efficient in terms of I/O, it's scalable, and it's flexible.

    Let the business define where they want to go with all this before proceeding. As an analogy, if you're the manager of a bakery and corporate HQ drops an airplane propeller on the loading dock with no specific instructions, do you drag it inside the kitchen, or do you just leave it on the loading dock pending further information?

    Sorry, I jumped off the deep end a bit there. I'm a bit frustrated today with other things not related to this.

    I would totally shoot for consistency and tackle the JSON as is, unless there's a good option to get a format more to your liking. While you can handle dynamic changes, especially in scenarios where the fields are fairly consistent and the order, or the addition of new ones, is the only dynamic portion, having an idea of what's coming is pretty critical all around, and that has nothing to do specifically with JSON. It happens with any structured data source. That's ultimately what I was trying to say.
