• Aaron N. Cutshall - Wednesday, February 27, 2019 8:22 AM

    What really gets me is that so many of the NoSQL databases now offer an SQL interface. I have seen folks go to great lengths to avoid using a relational database yet the end result could be easily served by one. One project I was involved with was on AWS with highly structured delimited files in S3 and used Glue and SparkSQL on top of that to enable them to "query" the data. The source data were mostly from other relational databases (SQL Server, Oracle and DB2) which they dumped into S3 with a file per table. They wrote Python scripts to ETL the data into other S3 files that were also highly structured and documented with Glue so SparkSQL could query them. They did everything to avoid using PostgreSQL and RedShift yet wound up creating a very complex and convoluted system that approximated a database. Their arguments for not using an RDBMS was that it needed to be "scalable and fast" yet was really neither.

    Sigh, not necessarily. Having that SQL interface on top of the document stores allows us to have full transparency of the data as it lands on disk on the way up to a gold standard like the data warehouse. It's insanely powerful and something some RDBMS cannot offer in terms of opening doorways into the raw data BEFORE it's cleaned and processed by the ETL regardless if it's Python or SSIS. If you want to talk about being lean, this is what helps you get there.

    It also offers a way to use the data BEFORE it's completely processed by the SQL/DBA/Whatever team, which can often have a long turnaround time. For example, all of my data is ingested using Python as documents to a document store. As soon as the data lands on disk, it's accessible to the end user via these SQL Interfaces you're smashing. It's also accessible to the DBA team via Polybase with EXTERNAL TABLES both with the data warehouse and all the data marts using Azure DB (which also supports EXTERNAL QUERIES to the data warehouse, which allows you to bridge the data lake). This is insanely powerful and allows users to start working with the data before the technical teams can fully integrate said data into a proper data warehouse.

    I can't say all of this is due to scalability. It's more due to accessibility and being lean with the data. One of the most common pain points with the RDBMS is it's too restricted, and for good reason. With solutions above, it's often opening up the floodgates to the rest of the organization to start getting dirty with the data in order to help the business understand what they need to do with the data BEFORE it goes to the data warehouse. Now, I can't speak to the reasons why some organizations skip using a data warehouse versus others, that's a bit meh, but smashing the idea of exposing the data outside of the data warehouse is a bit well, silly if you ask me.