• xsevensinzx - Wednesday, February 27, 2019 6:28 PM

    Sigh, not necessarily. Having that SQL interface on top of the document stores gives us full transparency of the data from the moment it lands on disk, all the way up to a gold standard like the data warehouse. It's insanely powerful and something most RDBMS setups cannot offer: it opens doorways into the raw data BEFORE it's cleaned and processed by the ETL, regardless of whether that ETL is Python or SSIS. If you want to talk about being lean, this is what helps you get there.

    It also offers a way to use the data BEFORE it's completely processed by the SQL/DBA/whatever team, which can often have a long turnaround time. For example, all of my data is ingested using Python as documents into a document store. As soon as the data lands on disk, it's accessible to the end user via these SQL interfaces you're smashing. It's also accessible to the DBA team via Polybase with EXTERNAL TABLES, both from the data warehouse and from all the data marts on Azure DB (which also supports external queries to the data warehouse, letting you bridge the data lake). This is insanely powerful and lets users start working with the data before the technical teams can fully integrate it into a proper data warehouse.
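    For anyone who hasn't seen that pattern, here's a minimal sketch of what the Polybase side can look like. Every name here (the data source, file format, table, paths, and columns) is hypothetical, and the exact DDL options vary by engine and version (e.g. some versions require TYPE and CREDENTIAL clauses), so treat this as an illustration, not a recipe:

    ```sql
    -- Hypothetical names throughout; adjust to your environment.
    -- 1. Point the warehouse at the lake location where Python lands the documents.
    CREATE EXTERNAL DATA SOURCE LakeLanding
    WITH (
        LOCATION = 'abfss://landing@mystorageacct.dfs.core.windows.net'
        -- plus a CREDENTIAL clause, depending on your storage setup
    );

    -- 2. Describe how the raw files are serialized.
    CREATE EXTERNAL FILE FORMAT RawLines
    WITH (FORMAT_TYPE = DELIMITEDTEXT);  -- format options differ by engine/version

    -- 3. Expose the raw landing files as a queryable table,
    --    before any ETL has touched them.
    CREATE EXTERNAL TABLE dbo.RawOrders (
        OrderId VARCHAR(50),
        Payload VARCHAR(8000)
    )
    WITH (
        LOCATION    = '/orders/',
        DATA_SOURCE = LakeLanding,
        FILE_FORMAT = RawLines
    );

    -- End users can now query the raw data the moment it lands:
    SELECT TOP (10) * FROM dbo.RawOrders;
    ```

    The point is that nothing has to move: the external table is just a window onto the files in the lake, so the DBA team and the end users see the same raw data at the same time.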

    I can't say all of this is due to scalability. It's more about accessibility and being lean with the data. One of the most common pain points with the RDBMS is that it's too restricted, and for good reason. With the solutions above, you're opening the floodgates so the rest of the organization can start getting dirty with the data and help the business understand what they need to do with it BEFORE it goes to the data warehouse. Now, I can't speak to why some organizations skip the data warehouse entirely, that's a bit meh, but smashing the idea of exposing data outside of the data warehouse is a bit, well, silly if you ask me.

    Please don't misunderstand me. I have no problem with the use cases you mentioned, nor do I have an issue with an SQL interface on these environments. Getting the data loaded and investigating what's there and what can be done with it is invaluable. The issue I have is when those techniques are implemented in a production environment in lieu of a relational database, when the data being loaded is relational (most of it comes from another RDBMS) and the tools are set up to operate as though they were an RDBMS. It was built like a Rube Goldberg machine, a complex set of moving parts all to avoid using a database. That appeared to be a requirement, as the conversations kept circling back to their "Big Data" solution. The volume of data was not so large that it couldn't have easily sat in a Postgres database (assuming they had to stay in the AWS space). The approach was decided upon long before any data analysis was performed. When the Enterprise Architect and I kept asking why we couldn't just use a database instead of the complicated process, we were told that our suggestion wasn't the "Big Data" way. Time to implementation apparently wasn't a concern, given it's taken over 2 years to implement the solution using their framework.

    Clearly the platform was chosen long before any analysis was done to determine the correct solution, and that leads to all sorts of problems. I have witnessed a project fail miserably because SQL Server was used for document management instead of a more appropriate platform like MongoDB. Each platform has its strengths and weaknesses: things it does well and things it does not. Using the wrong tool for the job invites mediocre results at best.