The RDBMS is Often the Right Tool

  • Comments posted to this topic are about the item The RDBMS is Often the Right Tool

  • What really gets me is that so many of the NoSQL databases now offer an SQL interface. I have seen folks go to great lengths to avoid using a relational database yet the end result could be easily served by one. One project I was involved with was on AWS with highly structured delimited files in S3 and used Glue and SparkSQL on top of that to enable them to "query" the data. The source data were mostly from other relational databases (SQL Server, Oracle and DB2) which they dumped into S3 with a file per table. They wrote Python scripts to ETL the data into other S3 files that were also highly structured and documented with Glue so SparkSQL could query them. They did everything to avoid using PostgreSQL and RedShift yet wound up creating a very complex and convoluted system that approximated a database. Their arguments for not using an RDBMS was that it needed to be "scalable and fast" yet was really neither.

  • They do have a SQL interface, and I'm not sure if this is good or bad. It might be that SQL is just a better way of querying, or at least, more familiar. If they continue to use the underlying graph query, then things might perform better. Though, if we're querying via SQL, the complexity returns.

    Many have added additional indexing as well, to get around the problems of querying multiple entities and instances. That's where NoSQL type structures struggled. They were much, much better for singleton type queries, but struggled with aggregation. I think some of the indexing now reduces scalability, which to me, means the developers ought to have learned more about RDBMS work in the first place rather than struggling against it.

    I do know this might be less about tech and more about people (data professionals) being difficult with rapidly requested changes.

  • I agree that NoSQL environments have their niches and work well within those, I believe that too many problems arise when folks try to use NoSQL environments for problems that are easily solved with an RDBMS. My dad always taught me to "use the right tool for the job" and I am convinced that applies to data projects as well. We should use NoSQL only where appropriate just as we should use RDBMS only where appropriate. I have seen folks use RDBMS systems for tasks that should have been handled by NoSQL platforms. Just because you CAN use a tool for something doesn't mean it's best for it. Using an RDBMS to track documents is like using a screwdriver for a chisel. Not really the best tool because if it works at all it will be sub-par at best.

  • I think the first thing to consider is that as Steve pointed out most of us aren't operating at the Google/Facebook/Spotify/Netflix scale. It might also help to realize that it isn't always scale. For instance, web pages have only a "sort of" schema. What schema is defined isn't enforced with browsers being extremely adept at displaying in spite of violation of what should (or shouldn't) be present. In the document world, the lack of hard schema makes a lot of sense. For that matter, if I were dealing with something like Facebook I also might look for something outside the traditional SQL database.

    Probably one should start in thinking about how their data relates to other data. Most situations have some defined relationships that follow reasonably easy rules. Further, most of the time the easier on developers is only true at the early stages. Later, the lack of structure comes home to roost and the developer moves to another area pretending they didn't leave a mess behind. They then get to complain about the mess the other developer (just like them) left behind.

  • Aaron N. Cutshall - Wednesday, February 27, 2019 8:22 AM

    What really gets me is that so many of the NoSQL databases now offer an SQL interface. I have seen folks go to great lengths to avoid using a relational database yet the end result could be easily served by one. One project I was involved with was on AWS with highly structured delimited files in S3 and used Glue and SparkSQL on top of that to enable them to "query" the data. The source data were mostly from other relational databases (SQL Server, Oracle and DB2) which they dumped into S3 with a file per table. They wrote Python scripts to ETL the data into other S3 files that were also highly structured and documented with Glue so SparkSQL could query them. They did everything to avoid using PostgreSQL and RedShift yet wound up creating a very complex and convoluted system that approximated a database. Their arguments for not using an RDBMS was that it needed to be "scalable and fast" yet was really neither.

    Sigh, not necessarily. Having that SQL interface on top of the document stores allows us to have full transparency of the data as it lands on disk on the way up to a gold standard like the data warehouse. It's insanely powerful and something some RDBMS cannot offer in terms of opening doorways into the raw data BEFORE it's cleaned and processed by the ETL regardless if it's Python or SSIS. If you want to talk about being lean, this is what helps you get there.

    It also offers a way to use the data BEFORE it's completely processed by the SQL/DBA/Whatever team, which can often have a long turnaround time. For example, all of my data is ingested using Python as documents to a document store. As soon as the data lands on disk, it's accessible to the end user via these SQL Interfaces you're smashing. It's also accessible to the DBA team via Polybase with EXTERNAL TABLES both with the data warehouse and all the data marts using Azure DB (which also supports EXTERNAL QUERIES to the data warehouse, which allows you to bridge the data lake). This is insanely powerful and allows users to start working with the data before the technical teams can fully integrate said data into a proper data warehouse.

    I can't say all of this is due to scalability. It's more due to accessibility and being lean with the data. One of the most common pain points with the RDBMS is it's too restricted, and for good reason. With solutions above, it's often opening up the floodgates to the rest of the organization to start getting dirty with the data in order to help the business understand what they need to do with the data BEFORE it goes to the data warehouse. Now, I can't speak to the reasons why some organizations skip using a data warehouse versus others, that's a bit meh, but smashing the idea of exposing the data outside of the data warehouse is a bit well, silly if you ask me.

  • xsevensinzx - Wednesday, February 27, 2019 6:28 PM

    Sigh, not necessarily. Having that SQL interface on top of the document stores allows us to have full transparency of the data as it lands on disk on the way up to a gold standard like the data warehouse. It's insanely powerful and something some RDBMS cannot offer in terms of opening doorways into the raw data BEFORE it's cleaned and processed by the ETL regardless if it's Python or SSIS. If you want to talk about being lean, this is what helps you get there.

    It also offers a way to use the data BEFORE it's completely processed by the SQL/DBA/Whatever team, which can often have a long turnaround time. For example, all of my data is ingested using Python as documents to a document store. As soon as the data lands on disk, it's accessible to the end user via these SQL Interfaces you're smashing. It's also accessible to the DBA team via Polybase with EXTERNAL TABLES both with the data warehouse and all the data marts using Azure DB (which also supports EXTERNAL QUERIES to the data warehouse, which allows you to bridge the data lake). This is insanely powerful and allows users to start working with the data before the technical teams can fully integrate said data into a proper data warehouse.

    I can't say all of this is due to scalability. It's more due to accessibility and being lean with the data. One of the most common pain points with the RDBMS is it's too restricted, and for good reason. With solutions above, it's often opening up the floodgates to the rest of the organization to start getting dirty with the data in order to help the business understand what they need to do with the data BEFORE it goes to the data warehouse. Now, I can't speak to the reasons why some organizations skip using a data warehouse versus others, that's a bit meh, but smashing the idea of exposing the data outside of the data warehouse is a bit well, silly if you ask me.

    Please don't misunderstand me. I have no problem with the use cases you mentioned nor do I have an issue with an SQL interface on these environments. The value in getting the data loaded and investigating what's there and what can be done is quite invaluable. The issue I have is when those techniques are implemented in a production environment in lieu of a relational database when the data being loaded is relational (most come from another RDBMS) and the tools are implemented to operate as though it were an RDBMS. It was set up like a Rube-Goldberg machine with a complex set of moving parts all to avoid using a database. That appeared to be a requirement as the conversations kept circling around their "Big Data" solution. The volume of data was not that large that it couldn't have easily set in a Postgres database (assuming they had to stay in the AWS space). The approach was decided upon long before any data analysis was performed. When the Enterprise Architect and I kept asking why we couldn't just use a database instead of the complicated process and we were told that our suggestion wasn't the "Big Data" way. Time to implementation wasn't a concern because it's taken over 2 years to implement the solution using their framework.

    Clearly the platform was chosen long before any analysis was done to determine the correct solution. This leads to all sorts of problems. I have witnessed a project that failed miserably because SQL Server was used for document management instead of a more appropriate platform like MongoDB. Each platform has its strengths and weaknesses, what it does well and what it does not do well. Using the wrong tool for the job only invites mediocre results at best.

  • Aaron N. Cutshall - Wednesday, February 27, 2019 7:58 PM

    The issue I have is when those techniques are implemented in a production environment in lieu of a relational database when the data being loaded is relational (most come from another RDBMS) and the tools are implemented to operate as though it were an RDBMS.

    I understand that, but you have mentioned this numerous times in similar threads. The techniques has nothing to do with how good or bad the technology is nor does it mean it's the wrong tool for the job. Anyone can misuse a hammer when trying to hammer in a nail to a board. It does not mean the hammer is not a good tool or the right tool for the job.

    The same also applies for other tools. I have seen plenty of people misuse their RDBMS. Does not mean it's a bad tool or the wrong tool.

    I know you followed up saying similar to Steve, but just have to point that out because it seems to be a constant theme here on how people use these tools. Yes, people use them differently. Some good, some bad. The RDBMS is not the end all be all solution nor is it the only tool in the toolbox. And with that, I would go on to say that the RDBMS is still here to stay and has a critical part to these other tools for the sheer fact it does what it was designed to do so well IF the operators behind the tool know what they are doing.

  • Aaron N. Cutshall - Wednesday, February 27, 2019 8:22 AM

    What really gets me is that so many of the NoSQL databases now offer an SQL interface. I have seen folks go to great lengths to avoid using a relational database yet the end result could be easily served by one. One project I was involved with was on AWS with highly structured delimited files in S3 and used Glue and SparkSQL on top of that to enable them to "query" the data. The source data were mostly from other relational databases (SQL Server, Oracle and DB2) which they dumped into S3 with a file per table. They wrote Python scripts to ETL the data into other S3 files that were also highly structured and documented with Glue so SparkSQL could query them. They did everything to avoid using PostgreSQL and RedShift yet wound up creating a very complex and convoluted system that approximated a database. Their arguments for not using an RDBMS was that it needed to be "scalable and fast" yet was really neither.

    Why does it surprise you that they "now" offer an SQL interface? - remember NOSQL is short for Not Only SQL - that should make it clear why they not only may but rather must offer an SQL interface.

  • It was set up like a Rube-Goldberg machine with a complex set of moving parts all to avoid using a database.

    Theres the subtle implication that you can't build rube Goldberg machines with SQL and I think that's a bit of a stretch. (source, personal experience)

  • xsevensinzx - Wednesday, February 27, 2019 11:35 PM

    I understand that, but you have mentioned this numerous times in similar threads. The techniques has nothing to do with how good or bad the technology is nor does it mean it's the wrong tool for the job. Anyone can misuse a hammer when trying to hammer in a nail to a board. It does not mean the hammer is not a good tool or the right tool for the job.

    The same also applies for other tools. I have seen plenty of people misuse their RDBMS. Does not mean it's a bad tool or the wrong tool.

    I know you followed up saying similar to Steve, but just have to point that out because it seems to be a constant theme here on how people use these tools. Yes, people use them differently. Some good, some bad. The RDBMS is not the end all be all solution nor is it the only tool in the toolbox. And with that, I would go on to say that the RDBMS is still here to stay and has a critical part to these other tools for the sheer fact it does what it was designed to do so well IF the operators behind the tool know what they are doing.

    You're right that I didn't cover poor uses of an appropriate tool as I also have seen that many times. Thanks for pointing that out as an issue as well.

  • crmitchell - Thursday, February 28, 2019 2:41 AM

    Why does it surprise you that they "now" offer an SQL interface? - remember NOSQL is short for Not Only SQL - that should make it clear why they not only may but rather must offer an SQL interface.

    Originally NoSQL meant just that, no relational database and hence no SQL. It's quite true that mentality has changed to include SQL interfaces and the acronym has bee changed to "Not Only SQL". I just think the irony of the about face is funny.

  • patrickmcginnis59 10839 - Thursday, February 28, 2019 6:52 AM

    Theres the subtle implication that you can't build rube Goldberg machines with SQL and I think that's a bit of a stretch. (source, personal experience)

    My apologies as no such implication was intended. As mentioned previously, correct tools can be used incorrectly as well with equally poor results.

  • Aaron N. Cutshall - Thursday, February 28, 2019 8:18 AM

    crmitchell - Thursday, February 28, 2019 2:41 AM

    Why does it surprise you that they "now" offer an SQL interface? - remember NOSQL is short for Not Only SQL - that should make it clear why they not only may but rather must offer an SQL interface.

    Originally NoSQL meant just that, no relational database and hence no SQL. It's quite true that mentality has changed to include SQL interfaces and the acronym has bee changed to "Not Only SQL". I just think the irony of the about face is funny.

    The NoSQL started because some folks thought they had a need and on the NoSQL side they're trying new things to address these things. Bringing in facets of SQL shouldn't be a source of ridicule, but honestly if folks want to call reconsiderations of technical direction "ironic" and "funny" then that's fine, its much easier to point at failures than it is to actually try new things and fail. Not every attempt at invention succeeds, but shouldn't these attempts still be made in any case?

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply