I’m told by friends who provide enterprise-level hardware for large applications that they are seeing a significant trend away from the huge ‘database-scale’ hardware behemoths towards commodity hardware. NoSQL database systems such as MongoDB are being selected despite their relative immaturity, simply because they are so easy to justify on grounds of cost-saving. They can use shedloads of commodity hardware clusters, and so can scale up far more cheaply: it can cost ten times as much to expand the capacity of a relational system.
Of course, reality is different, because there are other costs. The typical NoSQL solutions have been refined for web applications rather than for busy OLTP tasks. They are best suited to offline processing and analysis of large-scale data that doesn’t change much, and where the underlying schema hasn’t been figured out. They lack the facilities required for commerce, such as transactionality. All the services that SQL provides, if they are required, must now be written at the application level by the developers, whether that is aggregation, joins, scheduled jobs, ETL or transactions. This is all likely to add costs into the equation.
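To see what ‘written at the application level’ means in practice, here is a minimal sketch, assuming a document store with no join or grouping support: the developer must hand-code what a single SQL statement (`SELECT c.name, SUM(o.total) FROM orders o JOIN customers c ON o.customer_id = c.id GROUP BY c.name`) would express declaratively. The collections and field names are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical document collections, as they might come back
# from a schemaless store that offers no joins or aggregation.
orders = [
    {"customer_id": 1, "total": 25.0},
    {"customer_id": 1, "total": 10.0},
    {"customer_id": 2, "total": 40.0},
]
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
]

# Application-level GROUP BY: sum order totals per customer.
totals = defaultdict(float)
for order in orders:
    totals[order["customer_id"]] += order["total"]

# Application-level JOIN: attach customer names to the aggregates.
names = {c["id"]: c["name"] for c in customers}
report = {names[cid]: amount for cid, amount in totals.items()}

print(report)  # {'Ann': 35.0, 'Bob': 40.0}
```

Every such hand-rolled join or aggregate is code that must be written, tested and maintained, which is exactly where the hidden cost creeps in.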
Abandoning SQL can be painful for other reasons. Working with MapReduce, for example, is a culture shock to any SQL developer, because it is more like writing execution plans by hand. Whereas with SQL you just declare what result you want, no matter what the size and distribution of the data, MapReduce requires hand-crafted strategies that must be refined as the data changes. I stopped doing that sort of thing over twenty years ago.
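The contrast can be made concrete with the classic word-count example. A SQL developer would write something like `SELECT word, COUNT(*) FROM words GROUP BY word` and let the engine plan it; in the MapReduce style, the map, shuffle and reduce stages are all spelled out by hand, as in this illustrative Python sketch:

```python
from collections import defaultdict

documents = ["sql is declarative", "mapreduce is manual", "sql scales"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key -- the step a SQL engine
# would plan and execute for you behind the scenes.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each key.
counts = {word: sum(values) for word, values in grouped.items()}

print(counts["sql"])  # 2
```

Each of those stages is a strategy decision the developer owns, and each may need reworking when the data volume or distribution shifts.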
Could the answer be to add SQL to ‘Big Data’ systems such as Hadoop? There are already many SQL-on-Hadoop solutions, such as Hive, Impala, Stinger and Hadapt. In a new twist, the Hadoop stack is now host to a full-featured SQL database called ‘Splice Machine’, capable of running transactions and analytics simultaneously.
The clever thing about Splice Machine, which uses Apache Derby as its database engine, is that the SQL planner and optimizer have been extended to create execution plans that can make use of Hadoop’s parallel architecture, so that processing is done at the individual nodes. The intermediate data is then spliced back together to create the result.
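The splice-back-together idea can be sketched in miniature: each node runs a partial aggregate over its own shard of the data, and a coordinator merges the partials into the final result. This is a generic scatter-gather illustration, not Splice Machine’s actual implementation; the shard layout and names are invented.

```python
from collections import Counter

# Three shards of a sales table, as they might sit on three nodes.
shards = [
    [("uk", 1), ("us", 2)],
    [("uk", 3)],
    [("us", 4), ("de", 5)],
]

def node_partial(shard):
    """Per-node work: a local GROUP BY / SUM over the node's own rows."""
    partial = Counter()
    for region, amount in shard:
        partial[region] += amount
    return partial

# Coordinator: 'splice' the per-node partials into one result set.
result = Counter()
for partial in map(node_partial, shards):
    result.update(partial)  # Counter.update adds counts together

print(dict(result))  # {'uk': 4, 'us': 6, 'de': 5}
```

Because only small partial aggregates travel to the coordinator rather than raw rows, the expensive work stays parallel on the nodes, which is what makes the commodity-hardware economics attractive.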
If the early signs are right, and we can now run a SQL-based relational database with distributed execution plans over commodity hardware, leaving just the task of splicing the results together to the engine itself, then why can’t Microsoft or Oracle do it? The dream of a linear cost for scaling a relational database now seems attainable, and it looks like the open source community may have somehow got there first.