• Darren Wallace (11/20/2012)


    Ultimately any task where you care more about high speed and low cost than you do about consistency is a great candidate for NoSQL.

    That pretty much sums it up, but I'll throw in my 2c anyway 🙂

    Recommendations & time-series/forecasting: These are data-mining tasks where you can certainly use an RDBMS as a data store. Both of these tasks can be done using built-in algorithms in SSAS projects, with data sourced straight from SQL Server or, if you like, from a cube. Other data-mining packages like SAS, SPSS, Statistica, R, and even tools like GGobi or Weka expect their input data to be structured, at the very least as delimited flat files. If you are feeding unstructured data into these tools you will first need to do a lot of structuring! Pre-preparation of data is 99% of the work in data mining: binning, grouping, binarising... these are all essentially structuring techniques that are compulsory for some algorithms. You can do that from NoSQL, sure, but you can't get away from the need to first structure the data. What can you really do with unstructured data? Store it, view it; that's about all. Interpreting the data on the fly still counts as structuring.
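    To make the point concrete, here's a minimal sketch of two of those structuring steps (binning and binarising) in plain Python. The field names and bin edges are made up for illustration; a real pipeline would use something like pandas or an SSIS transform, but the idea is the same:

```python
def bin_value(value, edges):
    """Return the index of the bin that `value` falls into (binning)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def binarise(category, categories):
    """One-hot encode a categorical value (binarising)."""
    return [1 if category == c else 0 for c in categories]

# Raw, semi-structured records, as they might come out of a NoSQL store:
records = [
    {"age": 23, "segment": "retail"},
    {"age": 47, "segment": "wholesale"},
    {"age": 71, "segment": "retail"},
]

age_edges = [30, 50, 65]          # bins: <30, 30-49, 50-64, 65+
segments = ["retail", "wholesale"]

# Structured rows, ready for a mining algorithm or a delimited flat file:
rows = [
    [bin_value(r["age"], age_edges)] + binarise(r["segment"], segments)
    for r in records
]
print(rows)  # every row is now fixed-width and purely numeric
```

    Notice that the output is exactly the kind of rectangular, typed data an RDBMS table (or a flat file for SAS/R/Weka) holds natively; the structuring has to happen somewhere regardless of where the raw data lives.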

    Admittedly, Amazon is the king of the "recommendation", and they use their own NoSQL datastore, but when they were getting started distributed RDBMSs were fairly primitive, so it's no wonder they didn't use something off the shelf. When you are a big company that specialises in big data and derives immense value directly from the clever handling of that data, then you need proprietary competitive advantage and can justify the cost.

    Their EC2 service now lets you spin up big clusters of SQL Server (and other RDBMSs), and I wonder why they would do that if NoSQL were the only way. Kimball's whitepaper on big data suggests that when you are "drinking from the firehose" (I love that visceral imagery), NoSQL makes sense as an initial data store. With Microsoft investigating Hadoop, I can see an extended Kimball process of capture to NoSQL --> ETL (with structuring) to RDBMS --> BI/reporting/analytics making sense for most companies. If you want to skip the middle bit then you'll no doubt be using a team of Java gurus and rolling your own analytics and reporting layer, and that's where it's going to cost a lot. I know it sounds counter-intuitive to say the shorter process will cost more, but I think Microsoft will shortly provide easy tools to do ETL from NoSQL to RDBMS as a staging process. They might even plug BI tools directly into NoSQL, but I find that less likely; the analogy is doing BI directly from OLTP, which is problematic and has been well covered. Again it comes back to latency, how real-time you need to be, and how much that's all worth to you.
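    The "middle bit" of that process is just schema-on-read becoming schema-on-write. A toy sketch, assuming the NoSQL capture lands as ragged JSON documents and the target is a fixed-column staging table (the document shapes and column names here are invented for illustration):

```python
import json

# Hypothetical NoSQL capture: ragged JSON documents straight off the firehose.
captured = [
    '{"user": "a1", "event": "click", "meta": {"page": "/home"}}',
    '{"user": "b2", "event": "buy", "amount": 19.99}',
]

# The schema we enforce on the way into the RDBMS; absent fields become NULLs.
columns = ["user", "event", "amount"]

def to_row(doc):
    """Flatten one document to a fixed-width tuple (the structuring step)."""
    record = json.loads(doc)
    return tuple(record.get(col) for col in columns)

rows = [to_row(d) for d in captured]
print(rows)  # tuples ready to bulk-insert into a staging table
```

    Everything a document carries that the schema doesn't name (like the nested "meta" above) is dropped or diverted at this step, which is exactly the latency/fidelity trade-off the Kimball-style staging approach accepts.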

    LDAP: LDAP queries are not fun. Having tried to query AD via a linked server using both SQL and LDAP syntax I can tell you that it is like pulling teeth. I'd use some existing package for that any day. That's not quite the same thing as storing ACL data in an RDBMS though.

    The media repository argument is interesting. The way blobs are handled in RDBMSs seems to flip-flop between storing them in tables and storing pointers to file system objects, and there doesn't seem to be any agreement about the best way to do it. I suppose something closer to the filesystem naturally makes sense, but it's all about how you index and search for them. I'm interested to see what develops there.