A Tool for the Job

  • Comments posted to this topic are about the item A Tool for the Job

    Best wishes,
    Phil Factor

  • Search is one of those things where you'd expect an RDBMS to be much stronger than it actually is. Any form of data repository is only as good as the mechanisms by which you can find and retrieve stuff.

    I haven't had time to experiment with SQL2012 Semantic Search yet, but certainly up to SQL2008R2 full-text search didn't quite do what I wanted it to.

    The other point is: are the products that we think of as RDBMSs just RDBMSs?

    Yes, they retain that core strength, but are they now becoming metadata management systems as well?

    I agree with the author on the following:

    • Log Analysis
    • Email :w00t:
    • ACL
    • High Frequency Trading - that's why M$ StreamInsight is so compelling

    I think the following fall under the "It Depends" category:

    • Search - lots of pain with external search and keeping it up to date
    • Media Repository - Remote Blob Storage
    • Recommendations - it really depends on what you are trying to do
    • Product Catalog - I didn't understand his point

    The rest of it I don't have an opinion on.

    As a general comment, I've had developers say to me "this can't be done in SQL Server" or "SQL Server isn't good at doing 'x'". When I've looked at the particular problems, not only could it be done, but it could be done well; it was just that they lacked the knowledge or experience to solve their problem.

    One thing you have to watch out for with NoSQL is the hidden costs. Yes, these products can scale to Jupiter and back, but expect to have to scale out much earlier, and with the associated costs.

    Obviously you can run your systems in the cloud but that decision brings its own complexities.

  • Looking at the list from an SQL Server RDBMS view:

    Search: I agree that there are products that are much better than the built-in FT engine in SQL Server, but if your FT needs are simple (basic stemming and ranking) then the built-in one is good enough. I have played a little with Apache Solr and it does have a LOT more functionality (better stemming, faceting, ranking, typo correction, suggestions etc.) than SQL Server FT, but it takes a lot more time to configure (xml files, xml files, xml files). Then there is the issue of keeping an external FT engine in sync with the RDBMS.

    Time series/high-frequency logging: Definitely not a job for an RDBMS. Sure, you can get decent performance for the actual logging with a small number of tags, but when it comes to reading this data and performing calculations on it, performance goes down the drain. A product like OSIsoft PI can easily store tens, even hundreds of thousands of values a second on reasonable hardware. It is designed for this purpose only, and it does it well.

    Storing large amounts of unstructured data: Agreed. Not something an RDBMS does well.

    ACL: This is actually a bit funny. We implemented ACL (groups and accounts synced from AD with FIM every five minutes) on our internally developed document management system (running SQL Server 2005) about nine months ago.

    The requirements: A user should be able to grant or deny permissions (read, write, read ACL, write ACL) to multiple groups or users on any case, document or file (not a real file, just our internal representation of a file). A document belongs to a case, and a file to a revision of a document. Permissions should be inherited from case to document to file if the "inherit permissions" flag on the document or file is set. Both direct and indirect group memberships should work. The hardest part: performance from a user's view must not be affected, and the ACL inheritance from case to document to file should happen instantly.

    An ACL change on a case can propagate to as many as 50,000 documents and a few hundred thousand files. How can this be done without affecting performance (again, from a user's view)? We didn't want the user to wait until the ACL had propagated to all objects that inherit from this object. Have you ever tried making a change to a root folder with thousands of subfolders and files in Windows? It takes forever.

    The solution: We decided to use Service Broker. When the ACL on a case is modified, we simply send a message with the case id (or document id) to a Service Broker service. The message is processed and the document ACLs to be updated are split into batches of 100 docs. When a batch of document ACLs is modified, more messages are sent and processed to update the ACLs on the files. This solved the user performance problem, and the "instantly" requirement was met well enough: propagating the ACL to 50,000 documents and their files takes only a few seconds.
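    To show the shape of that design, here is a minimal Python sketch of the same batched fan-out: a queue stands in for the Service Broker queue, documents split into batches of 100, and each updated document enqueues a message for its files. All data and names here are invented; the real implementation is T-SQL plus Service Broker.

```python
from collections import deque

BATCH_SIZE = 100  # documents per message, as in the post

def propagate_acl(case_docs, doc_files, case_id, new_acl):
    """Breadth-first, batched ACL propagation driven by a message queue."""
    acl = {}
    queue = deque([("case", case_id)])  # the user's change enqueues one message
    while queue:
        kind, payload = queue.popleft()
        if kind == "case":
            docs = case_docs.get(payload, [])
            # split the inheriting documents into batches of 100
            for i in range(0, len(docs), BATCH_SIZE):
                queue.append(("docs", docs[i:i + BATCH_SIZE]))
        elif kind == "docs":
            for doc in payload:
                acl[doc] = new_acl
                # each updated document enqueues a message for its files
                queue.append(("files", doc_files.get(doc, [])))
        else:  # "files"
            for f in payload:
                acl[f] = new_acl
    return acl

# tiny demo: one case, 250 inheriting documents, 2 files each
case_docs = {"case1": [f"doc{i}" for i in range(250)]}
doc_files = {d: [f"{d}-a", f"{d}-b"] for d in case_docs["case1"]}
result = propagate_acl(case_docs, doc_files, "case1", "read+write")
```

    The key point is that the user's transaction only enqueues the first message; all the fan-out happens asynchronously afterwards.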

    I look at articles like this and must admit I think the author is somewhat full of it. I just wish I had the experience to really assess the NoSQL technologies and how they operate. @David Poole, I think you are approaching what he says fairly reasonably - I've tackled the issues on most points in SQL Server and not had too much of a nightmare. I guess my data has just not been big enough *shrug*

    Admittedly, search is terrible in SQL Server. I've often sidestepped it (not tried the 2012 facilities as yet). But product cataloguing? There are plenty of adequate approaches - from the canonical Celko to my own simpler version(s). High-frequency trading, well, I've not been there.

    Search is tough but can be made to work well in an RDBMS. I was once tasked with building a database that took every major news story from around the world published on the internet and archived it so the text could be displayed on a website. I was asked to provide 'google-style' search, and an automatic feature that listed the most widely covered stories of the day. It was SQL Server 2000 at the time, and the archive was immense. I won't say it was easy, but it worked eventually. With five years of news stories and just under a terabyte of data, no search ever took more than a third of a second. It became easier with SQL Server 2005, of course, and I used the hard-won techniques in several other applications.

    Best wishes,
    Phil Factor

    SQL Server has actually had NoSQL data storage and analysis products bundled with it for over a decade. Take, for example, Analysis Services. People don't generally think of OLAP products as NoSQL database engines, but they actually are. Also, Full-Text Search, FileStream, and Common Language Runtime services can maintain huge amounts of unstructured data stored externally to the relational (mdf, ndf) data files, while remaining tightly integrated with the RDBMS engine and the T-SQL language.

    If your database is storing huge unstructured chunks of junk in varchar columns, then SQL Server itself has other services bundled with it that can offer a better solution, without resorting to a 3rd party product.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • So email makes the list. To be fair to the author, I am in the analysis business, and storage or platform beyond baseline criteria is of little interest to me. My objective is to get information out of data, and with email (or other unstructured data) the approach I take is exactly the same as with structured data.

    I start by drawing the decision trees my customers are going to use to drill into the data. These make it really easy for me to define the metadata, or architecture, for the data I'm going to warehouse. The final step is coding up the load process, which I quite like because I'm working with a two-field table.

    Just in case I've lost anybody, here's an example: my users' inquiries will for the most part relate to master data; that is to say, they will be looking for email containing a SKU number, shipment, purchase order, contract etc. When I load the data I search for text containing the master-data key field and populate the results into the index tables. My email is now firmly in the relational space. I understand the "what if", but as long as I monitor search traffic and make sure the vast majority of requests fall within my structure, I've dodged the pink slip for another day.
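    As a rough illustration of that load step (every email body and key here is invented), scanning unstructured text for master-data keys and producing rows for a two-field index table might look like:

```python
import re

# invented example data: two email bodies and a master-data key list
emails = {
    1: "Please expedite the shipment for PO-1001 covering SKU-777.",
    2: "The contract C-42 renewal is attached.",
}
master_keys = ["PO-1001", "SKU-777", "C-42", "SKU-888"]

def keys_in(body, keys):
    """Return the master-data keys found in an unstructured email body."""
    # word-boundary match so 'PO-1001' does not also match inside 'PO-10012'
    return {k for k in keys if re.search(r"\b" + re.escape(k) + r"\b", body)}

# the two-field index table: (master-data key, email id)
index_rows = sorted(
    (key, mail_id)
    for mail_id, body in emails.items()
    for key in keys_in(body, master_keys)
)
```

    Once the index rows exist, the email is queryable through ordinary relational joins on the key column.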

  • Saying that you are implementing a NoSQL approach really means that you are removing an abstraction layer from your data model. The only way I can think of to describe this is to say that you are instead depending on services provided by the operating system. I agree completely that loading a server log into a database table is not the best way to access it. Notice then that you are going to be using disk file IO.

    Of course such services, being lower level, can easily be faster. They will also be more complex and require more time and skill to implement and maintain. They will also be less robust in terms of failures and error reporting. In my world of business software development, such systems are implemented by engineers. When I need such functionality, I look for a packaged product, and I personally would not touch such a system, professionally, with a ten foot pole.

  • RDBMS or NoSQL is irrelevant when you get the wrong answers

    Recommendations, e.g. "People who bought torches often bought batteries too".

    We all know 😀 they need matches, not batteries. :w00t:
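    Joking aside, the basic "people who bought X also bought Y" recommendation is just co-occurrence counting over purchase baskets. A minimal Python sketch (all baskets invented):

```python
from collections import Counter
from itertools import combinations

# invented baskets of items bought together
baskets = [
    {"torch", "batteries"},
    {"torch", "batteries", "tent"},
    {"torch", "matches"},
    {"tent", "matches"},
]

# count how often each unordered pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def recommend(item):
    """Items most often bought together with `item`, best match first."""
    scored = []
    for (a, b), n in pair_counts.items():
        if item in (a, b):
            scored.append((n, b if a == item else a))
    return [other for _, other in sorted(scored, reverse=True)]
```

    Whether the top-scoring pair is actually a *useful* answer (batteries vs matches) is exactly the data-quality problem the joke is about.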

    <><
    Livin' down on the cube farm. Left, left, then a right.

  • Thank you for the interesting topic, and the link to Andrew C. Oliver's article, which had other links I followed. I have almost zero experience in any of these tasks, and have been wondering what other databases are out there for NoSQL solutions. I am looking forward to reading more comments on this thread.

  • I've always accepted that all the tasks we want our digital machines to do are going to vary in the methodologies best used to implement them. I'm also hopeful that SQL folks at some point in their careers come to the same conclusion, as it would seem obvious to me, but I'm far from certain that this will be the case.

  • Ultimately any task where you care more about high speed and low cost than you do about consistency is a great candidate for NoSQL. However, I have to comment on this one.

    Users/groups and ACLs: ('To some degree, LDAP was the original NoSQL database').

    There are vulnerabilities in AD that resulted from this philosophy. For example: http://social.technet.microsoft.com/Forums/en-US/exchangesvrclientslegacy/thread/3da53460-ef76-4f01-94c9-f7b96fdaf99d

    I also find the high-frequency trading bit to be really scary. Given the number of news stories about these applications going rogue I'm not sure I'd use it as a poster child for NoSQL. Not that these issues are related to NoSQL.

    -DW

  • Darren Wallace (11/20/2012)


    Ultimately any task where you care more about high speed and low cost than you do about consistency is a great candidate for NoSQL.

    That pretty much sums it up, but I'll throw in my 2c anyway 🙂

    Recommendations & time-series/forecasting: These are data-mining tasks where you can certainly use an RDBMS as a data store. Both of these tasks can be done using built-in algorithms in SSAS projects, with data sourced straight from SQL Server or, if you like, a cube. Other data-mining packages like SAS, SPSS, Statistica, R and even tools like ggobi or weka expect their input data to be structured, at the very least as delimited flat files. If you are feeding unstructured data into these tools you will first need to do a lot of structuring! Pre-preparation of data is 99% of the work in data-mining: binning, grouping, binarising... these are all essentially structuring techniques that are compulsory for some algorithms. You can do that from NoSQL, sure, but you can't get away from the need to first structure the data. What can you really do with unstructured data? Store it, view it - that's about all. Interpreting the data on the fly still counts as structuring.
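    To make "binning" and "binarising" concrete, here is a minimal Python sketch of both structuring steps (the ages, edges and colours are invented example data):

```python
def bin_value(value, edges):
    """Bin a continuous value: the index of the first edge it falls
    below, or len(edges) if it is above them all."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def one_hot(category, categories):
    """Binarise a categorical value into 0/1 indicator columns."""
    return [1 if category == c else 0 for c in categories]

# bin ages into <30, 30-49, 50+
ages = [23, 37, 61]
binned = [bin_value(a, [30, 50]) for a in ages]

# binarise a colour column into three indicator columns
colours = ["red", "green", "red"]
encoded = [one_hot(c, ["red", "green", "blue"]) for c in colours]
```

    Both transformations turn raw values into the fixed-width, fully structured input that most mining algorithms insist on.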

    Admittedly, Amazon is the king of the "recommendation", and they use their own NoSQL datastore, but when they were getting started, distributed RDBMSs were fairly primitive, so it's no wonder they didn't use something off the shelf. When you are a big company that specialises in big data and derives immense value directly from the clever handling of that data, you need proprietary competitive advantage, and can justify the cost.

    Their EC2 service now lets you spin up big clusters of SQL Server (and other RDBMSs), and I wonder why they would do that if NoSQL were the only way? Kimball's whitepaper on big data suggests that when you are "drinking from the firehose" (I love that visceral imagery), NoSQL makes sense as an initial data store. With Microsoft investigating Hadoop, I can see an extended Kimball process of capture to NoSQL --> ETL (with structuring) to RDBMS --> BI/Reporting/Analytics making sense for most companies. If you want to skip the middle bit then you'll no doubt be using a team of Java gurus and rolling your own analytics and reporting layer, and that's where it's going to cost a lot. I know it sounds counter-intuitive to say the shorter process will cost more, but I think Microsoft will shortly provide easy tools to do ETL from NoSQL to RDBMS as a staging process. They might even plug BI tools directly into NoSQL, but I find that less likely; the analogy is doing BI directly from OLTP, which is problematic and has been well covered. Again, it comes back to latency, how real-time you need to be, and how much that's all worth to you.

    LDAP: LDAP queries are not fun. Having tried to query AD via a linked server using both SQL and LDAP syntax I can tell you that it is like pulling teeth. I'd use some existing package for that any day. That's not quite the same thing as storing ACL data in an RDBMS though.

    The media repository argument is interesting. The way blobs are handled in RDBMSs seems to flip-flop between storing them in tables and storing pointers to file-system objects, and there doesn't seem to be any agreement about the best way to do it. I suppose something closer to the filesystem naturally makes sense, but it's all about how you index and search for them. I'm interested in what develops there.

  • Darren Wallace (11/20/2012)


    Ultimately any task where you care more about high speed and low cost than you do about consistency is a great candidate for NoSQL.

    Anyone have a view on data quality with regard to NoSQL? Does it help/hinder?

  • @david.Poole

    I reckon that it varies enormously with the product, and with a single product over time. In some cases, the data consistency problems haven't been exactly subtle. Did you see ...

    http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb

    http://www.infoq.com/news/2011/11/MongoDB-Criticism

    ... but things have been quieter for a year so these problems seem to have been sorted out.

    Best wishes,
    Phil Factor
