Pre-built or Ad Hoc

  • Comments posted to this topic are about the item Pre-built or Ad Hoc

  • I think Steve has a point here, but I think it also goes deeper, into the philosophy (if you want to call it that) of having a pre-defined "structure" or "model" or not. This was debated before SQL databases took over the world, and it is still unresolved.

    Some users really do want to work with data "as captured" and work with tools which allow them to impose their own structure on it. I've worked with some of them, and they are often people who understand the value of a formal model. In fact that seems to be what they are trying to construct. In my experience they are also quite rare. A more common breed are the ones who "want pre-built structures they can count on and use reliably to answer questions". They are not interested in the abstract thinking required to create the model, they just want data which they can use to give them reliable answers to their questions.

    The last time I did it, creating the data model was one of the "hard bits". That kind of abstract thinking is rarely easy and many of the people who want (and need) to use a data model are not well suited to creating it. This is a paradox.

    I agree that "learning how to access new sources, like Hadoop, and present that data back to users in a familiar format" seems like a good thing to do. Myself, I'm trying work out how to do it! On the one hand I want to retain the flexibility which is inherent in NoSQL, but on the other I want the reliability of "columns and rows" (and constraints too, if I can have them) and I want to get from one to the other quickly, without the struggles I remember from the past.

    Coincidentally, I just stumbled across a 30-year-old text on converting a logical data model to a _hierarchical_ database design. I guess that is going from "logical" to something one or more stages less flexible than a relational database. Problems in this area are not new!

    Tom Gillies LinkedIn Profilewww.DuhallowGreyGeek.com[/url]

  • The movement away from schema-defined XML to schemaless JSON will only exacerbate the problem. With NOSQL work still has to be done, but processing is deferred from ETL to query time.

    There's another issue here too: The demands of developers to be able to deliver solutions without the RDBMS knowledge to ensure a successful implementation.

    Is this a consequence of a mismatch between the content of computer science courses and the requirement for RDBMS training in the real world?

  • iposner (4/13/2015)


    There's another issue here too: The demands of developers to be able to deliver solutions without the RDBMS knowledge to ensure a successful implementation.

    Is this a consequence of a mismatch between the content of computer science courses and the requirement for RDBMS training in the real world?

    Actually it probably has more to do with the mismatch between good principles of object-oriented design and good principles of relational database design. They're very different ways of thinking. But you probably have a point.

    And while Drs Boyce and Codd founded their work on solid mathematical principles, sometimes reality can be a real pain in the backside. Sometimes things just don't fit - or don't fit very well - into a normalized relational database.

    Various NoSQL v SQL religious wars notwithstanding, the data world is certainly realizing there are a number of data storage strategies and not everything is a one-size-fits-all solution. Like everything else, it comes down to evaluating the job at hand, coming to a decision on what techniques and technologies to use that would best serve that job, then coding appropriately rather than simply sticking to a given technology because it's comfortable.

    On the other hand, business reality means you often have to make the best of what you have to work with. Folks who can successfully navigate those challenges are the ones companies pay to keep.

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • Working both as DBA and as .NET developer I can see the added value of NOSQL (not only SQL) solutions during the rapid development of new solutions. Reports are generally not required until the solutions holds a substantial amount of data. If both the client and the developer agree upon using some form of agile development method, reports will have very low priority until the application has been up-and-running for quite a while. At that point a well defined and well structured data model really pays off, but more often than not this becomes the moment to realize that the data store is not suitable for reporting at all. However, convincing clients to take a step back and think about the data they want to get out off their application instead of thinking about the data they want to put into their application early in the development process is not a simple task.

    The net result is generally some kind of ETL (extraction, transfomation and load) into a tabular form. Extracting report data from a database designed for large volume transaction processing is not that uncommon anymore. I guess we'll get acustomed to new this kind of ETL on those not-so-well-structured data sources in time and maybe some new methods and tools will help us out when these data stores become more and more common. Don't blame it on the mismatch between object oriented design and the relational data model; If an application is well designed from the ground up, then it is always possible to design a structured data model that fits the needs. It's a sad thing that no one can predict how much development hours a good application design will save during the rest of its lifetime. That makes it so hard to sell any effort you put into it to a client who wants you to 'just make it work'.

    Developer education is another point of concern. It is hard to grasp the principles of general programming, data structures, relational algebra, set based processing, object oriented development and many more patterns, algorithms and paradigms used in every real life application. But it seems to become harder and harder to find well educated developers (or even programmers) each and every year. Let's hope for the best and prepare for the worst ...

  • Well, it's a schema on read environment versus schema on write. You have to accept the pros and cons if you want to utilize schema on read. Things like having to spend time exploring the data because it's not self-documented when changes happen due to that flexibility you mentioned.

    So, I really do agree there are users that want things defined and those that don't mind exploring to define. Hive makes it easy to create those schemas in tables or views when it is found for Excel reporting. However, if you're dealing with a group that does not care and wants it structured every time, then you may be facing a uphill battle with schema on read.

    In my mind, NoSQL is the one true data platform that can work with RDBMS that would structure the chaos.

  • vliet (4/13/2015)


    However, convincing clients to take a step back and think about the data they want to get out off their application instead of thinking about the data they want to put into their application early in the development process is not a simple task.

    This is a very good point to emphasize too. I actually wrote a blog article on this subject as well.

    What's most important? The core product will always come first. Other features that you need such as reporting or even customer support features and so forth come in at a distant second, third, fifth or never. There simply is no easy way to get everyone on board when talking about operations, reporting, marketing and so forth in relation to their needs after the product has been brought to market.

    Unfortunately, when you find yourself in those situations, by the time you get around to adding those features, you quickly find out that it's not as easy to add them because of what you have already developed. So, you're left with a lot of roadblocks after the fact.

  • Keep in mind that Mongo and Hadoop should be thought of as data capture technologies. They can make it very fast to store data requiring later processing into something useful. They do it by dispensing with data integrity and use parallel processing (usually multiple machines). This has its use when your RDBMS can't handle the load. But other than document storage or log processing you end up processing NoSQL data into a RDBMS anyway to make it useful and available such as in a data warehouse.

    The truth of the matter is most software shops don't come close to pushing the limits of their RDBMS because they simply don't have the skills to do so. Many programmers don't like to do the tried and true but gravitate towards fads due to ignorance and the salesmanship of product evangelists. They will happily work long hours creating the next big mess and badge themselves as NoSQL experts (and get away with it since there is no one around to say otherwise). The idea that people choose the best tool for the job in the real world is pure fantasy.

  • I am primarily a developer and much of what I have seen for NOSQL comes from what I call the "Google" effect. It is well known that Google doesn't use a RDBMS for their indexing. And if one considers the problem domain for web pages, then it only makes sense.

    But most of us are writing LOB applications and a schema is there. I do see some value for allowing to capture that portion that doesn't fit into our current structure. But over time, the vast majority of information fits into a schema based collection quite well. (As long as we get the basic concepts of normalization to begin with.)

    Add in the fact that many developers are self taught and don't have basic concepts of program design and what can you hope for in db design?? I end up seeing many relationships that should be 1 to 0 or 1 end up as 1 to many. I see many to many when it should be 1 to 1. But if we can use those cool NOSQL toys we can hide the lack of ability. And all works well until they try to make a report. But then - hey on to the next adventure because it isn't green field anymore.

  • No matter how you store it, the reader still has to understand it or it makes no difference. If having no structure works so you can still present it well to the reader then great.

  • EGray (4/13/2015)


    Keep in mind that Mongo and Hadoop should be thought of as data capture technologies. They can make it very fast to store data requiring later processing into something useful. They do it by dispensing with data integrity and use parallel processing (usually multiple machines). This has its use when your RDBMS can't handle the load. But other than document storage or log processing you end up processing NoSQL data into a RDBMS anyway to make it useful and available such as in a data warehouse.

    The truth of the matter is most software shops don't come close to pushing the limits of their RDBMS because they simply don't have the skills to do so. Many programmers don't like to do the tried and true but gravitate towards fads due to ignorance and the salesmanship of product evangelists. They will happily work long hours creating the next big mess and badge themselves as NoSQL experts (and get away with it since there is no one around to say otherwise). The idea that people choose the best tool for the job in the real world is pure fantasy.

    Yep. But, I think the trending terminology is, "Data Lake".

    NoSQL is the reservoir (chaotic schemaless data) where RDBMS is the treatment plant (structured relational data) serving water to the city (the business; end user).

  • Reporting in NoSQL databases is generally going to be through services or interfaces developed by the same teams who did the data model. That works great (performance questions aside) provided that your questions can be answered through those interfaces. Ad Hoc reporting is possible, but harder, because of schema inconsistencies that are not easily discovered. Those services can also serve as ETL sources to dump into a more traditional warehouse for querying. You just need to have processes in place to keep those interfaces up to date, since it's easy to defer them unless they explicitly break.

  • cdesmarais 49673 (4/13/2015)


    Reporting in NoSQL databases is generally going to be through services or interfaces developed by the same teams who did the data model. That works great (performance questions aside) provided that your questions can be answered through those interfaces. Ad Hoc reporting is possible, but harder, because of schema inconsistencies that are not easily discovered. Those services can also serve as ETL sources to dump into a more traditional warehouse for querying. You just need to have processes in place to keep those interfaces up to date, since it's easy to defer them unless they explicitly break.

    Its not just the NoSQL databases that are like this. There are many normal RDBMs that have developers using them as dumb storage for their ORM. Have been in meeting where I have asked how reporting is supposed to be done and have had blank looks as the only response.

  • Interesting deck from Martin Fowler: http://martinfowler.com/articles/schemaless/

  • Steve, Thanks for the pointer to Martin Fowler's piece. I enjoyed that. It's a really good summary of the arguments.

    The contributions to this discussion have been thought-provoking too. For some reason I hadn't included XML schemas in my thinking. I do now!

    I know SQL and NoSQL and XML are just tools. All can be used well and all can be abused. Like Fowler, my preference is for "a schema" because it gives "something to validate against" (which is all in one place too!). I also remind myself that the design of an SQL database for OLTP looks rather different to an OLAP one for the same domain.

    "The Google Effect" is a good point. "What searches people have performed" and "the contents of websites" are neither valid, nor invalid, they just _are_!

    I like the idea of a "data lake". I understand the point about having some kind of ETL process to transform something untrusted into a more trusted and structured form. I would be wary about carrying it too far though - to pursue the water metaphor, most civilised places have rules about polluting rivers or lakes that are used as a source for drinking water.

    I've learned something about my prejudices here and I think I want to learn more about NoSQL, and from there what the relative merits of the two approaches are. More research for my CPD plan!

    Tom Gillies LinkedIn Profilewww.DuhallowGreyGeek.com[/url]

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply