The Power of the Data Model

  • xsevensinzx - Wednesday, February 15, 2017 1:49 AM

    As I start to dive more and more into the NoSQL realm, I'm actually inclined to disagree with the typical modeling practices being more important. The amount of time spent on modeling and maintaining that model in certain scenarios can be extremely tedious, slow, and expensive in certain situations.

    Dumping data into a big bucket after conforming and validating it just seems more painless and flexible. "All your data, whatever it may be, for or against the model, now exists here. Have fun!" I really like the sound of that, but not without a price. What you fail to do on the back-end can transfer to the front-end. However, shifting that power of defining the model on the back-end to the front-end is surprisingly better for those who are not database professionals. That's because now, they can model the data on the front-end any way they want and fully control it.

    My concern with this is that you make things REALLY easy for the people who write the data and they may do this only once.  But you make it REALLY hard for the people who read data who may do it many times.
    It is an over-simplification but writing data incurs cost, reading data generates revenue.

    I became a DBA because my work as a data analyst was severely hampered by the structure and quality of the data and that I was receiving.  A vast amount of time was spent doing stuff to data rather than doing stuff with data.
    Vast amounts of time were spent getting data into a form where I could derive insight from it and then I was up against the clock, having burned most of my time, in getting something of value out by the proscribed deadlines.

    Schema on read is great when you have an (largely) unchanging schema and a disciplined data source.  It's a bit of a nightmare when trying to do something with todays ill thought out whimsy.

    Document stores may allow developers to do object.save() but designing the document structure is pretty important.

  • Could you elaborate on how this works a little more?

    It seems to me that this is a really dangerous state of affairs - and I'll set aside the obvious problem of different users creating different models with different semantics.

    Sure. We are in a new age where we have a number of visualization (i.e.: PowerBI, Tableau, ETC) tools that allow us to remodel the existing data. We are able to take the original model, transform it without impacting the physical model, and transpose data in new ways within the front-end tools. This is what I mean by shifting that modeling from the back-end to the front-end. It means that instead of having the DBA, ETL Developer, whomever do the work, you can now have Report Developers, Visualization experts and so forth take on similar responsibilities with managing their own models/reports.

    Over my career there have been countless occasions on which someone in the business has come over to IT looking for "help with a query". You listen to their problem, and you start to wonder what their actual end goal might be

    This is true, but now tools are generating SQL code and more (i.e.: NoSQL tools) based on point-and-click models. While you can argue it's not the best optimized code, it does get the job done where you don't need to poke IT for a query. You can depend on the front-end application to do that for you and populate the model you created and or go directly to a report. This becomes a self-serving model, which is good.

    I could list off dozens of examples of this

    This is ignorance and has nothing to do with the technology, model and so forth. In my experience, the analyst have a good understanding of the data. They know the difference between the attributes and metrics. They have a understanding of the grains and how to ask questions of the data. When you run into cases where a user is averaging and average, they are completely oblivious to the data. They really have no right touching it. Yes, I know this is common, but you can either take your approach to your data seriously, or you can just randomly pull data without knowing what you're doing.

    My concern with this is that you make things REALLY easy for the people who write the data and they may do this only once. But you make it REALLY hard for the people who read data who may do it many times.
    It is an over-simplification but writing data incurs cost, reading data generates revenue.

    I became a DBA because my work as a data analyst was severely hampered by the structure and quality of the data and that I was receiving. A vast amount of time was spent doing stuff to data rather than doing stuff with data.

    Vast amounts of time were spent getting data into a form where I could derive insight from it and then I was up against the clock, having burned most of my time, in getting something of value out by the proscribed deadlines.

    This is very true. If you can find a consistent model that always answers the same questions then a predefined model will always win. But, there are plenty of instances where flexibility is needed. Having the power to choose is what makes many products, regardless if they deal with data, succeed. Giving the end user the power to choose a different path is pretty powerful. Most analyst want that power, but they struggle with harnessing and using that power because of the limited technology skill-set they have (hence the prior poster referencing constantly getting poked for a new query). This is why in my experiences, a predefined model is not everything. Most analyst are fine with building a case to answer their data questions as long as they have the power to do so. But, it really depends on the situation and business.

    For example, I work with a lot of data scientist and data analyst who all know the data and love to have flexibility with said data. Even with predefined models, questions of the data are constantly changing. This means I'm constantly having to keep up with that change.

    With the proper business process--not technology--you can still maintain a flexible model on the front-end that's in the hands of the visualization/reporting team. And before anyone says it, not every analyst or business user has good understanding of data. This is what I focus most of my time on; education. I'm training the end users to be self-serving. I'm making them into mini-SQL developers, data modelers and more. I'm giving them the power to build their own worlds where I can just sit back and eat everyones lunch. 😀

  • It appears this thread has morphed into a Data Governance topic.  We have laws at the national, state, and local levels just as we should have data models at each level of a national corporation.  Each level should also have it's own conceptual models.  Perhaps a data architect could maintain models for upper, middle, and lower management so the "grammar" is correct.  But, I would argue I.T. and I.S. are staff functions and it is managers who own the conceptual models and the right to reorganize the business - not Data Architects.

    The article left many of us wanting to see the conceptual model the author referred to.  After fretting about it, I went to google images and searched on Conceptual Models and immediately saw the value of creating an agreed upon Systems Model of a business.  See the results yourself and be amazed!

  • xsevensinzx - Wednesday, February 15, 2017 5:38 AM

    This is true, but now tools are generating SQL code and more (i.e.: NoSQL tools) based on point-and-click models. While you can argue it's not the best optimized code, it does get the job done where you don't need to poke IT for a query. You can depend on the front-end application to do that for you and populate the model you created and or go directly to a report. This becomes a self-serving model, which is good.

    I agree that self service is good, and that there are tools out there which generate code from visual designers. And I absolutely agree that this code is universally terrible for anything but the most trivial query. But that's not where my main concern lies. It lies with the fact that, without a well structured model to provide guidance, it is difficult to turn data into useful information. Indeed, that's one of the primary goals of someone who actually creates the models, all the way from conceptual to physical schema. Even with this guidance users sometimes get lost. They ask a question (ie, write a query, be it directly in code or via a designer) from which, yeah, they get a result set. But the result set doesn't really have anything to do with the question they really meant to ask. A well structured model will significantly reduce this kind of error, but still never quite eliminates it. I can only presume that the absence of any model at all would massively increase errors of this kind.

    has nothing to do with the technology, model and so forth

    I agree it doesn't have anything to do with the technology, but I claim it has a lot to do with the model, as per what I said above. A clear model provides clear semantics, thus reducing semantic errors. Would you agree? I concede that a well trained analyst who knows how to logically structure their question and is familiar with the business may well be able to figure out the correct semantics. But at this point we're not really describing an analyst, we're describing a database programmer!

    Even with predefined models, questions of the data are constantly changing. This means I'm constantly having to keep up with that change.

    By this do you mean there's a need to keep changing the model? If so then it sounds like a problem with the model. If the model accurately represents reality of the business then any question one can reasonably ask about the business, one can reasonably ask of the model, by definition. The need to change a model in response to changing questions tells me that the model includes assumptions about entity relationships, or cardinalities, or the appropriate location of attributes, etc, that weren't correct. Ie, this happens if and only if the model is not an accurate representation of reality in the first place.

  • Don Halloran - Wednesday, February 15, 2017 8:10 AM

    xsevensinzx - Wednesday, February 15, 2017 5:38 AM

    Even with predefined models, questions of the data are constantly changing. This means I'm constantly having to keep up with that change.

    By this do you mean there's a need to keep changing the model? If so then it sounds like a problem with the model. If the model accurately represents reality of the business then any question one can reasonably ask about the business, one can reasonably ask of the model, by definition. The need to change a model in response to changing questions tells me that the model includes assumptions about entity relationships, or cardinalities, or the appropriate location of attributes, etc, that weren't correct. Ie, this happens if and only if the model is not an accurate representation of reality in the first place.

    That's where the problem lies. You have to define a model around all the questions you already know you want to ask. What about all the questions you have yet to ask? The ones you have yet to discover? That's where having a square box refines the end users to a point where they cannot think outside of the box because it breaks the model.

    It's not bad to have that model, that standard, but other times it can be restricting if those questions are constantly changing with either new data sources, or new ways of summarizing/aggregating existing datasets.

  • Don Halloran - Wednesday, February 15, 2017 1:24 AM

    I enjoy discussions about abstract models, partly because that's my primary area of expertise, and partly because there aren't enough of them!

    In any case, I agree that a good conceptual model can lead to a shared, unified language across the whole enterprise - IT and non-IT both included, and that this is one of the highlights of a good model. Sometimes it really does come down to the apparently - but not actually - simple task of naming things correctly. Clear, precise names mean clear, precise concepts in dicussion, which promotes in clear, precise requirements and clear, precise solutions. 

    On the other hand, I'm not so certain about the model itself being something that non-IT people need be exposed to directly. There's a sort of common misconception that data modeling is something anyone can do on intuition alone. And, sure, anyone *can* look at a business, its process and entities, and come up with some kind of model. But it takes actual expertise and experience to come up with a correct model.

    What do I mean by correct? I mean one with is, in fact, a representation of the subset of reality being modeled, and which also (and therefore) contains no internal contradictions. Reality, after all, entails no contradictions. Therefore if the model does, then the model is not a representation of reality. 

    Sure, sometimes this is trivial. But sometimes it is not, and being able to see when it is not is a skill, and it is this skill which requires expertise and experience. One needs to be able to look "through" the model and see all of the logical consequences of what has been wrought. I have found, in the past, that exposing business people to the model directly will result in them making suggestions regarding changes directly to the model - "let's put this column over here instead", or "this seems complicated, can't we just do x?", or "Why is inventory separate from asset? They're the same". (no, they're not), or "Why is customer separate from account? They're the same too" (no, they're not either). This is not a situation you want to be in. What the business people should be telling you about are the functional requirements, the business processes, and so on. They are absolutely not qualified to build, or critique, a data model as such.

    When I said that the conceptual "belongs" to the business - I don't think they necessarily have to own the drawing itself.  They do however have to weigh in on the modeling to agree that it represents how they view the world, and that it ultimately speaks to how they operate.  To me that's always been the gold standard.  As to some of the objections "the business" might have to some distinctions - the specific examples seem to point to more of having the "wrong part" of the business there. There may very well be parts of the business where the customer AND the account are in fact not distinct; others however DO need to make a distinction, which might make the account vs the customer different.  If however there are no business units within your org where there is an actual distinction - then what value does it serve to make two different entities?

    And Bill T - I fully agree that there's a definite link back to data governance.  One is just a visual representation of the other (admittedly of a subset of the other); they HAVE to be in agreement if you hope to have both.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Kimball stressed that you DON'T model to a desired output otherwise you ARE modelling for a defined set of questions.  If you model for what things are rather than for how you want to use them today then you can ask questions of the model that you have yet to think of.

    Products like Cassandra expect models to be physicalised for the specific questions that they have to answer.  That is the compromise they have chosen to give extreme read performance.

    HP Vertica stores its data in Projections not in tables.  A table is simply a metadata collection.  There will always be at least one Super Projection containing every field and record you would expect in the table.  Projections are optimised physicalised selections from the table.  Sort order and field order is important in gaining performance.  I believe that it also has the concept of prejoin projections which materialise data produced from joining more than one table.

    Both Cassandra and Vertica may hold more than one version of the truth.  This is also a compromise they have chosen

  • David.Poole - Thursday, February 16, 2017 12:10 AM

    Kimball stressed that you DON'T model to a desired output otherwise you ARE modelling for a defined set of questions.  If you model for what things are rather than for how you want to use them today then you can ask questions of the model that you have yet to think of.

    Let me try to not be so wordy this time. 🙂

    Most model for the desired output to start. It's not a bad idea to think about the questions you want to ask of the data, just as long as you don't get too specific where the model cannot be flexible to the questions you have not asked yet. Most do not know the questions they have yet to ask, but they have at least an idea to help define the data sources, certain must-have attributes and metrics. Otherwise, how would you know where to start? You can't possibly absorb every data source in the world right? You have to have a balance. At least that's how I interpret it anyways.

    In most cases in my experience, model is based on taking raw untouched data and conforming said data into a unified ecosystem. Defining relationships, cardinality, granularity, etc. Data goes from State 0 to State M based on the methodology, which is completely controlled by the ETL and model constraints.

    Think the issue I've run into is traversing the various states of the data and adding change to those states. I very much take the approach of keep sources in State 0 within their own schemas where change can happen to each source without impacting other sources. But this means billions of records as opposed to thousands of predefined summarization data that is more vague than it's original form because of that model (i.e.: going from seconds to days to improve readability).

    I think a lot of taking the approach of defining that model and ensuring governance over said model. Now insert the document stores. Gobs and gobs of data documents that is structured and non-structured. State 0 data that is untouched. You have the flexibility to define the model of all of these data sources on the front-end and now have the perfomance juice to query it more efficiently.

Viewing 8 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply