Where's the Schema?

  • Rod at work - Tuesday, July 3, 2018 9:34 AM

    The weirdest schema I've ever seen is known as HL7. I just don't get it. As Richard Campbell put it on the .NET Rocks podcast, "It's the non-standard standard."

    That looks like EDI.

  • Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.
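
    As a rough sketch of that pattern (names are illustrative, using the classic DocumentDB .NET SDK; a partitioned collection would also need the partition key passed in RequestOptions):

        using System;
        using System.Threading.Tasks;
        using Microsoft.Azure.Documents.Client;  // classic DocumentDB / early Cosmos DB SDK

        // A plain C# object; the SDK serializes it to JSON transparently.
        public class CustomerCart
        {
            public string id { get; set; }  // document key, e.g. the session token
            public string CustomerName { get; set; }
            public string[] Items { get; set; }
        }

        public static class CartStore
        {
            // Illustrative endpoint, key, and database/collection names.
            static readonly DocumentClient client = new DocumentClient(
                new Uri("https://myaccount.documents.azure.com:443/"), "<auth-key>");

            public static Task SaveAsync(CustomerCart cart) =>
                client.UpsertDocumentAsync(
                    UriFactory.CreateDocumentCollectionUri("shop", "carts"), cart);

            // The "single bookmark lookup": one point read by ID, no joins,
            // no server-side schema.
            public static async Task<CustomerCart> LoadAsync(string id)
            {
                var response = await client.ReadDocumentAsync<CustomerCart>(
                    UriFactory.CreateDocumentUri("shop", "carts", id));
                return response.Document;
            }
        }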

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

  • Rod at work - Tuesday, July 3, 2018 9:34 AM

    The weirdest schema I've ever seen is known as HL7. I just don't get it. As Richard Campbell put it on the .NET Rocks podcast, "It's the non-standard standard."

    If that's the 'new' Healthcare EDI standard, I had recruiters a few years ago telling me that I should learn it. Apparently there was/is a lot of money in translating between the old and new standards, and it was something you could do from home as a freelancer. I never did it, but I've always wondered if that turned out to be a high-demand field.

  • HighPlainsDBA - Tuesday, July 3, 2018 10:24 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

    The answer to your question is that DocumentDB will not (and, IMHO, should never) be used for transaction processing, reporting, or meaningful analysis. That's what the enterprise OLTP databases and the data warehouse (SQL Server 2016) are for.

    DocumentDB is nothing more than a denormalized, pre-aggregated, customer-centric data mart that gets refreshed from the data warehouse daily. The mobile app fetches a JSON document containing the customer's payment history, account information, etc. The choice of a document database, specifically Azure DocumentDB, is all about scalability, concurrency, availability, and an extremely high volume of request throughput. Maybe the schemaless nature of DocumentDB makes it easier to develop and prototype the application, but that's not even the right reason to choose it for a production database.
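
    Shape-wise, the per-customer document might look something like this (a sketch; field names are invented for illustration):

        using System;

        // Illustrative shape of the denormalized, customer-centric document
        // that the nightly warehouse refresh would write out: one per customer.
        public class CustomerSnapshot
        {
            public string id { get; set; }                 // customer ID doubles as the document key
            public string Name { get; set; }
            public AccountInfo Account { get; set; }       // embedded, not a foreign key
            public Payment[] PaymentHistory { get; set; }  // pre-joined and pre-aggregated upstream
            public DateTime RefreshedUtc { get; set; }     // stamped by the daily refresh job
        }

        public class AccountInfo
        {
            public string Status { get; set; }
            public decimal Balance { get; set; }
        }

        public class Payment
        {
            public DateTime Date { get; set; }
            public decimal Amount { get; set; }
            public string Method { get; set; }
        }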

    Comparing DocumentDB to SQL Server is like comparing sheet rock to brick and mortar as construction material. In general, neither is better than the other; they are both ideal for specific applications. For example, using brick when you could have used sheet rock will drive up construction costs, and using sheet rock where you should have used brick will cause the building to collapse. Only a fool would ideologically stick to one.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Steve Jones - SSC Editor - Tuesday, July 3, 2018 9:37 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    Have you seen this? https://vincentlauzon.com/2018/06/27/cosmos-db-stored-procedures-handling-continuation/?_lrsc=1c59c385-cfb7-4c28-a1df-9d7d87a627a4
    I think document stores work well for situations where you almost always deal with singletons. However, when you start trying to build reports, or when your singleton structure changes, you've got to ensure that all apps, including ETL, reports, and other potentially schema-bound systems (like warehouses), evolve as well. Plus, you have to handle the multiple structures in your code.

    I'm not saying that's bad, but it's going to be overhead over time.
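
    To make that concrete, here's roughly what "handling the multiple structures in your code" looks like once v1 and v2 documents coexist in a collection (a sketch with invented field names, using Json.NET):

        using Newtonsoft.Json.Linq;  // Json.NET

        public class Cart
        {
            public string Id { get; set; }
            public string[] ItemSkus { get; set; }
        }

        public static class CartReader
        {
            // Hypothetical evolution: v1 documents stored a single "sku" string,
            // v2 documents store an "items" array. Every consumer (apps, ETL,
            // reports) has to know about both shapes.
            public static Cart FromJson(string json)
            {
                var doc = JObject.Parse(json);
                var cart = new Cart { Id = (string)doc["id"] };

                if (doc["items"] != null)                      // v2 shape
                    cart.ItemSkus = doc["items"].ToObject<string[]>();
                else if (doc["sku"] != null)                   // v1 shape
                    cart.ItemSkus = new[] { (string)doc["sku"] };
                else
                    cart.ItemSkus = new string[0];             // unknown / older shape

                return cart;
            }
        }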

    I see document databases as either temporary persistence layers (e.g., shopping carts) or pre-aggregated, mostly read-only data marts for use by microservice applications.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • HighPlainsDBA - Tuesday, July 3, 2018 10:40 AM

    Rod at work - Tuesday, July 3, 2018 9:34 AM

    The weirdest schema I've ever seen is known as HL7. I just don't get it. As Richard Campbell put it on the .NET Rocks podcast, "It's the non-standard standard."

    If that's the 'new' Healthcare EDI standard, I had recruiters a few years ago telling me that I should learn it. Apparently there was/is a lot of money in translating between the old and new standards, and it was something you could do from home as a freelancer. I never did it, but I've always wondered if that turned out to be a high-demand field.

    I dealt with EDI many years ago when I was working in manufacturing. When I got into healthcare and first saw HL7, I recognized it as a form of EDI. Different rules and structures but similar concept.

  • .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    I totally agree with this. Schema implementation has to occur at some point, either when you load the data (RDBMS) or when you retrieve it (NoSQL). I think most developers like NoSQL simply because there's less up-front work, and managers like it because you can "get things done" sooner, but there's definitely a cost. I've seen NoSQL implementations where lots and lots of coding is done (which translates into higher development and maintenance costs) when, with a little forethought, it could have been done in an RDBMS environment with much lower overall cost and better performance.
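
    A toy example of that trade-off (invented names): the rule that is declared once in an RDBMS schema has to be re-implemented in every application that reads the schemaless version.

        using System;
        using Newtonsoft.Json.Linq;

        // In an RDBMS the rule lives centrally in the schema, e.g.:
        //   CREATE TABLE Orders (Qty int NOT NULL CHECK (Qty > 0), ...);
        // Against a schemaless store, each reader enforces it itself:
        public static class OrderReader
        {
            public static int ReadQty(string json)
            {
                var doc = JObject.Parse(json);
                var qty = (int?)doc["qty"]
                    ?? throw new InvalidOperationException("qty missing");   // NOT NULL, enforced late
                if (qty <= 0)
                    throw new InvalidOperationException("qty must be > 0");  // CHECK, enforced late
                return qty;
            }
        }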

  • Eric M Russell - Tuesday, July 3, 2018 11:05 AM

    I see document databases as either temporary persistence layers (e.g., shopping carts) or pre-aggregated, mostly read-only data marts for use by microservice applications.

    I think those are great uses for document DBs, just as key-value stores make great lookup storage.

  • Eric M Russell - Tuesday, July 3, 2018 10:56 AM

    HighPlainsDBA - Tuesday, July 3, 2018 10:24 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

    The answer to your question is that DocumentDB will not (and, IMHO, should never) be used for transaction processing, reporting, or meaningful analysis. That's what the enterprise OLTP databases and the data warehouse (SQL Server 2016) are for.

    DocumentDB is nothing more than a denormalized, pre-aggregated, customer-centric data mart that gets refreshed from the data warehouse daily. The mobile app fetches a JSON document containing the customer's payment history, account information, etc. The choice of a document database, specifically Azure DocumentDB, is all about scalability, concurrency, availability, and an extremely high volume of request throughput. Maybe the schemaless nature of DocumentDB makes it easier to develop and prototype the application, but that's not even the right reason to choose it for a production database.

    Comparing DocumentDB to SQL Server is like comparing sheet rock to brick and mortar as construction material. In general, neither is better than the other; they are both ideal for specific applications. For example, using brick when you could have used sheet rock will drive up construction costs, and using sheet rock where you should have used brick will cause the building to collapse. Only a fool would ideologically stick to one.

    Excellent answer. Thanks.

  • Here are my thoughts, as someone who uses a data store alongside my data warehouse in Azure.

    Most of the data I absorb into the data store is consistent in meaning: I'm expecting the same number of columns on every ingestion. Now, what is in those columns may change in terms of data types, but really, it does not matter much. For the most part, the data has a consistent schema before I actually define one.

    Do I consider this schema-less? Yes. This is because while I know it's consistent, the business itself has not agreed on the schema. It's not part of the data warehouse and is just a document being dropped into the store with little to no constraints.

    For our business, the data store is pretty critical. This has nothing to do with laziness. It has everything to do with the fact that we are doing analytics, and the data we ingest is not fully realized. Meaning, we don't have a schema defined for everything, because everything can change depending on the client. Flexibility is key, and that's often hard with a predefined schema.

    The other thing, which I feel is totally misrepresented in some of the comments here, is that data stores are becoming crazy fast for computational analysis and can very easily be fed into the BI layers without a data warehouse or data mart. As soon as data lands in the store, you can start utilizing the powerful analytical job services that are being developed to sit on top of these stores. They can crunch GBs of data, marry datasets together without indexing or statistics, and crank out new datasets that feed into the BI layers faster than it would take you to get a schema defined and integrated into a data warehouse.

    Does that mean the data warehouse is pointless? No. If you can optimize a model in the warehouse, that will likely always be the surest way to fast and, most importantly, high-quality results. The issue is that this takes time, and often the business does not want to wait.

  • HighPlainsDBA - Tuesday, July 3, 2018 10:24 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

    Not speaking about DocumentDB specifically, but with data stores like Azure Data Lake Store, you can define a schema, and the transformations of the data from that schema, as you read the data into a new document. This is all accomplished with analytical services that sit on top of the store. As I mentioned in my prior post, these are insanely fast and can do this with little setup on the data in terms of indexing, statistics, and so forth.

    For example, if I have two data sources being stored in the data store as daily CSV files, I can query 90 days of each data source in one query (using U-SQL to query from said store) in Azure Data Lake Analytics. This will combine both data sources, reading all 180 files into a schema I define in the query. Then I can JOIN the data sources together in the same query, transform them, aggregate them, apply R or Python scripts to them, and create an entirely new CSV file as the final output. That CSV file can then be fed into Power BI or Tableau, or, as I do, slammed right into my data warehouse and fed to the world.
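
    A stripped-down U-SQL sketch of that flow (paths and column names are invented; the real scripts are larger):

        // Schema-on-read over daily CSV drops in the lake. {FileDate:...} is a
        // file-set pattern: it binds a virtual column from each file's name,
        // so one EXTRACT reads all of the daily files at once.
        @sourceA =
            EXTRACT CustomerId string,
                    Amount     decimal,
                    FileDate   DateTime
            FROM "/raw/sourceA/{FileDate:yyyy}-{FileDate:MM}-{FileDate:dd}.csv"
            USING Extractors.Csv(skipFirstNRows: 1);

        @sourceB =
            EXTRACT CustomerId string,
                    Region     string,
                    FileDate   DateTime
            FROM "/raw/sourceB/{FileDate:yyyy}-{FileDate:MM}-{FileDate:dd}.csv"
            USING Extractors.Csv(skipFirstNRows: 1);

        // Join, filter to the last 90 days, aggregate, and write a brand-new
        // CSV: no indexes, no statistics; the schema exists only in this script.
        @result =
            SELECT b.Region,
                   SUM(a.Amount) AS Revenue
            FROM @sourceA AS a
                 INNER JOIN @sourceB AS b
                 ON a.CustomerId == b.CustomerId
            WHERE a.FileDate >= DateTime.UtcNow.AddDays(-90)
            GROUP BY b.Region;

        OUTPUT @result
        TO "/curated/revenue_by_region.csv"
        USING Outputters.Csv(outputHeader: true);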

  • xsevensinzx - Tuesday, July 3, 2018 9:54 PM

    HighPlainsDBA - Tuesday, July 3, 2018 10:24 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

    Not speaking about DocumentDB specifically, but with data stores like Azure Data Lake Store, you can define a schema, and the transformations of the data from that schema, as you read the data into a new document. This is all accomplished with analytical services that sit on top of the store. As I mentioned in my prior post, these are insanely fast and can do this with little setup on the data in terms of indexing, statistics, and so forth.

    For example, if I have two data sources being stored in the data store as daily CSV files, I can query 90 days of each data source in one query (using U-SQL to query from said store) in Azure Data Lake Analytics. This will combine both data sources, reading all 180 files into a schema I define in the query. Then I can JOIN the data sources together in the same query, transform them, aggregate them, apply R or Python scripts to them, and create an entirely new CSV file as the final output. That CSV file can then be fed into Power BI or Tableau, or, as I do, slammed right into my data warehouse and fed to the world.

    What you just described (flat files stored on bulk storage, schema at runtime, extremely fast analytics, output transformation to another file) sounds like Hadoop; are you talking about using HDInsight?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Steve Jones - SSC Editor - Tuesday, July 3, 2018 11:39 AM

    Eric M Russell - Tuesday, July 3, 2018 11:05 AM

    I see document databases as either temporary persistence layers (e.g., shopping carts) or pre-aggregated, mostly read-only data marts for use by microservice applications.

    I think those are great uses for document DBs, just as key-value stores make great lookup storage.

    Document databases like MongoDB, Riak, and Azure DocumentDB are basically key-value stores, except that they are optimized to hold JSON documents as the values. The thing is, across most database platforms (including RDBMSs like PostgreSQL and even SQL Server) there is a separation between the logical view projected by the API and the physical representation of the data on storage. What the developer sees as logical tables or documents is not necessarily stored on disk as tabular rows or JSON documents; it could be a column store, key-value pairs, or something even more exotic. We're all familiar with SQL Server's columnstore table format. Azure Cosmos DB currently provides Document, Column-Family, Graph, and Key-Value models, but the database engine unifies them all using something called "atom-record-sequence" (ARS).
    https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/

    We're in a Renaissance Age now, where our view of the database world is bigger, stranger, more multi-vendor, and more cross-platform than anything we've seen before.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Thursday, July 5, 2018 7:26 AM

    xsevensinzx - Tuesday, July 3, 2018 9:54 PM

    HighPlainsDBA - Tuesday, July 3, 2018 10:24 AM

    Eric M Russell - Tuesday, July 3, 2018 7:28 AM

    .. There's always a schema, and the rules have to be implemented up front or later on. Whether you use an RDBMS or a NoSQL store, you are going to be dealing with a schema. The question is: do you want to deal with it in a central location or in every application? ..

    The reason most application developers choose a document database is that they simply don't want to "deal with" the schema at all. A Java or C# object can be transparently persisted to the database as a JSON object. What happens on the front end is a single method call submitting the ID or session token, and what happens on the back end is a single bookmark lookup, returning a document containing all relevant related data for that customer. For example, a document database is ideal for an eCommerce shopping cart or a mobile app with a microservice-based architecture. It's schemaless in the sense that the developers are not designing the schema for a set of relational database tables; they are simply designing the properties of an object class. Maintaining multiple versions of a schema within a collection is technically possible, but that's not the goal.

    I'm currently working on a project where a customer base of several million can access their account information and purchase/payment history from a mobile app. We're considering Azure DocumentDB. As a DBA responsible for the availability and performance of SQL Server, I appreciate the fact that there is another option besides having the production online database (or even the data warehouse) pounded by thousands of requests per second. Yes, SQL Server can scale up to the task (it's been done before), but the level of effort and expense would be the equivalent of deploying an army to the moon.

    How do you aggregate the data and do meaningful analysis on these transactions if they are generic or dynamic objects in the data store? I would think a significant ETL effort would be necessary to structure them for reporting purposes. Not saying that I don't see the benefit on the development side, but it doesn't sound like a free lunch from an enterprise perspective.

    Not speaking about DocumentDB specifically, but with data stores like Azure Data Lake Store, you can define a schema, and the transformations of the data from that schema, as you read the data into a new document. This is all accomplished with analytical services that sit on top of the store. As I mentioned in my prior post, these are insanely fast and can do this with little setup on the data in terms of indexing, statistics, and so forth.

    For example, if I have two data sources being stored in the data store as daily CSV files, I can query 90 days of each data source in one query (using U-SQL to query from said store) in Azure Data Lake Analytics. This will combine both data sources, reading all 180 files into a schema I define in the query. Then I can JOIN the data sources together in the same query, transform them, aggregate them, apply R or Python scripts to them, and create an entirely new CSV file as the final output. That CSV file can then be fed into Power BI or Tableau, or, as I do, slammed right into my data warehouse and fed to the world.

    What you just described (flat files stored on bulk storage, schema at runtime, extremely fast analytics, output transformation to another file) sounds like Hadoop; are you talking about using HDInsight?

    Somewhat; it's Azure Data Lake Store plus Azure Data Lake Analytics, which is essentially like HDFS + Hive + Spark, etc.

    Azure Data Lake Store is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem. Your existing HDInsight applications or services that use the WebHDFS API can easily integrate with Data Lake Store. Data Lake Store also exposes a WebHDFS-compatible REST interface for applications.

    Azure Data Lake Analytics is a distributed, cloud-based data processing architecture offered by Microsoft in the Azure cloud. It is based on YARN, the same as the open-source Hadoop platform. It pairs with Azure Data Lake Store, a cloud-based storage platform designed for Big Data analytics.

  • When I think of schemaless data, I think of data persisted to a storage medium that has no concept of a schema. It is the responsibility of the consuming app to apply one.
    Where you are persisting data from a mechanical device such as a sensor, you have a guaranteed contract and potentially a huge volume of data persisted at very high speed. The overhead of validating the schema at persistence time is undesirable and unnecessary.
    If you are persisting data from an evolving web application, then schemaless stores give with one hand and take away with the other. For this reason, tools such as Apache Avro and the Hive metadata catalogue have evolved to provide an optimised means of formalising a schema on a "schemaless" store.

    Key-value stores are useful where you need transient objects retrieved by a key. Only the application knows anything about the object that forms the value part of the key-value store.
    Document stores are useful where you may need the data store to allow querying and filtering by parts of an object other than the key. They are great for forum and survey-type applications.

    Stores such as Elasticsearch are great when it comes to scanning a large volume of semi-structured data such as log files.

    These types of store are great when used for the specific purpose for which they were designed. The problem comes when the use case evolves into a more general need. An RDBMS is a good all-round solution for general use.

    RDBMSs have been maturing for decades. The NoSQL solutions haven't, but they shouldn't be dismissed. If you were to put log files into an RDBMS, it would be an expensive way of providing a facility that is used only sporadically.
    Similarly, if you use an RDBMS for transient session state, you are burning I/O and incurring the cost of a two-phase commit for data that might have a shelf life of a few seconds. The need for a formalised schema is limited to the application; there is no requirement for a long-term schema.
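
    For example, transient session state fits naturally in a key-value store with a short TTL. A minimal sketch using StackExchange.Redis (connection string and key naming are illustrative):

        using System;
        using StackExchange.Redis;

        public static class SessionState
        {
            static readonly ConnectionMultiplexer redis =
                ConnectionMultiplexer.Connect("localhost");  // illustrative connection

            // The store sees an opaque blob behind a key that expires on its
            // own; only the application knows the shape of the value.
            public static void Save(string sessionId, string stateJson) =>
                redis.GetDatabase().StringSet(
                    "session:" + sessionId, stateJson, expiry: TimeSpan.FromSeconds(30));

            public static string Load(string sessionId) =>
                redis.GetDatabase().StringGet("session:" + sessionId);
        }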
