I had a previous blog comparing Cosmos DB to a relational database (see Understanding Cosmos DB coming from a relational world) and one topic that it did not address that I want to now is how to handle reference tables that are common in the relational database world.
A big difference with Cosmos DB compared to a relational database is you will create a denormalized data model. Take a person record for example. You will embed all the information related to a person, such as their contact details and addresses, into a single JSON document. Retrieving a complete person record from the database is now a single read operation against a single container and for a single item. Updating a person record, with their contact details and addresses, is also a single write operation against a single item. By denormalizing data, your application typically will have better read performance and write performance and allow for a scale-out architecture since you don’t need to join tables.
(Side note: “container” is the generic term. Depending on the API, a specific term is used such as “collection” for the Cosmos DB API). Think of a container as one or more tables in the relational world. Going a little deeper, think of a container as a group of one or more “entities” which share the same partition key. A relational table shares a schema, but containers are not bound in that way.)
Embedding data works nicely for many cases but there are scenarios when denormalizing your data will cause more problems than it is worth. In a document database, you can have information in one document that relates to data in other documents. While there may be some use cases that are better suited for a relational database than in Cosmos DB (see below), in most cases you can handle relationships in Cosmos DB by creating a normalized data model for them, with the tradeoff that it can require more round trips to the server to read data (but improve the efficiency of write operations since less data is written). In general, use normalized data models to represent one-to-many relationships or many-to-many relationships when related data changes frequently. The key is knowing whether the cost of the updates is greater than the cost of the queries.
When using a normalized data model, your application will need to handle creating the reference document. One way would be to use a change feed that triggers on the creation of a new document – the change feed essentially triggers an Azure function that creates the relationship record.
When using a normalized data model, your application will need to query the multiple documents that need to be joined (costing more money because it will use more request units), and do the joining within the application (i.e. join a main document with documents that contain the reference data) as you cannot do a “join” between documents within different containers in Cosmos DB (joins between documents within the same container can be done via self-joins). Since every time you display a document it needs to search the entire container for the name, it would be best to put the other document type (the reference data) in a different container so you can have different partition keys for each document type (read up on how partitioning can make a big impact on performance and cost).
Note that “partitioning” in a RDBMS compared to Cosmos DB are different things: partitioning in Cosmos DB refers to “sharding” or “horizontal partitioning“, where replica sets which contain both the data and copies of compute (database) resources operating in a “shared nothing” architecture (i.e. scaled “horizontally” where each compute resource (server node) operates independently of every other node, but with a programming model transparent to developers). Conversely, what is often referred to as “partitioning” in a RDBMS is purely a separation of data into separate file groups within a shared compute (database) environment. This is also often called “vertical partitioning”.
Another option that is common pattern for NoSQL databases is to create a separate container to satisfy specific queries. For example, having a container for products based on category and another container for products based on geography. Both of those containers for my query/app are being sourced from one that is my “main” or “source” container that is being updated (front end, or another app) and the change feed attached to that pushes out to my other containers that I use for my queries. This means duplicating data, but storage is cheap and you save costs to retrieve data (think of those extra containers as covering indexes in the relational database world).
Since joining data can involve multiple ways of reading the data, it’s important to understand the two ways to read data using the Azure Cosmos DB SQL API:
- Point reads – You can do a key/value lookup on a single item ID and partition key. The item ID and partition key combination is the key and the item itself is the value. For a 1 KB document, point reads typically cost 1 request unit with a latency under 10ms. Point reads return a single item
- SQL queries – You can query data by writing queries using the Structured Query Language (SQL) as a JSON query language. Queries always cost at least 2.3 request units and, in general, will have a higher and more variable latency than point reads. Queries can return many items. See Getting started with SQL queries
The key in deciding when to use a normalized data model is how frequently the data will change. If the data only changes once a year it may not be worthwhile to create a reference document and instead just do an update to all the documents. But be aware that the update has to be done from the client side spread over the affected documents, doing it in batches as one big UPDATE statement does not exist in Cosmos DB. You will need to retrieve the entire document from Cosmos DB, update the property/properties in your application and then call the ‘Replace’ method in the Cosmos DB SDK to replace the document in question (see CosmosDb – Updating a Document (Partially)). If you are using SQL API and .NET or Java, you can consider using bulk support (.NET) or bulk executor (Java). Other ideas would involve using change feed, or if you really need a level of ACID consistency, you can achieve this using stored procedures, with snapshot isolation scoped to a single partition (this is not the same as stored procedures in SQL – rather these are designed specifically to support multi-doc transactions).
Also be aware that because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself. If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
What are OLTP scenarios where a relational database is essential?
Avoiding the broader topic of when to use a relational database over a non-relational database, there are a few use cases where a relational database is essential:
- The customer experience and comfort zone is with relational databases. It is a reality that relational databases are ahead in the maturity curve with respect to tooling (an example would be foreign-key constraint behavior). However, it should be noted that this is not the same as saying that “more use cases are technically better suited to the relational model”. Rather, the barrier to entry in new customer projects tends to be lower because mindshare is greater in the relational space. In these cases, it often isn’t worth the effort for companies to upskill
- The system really needs strict ACID semantics across the entire dataset. Sharded/partitioned databases like Cosmos DB will not provide ACID guarantees across the entire set of physical partitions (and likely never will). In reality, however, the use cases where this is necessary is quite small. Things like transaction management and other SDK-level things that go along with these aspects come easier in the RDBMS space, but this is really the same as above point – RDBMS is ahead on maturity curve for user-level tooling to help abstract paradigm specific concepts – but this does not make the paradigm better suited to a greater number of use cases
- Having a single data store that services both operational and analytical needs with equal utility, including tabular models – this is probably the most powerful argument, and NoSQL engines are likely never going to serve a data structure that coalesces as well into tabular models that produce reports, charts, graphs, etc. But again, history has proven that, at scale, the “one fits all” approach can have some non-trivial drawbacks. And the new Analytical Store in Cosmos DB is addressing the need to service both operational and analytical needs
You can create complex hierarchical “relationships” in Cosmos DB, which would have to be modelled in separate tables in an RDBMS. Cosmos DB can’t handle them using joins – but again, this is a paradigmatic/semantic difference, not a fundamental flaw in the database model itself. In order to do the equivalent of what one may be trying to achieve in a relational database, you may have to “unlearn what you have learned”, but this comes back to your comfort level with a RDBMS, which is not a trivial thing and can be the main and very valid reason for staying with a RDBMS.
In summary, in a NoSQL database like Cosmos DB, most use cases are covered. Some things are a little harder (due to lack of maturity in tooling), but most things are easier, many things can only be done in NoSQL (i.e. handling millions of transactions per second), and very few things cannot be done in a NoSQL database. Most NoSQL engines are characterized by having a lot more configurability, tunability, and flexibility than a RDBMS. And in many ways, that is the hardest challenge for newcomers.
The post Cosmos DB for the SQL Professional - Referencing Tables first appeared on James Serra's Blog.