The struggle to renormalize data

It is a problem that we face increasingly. We get an export of a NoSQL database, in JSON, and are asked to import the information into a relational database. It could be for further downstream processing, or for audit: It could be that the BI people need to access the information from PowerBI.

It can be a fairly trivial exercise if the part of the JSON document that you need represents what is basically a table. The OPENJSON function in SQL Server is the most obvious way of importing objects and properties from the JSON document as the more familiar rows and columns.

If that works fine, then you're lucky. Often, it isn't that easy: JSON can range from simple to highly nested and complex. It is at that point that you discover the dubious joys of 'renormalizing' what has, effectively, been denormalised.

Relational databases have a single general way of modelling relationships, though it can get more complex with relationship tables that are used to represent hierarchies or many-to-many relationships. How do NoSQL databases do it? Quoting from MongoDB's site "there's a whole rainbow's worth of ways. MongoDB has a rich and nuanced vocabulary for expressing what, in SQL, gets flattened into the term 'One-to-N'"

It always makes me nervous when I'm told that there are a lot of different ways of doing something. In many NoSQL databases, the cardinality of N makes a serious difference, as well as the direction (parent-to-child or child-to-parent). When you are denormalizing, you need to judge how big N is. You can merge tables into one collection, or into a parent collection with an array of embedded child documents. As N gets bigger, you do better to embed parent data as nested sub-document in child collection. Of course, if N is enormous you should store the reference to the parent in the child document.

To renormalize this sort of data, you need to understand the strategy that was used to get from the database schema, or object domain model, to its representation in JSON. This is even harder than it sounds because the representation of the data model that you see in the JSON document may have been chosen more for performance reasons. In other words, it will depend as much on the patterns of data access as on the logical arrangement of the data.

In short, you can't work out the underlying data model, and so renormalize the data, just by looking at the JSON. There have been ingenious attempts to assist with 'renormalization', but these are nothing more than props to make the manual process less stultifying and tedious. Therefore, before considering a 'polyglot' environment of different types of database in your organisation, it pays to have a very clear idea of the way that data is to pass between different systems, and to agree that schema upfront. The transport of data is the least of your headaches, whereas preserving the relationships within data can cause a lot of grief.

Phil Factor.