Importing JSON Collections into SQL Server

Microsoft introduced native support for JSON in SQL Server in version 2016. In this article, Phil Factor shows how to import JSON documents into SQL Server database tables, even if there is a bit of complexity in the original data.

It is fairly easy to import JSON collections of documents into SQL Server if there is an underlying ‘explicit’ table schema available to them. If each of the documents has a different schema, then you have little chance. Fortunately, schema-less data collections are rare.

In this article we’ll start simply and work through a couple of examples before ending by creating a SQL Server database schema with ten tables, constraints and keys. Once those are in place, we’ll import a single JSON document, filling the ten tables with the data of 70,000 fake records from it.

Let’s start this gently, putting simple collections into strings which we will insert into a table. We’ll then try slightly trickier JSON documents with embedded arrays and so on. We’ll start by using the example of sheep-counting words, collected from many different parts of Great Britain and Brittany. The simple aim is to put them into a table. I use sheep-counting words not because they are of general importance, but because they can represent whatever data you are trying to import.

You will need access to SQL Server 2016 or later, or Azure SQL Database or Azure SQL Data Warehouse, to play along, and you can download the data and code from GitHub.

Converting Simple JSON Arrays of Objects to Table-sources

We will start off by creating a simple table that we want to import into.
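A minimal sketch of such a table might look like this; the table and column names here are my choice for illustration, and the full build script is in the GitHub download.

CREATE TABLE dbo.SheepCount
  (
  Region NVARCHAR(40) NOT NULL,  -- where the words were collected
  Number INT NOT NULL,           -- the number being counted
  Word NVARCHAR(30) NOT NULL,    -- the dialect word for that number
  CONSTRAINT PK_SheepCount PRIMARY KEY (Region, Number)
  );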

We then choose a simple JSON format.
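Something like this would do: a plain array of objects whose keys match the columns of the table (the sample data in the GitHub download follows the same pattern).

[
  { "Region": "Wensleydale", "Number": 1, "Word": "Yain" },
  { "Region": "Wensleydale", "Number": 2, "Word": "Tain" },
  { "Region": "Wensleydale", "Number": 3, "Word": "Eddero" }
]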

We can very easily use OpenJSON to create a table-source that reflects the contents.
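Here is a sketch of the query, assuming the JSON above is in a variable called @JSONData:

DECLARE @JSONData NVARCHAR(MAX) =
  N'[{ "Region": "Wensleydale", "Number": 1, "Word": "Yain" },
     { "Region": "Wensleydale", "Number": 2, "Word": "Tain" }]';

SELECT Region, Number, Word
  FROM OPENJSON(@JSONData)
    WITH (Region NVARCHAR(40), Number INT, Word NVARCHAR(30));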

Once you have a table source, the quickest way to insert JSON into a table will always be the straight insert, even after an existence check. It is a good practice to make the process idempotent by only inserting the records that don’t already exist. I’ll use the MERGE statement just to keep things simple, though the left outer join with a null check is faster. The MERGE is often more convenient because it will accept a table-source such as a result from the OpenJSON function. We’ll create a temporary procedure to insert the JSON data into the table.
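A minimal sketch of such a temporary procedure, using the table and column names assumed above (the version in the download may differ in detail):

CREATE PROCEDURE #SaveJSONSheepCount
  @JSONData NVARCHAR(MAX)
AS
MERGE dbo.SheepCount AS target
USING
  (SELECT Region, Number, Word
     FROM OPENJSON(@JSONData)
       WITH (Region NVARCHAR(40), Number INT, Word NVARCHAR(30))
  ) AS source
ON target.Region = source.Region AND target.Number = source.Number
WHEN NOT MATCHED BY TARGET THEN
  INSERT (Region, Number, Word)
  VALUES (source.Region, source.Number, source.Word);
GO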

Now we try it out. Let’s assemble a couple of simple JSON strings from a table-source.
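For instance, we can use FOR JSON PATH against a VALUES table-source to build the strings. The variable names are just for this example; the counting words are from the collected data.

DECLARE @Wensleydale NVARCHAR(MAX) =
  (SELECT Region, Number, Word
     FROM (VALUES ('Wensleydale', 1, 'Yain'),
                  ('Wensleydale', 2, 'Tain'),
                  ('Wensleydale', 3, 'Eddero')
          ) AS SheepCount (Region, Number, Word)
    FOR JSON PATH);

DECLARE @Borrowdale NVARCHAR(MAX) =
  (SELECT Region, Number, Word
     FROM (VALUES ('Borrowdale', 1, 'Yan'),
                  ('Borrowdale', 2, 'Tyan'),
                  ('Borrowdale', 3, 'Tethera')
          ) AS SheepCount (Region, Number, Word)
    FOR JSON PATH);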

Now we can EXECUTE the procedure to store the Sheep-Counting Words in the table.
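Continuing in the same batch as the DECLAREs above:

EXECUTE #SaveJSONSheepCount @JSONData = @Wensleydale;
EXECUTE #SaveJSONSheepCount @JSONData = @Borrowdale;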

Check to see that they were imported correctly by running this query:
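Against the sketch table above, something like:

SELECT Region, Number, Word
  FROM dbo.SheepCount
  ORDER BY Region, Number;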

Converting JSON Arrays of Objects that have Embedded Arrays to Table-sources

What if you want to import the sheep-counting words from several regions? So far, what we’ve been doing is fine for a collection that models a single table. However, real life isn’t like that. Not even Sheep-Counting Words are like that. A little internalized Chris Date will be whispering in your ear that there are two relations here, a region and the name for a number.

Your JSON for a database of sheep-counting words will more likely look like this (I’ve just reduced it to two numbers in the sequence array rather than the original twenty). Each JSON document in our collection has an embedded array.
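Here is an illustration of the shape I mean. The key names (‘region’ and ‘sequence’) are the ones I’ll assume in the examples that follow; the real file is in the GitHub download.

[
  {
    "region": "Wensleydale",
    "sequence": [ { "number": 1, "word": "Yain" },
                  { "number": 2, "word": "Tain" } ]
  },
  {
    "region": "Borrowdale",
    "sequence": [ { "number": 1, "word": "Yan" },
                  { "number": 2, "word": "Tyan" } ]
  }
]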

After a bit of thought, we remember that the OpenJSON function actually allows you to put a JSON value in a column of the result. This means that you just need to CROSS APPLY each embedded array, passing to the ‘cross-applied’ OpenJSON function the JSON fragment representing the array, which it will then parse for you.
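A sketch of the query, assuming the key names shown above:

DECLARE @JSONData NVARCHAR(MAX) =
  N'[{"region":"Borrowdale",
      "sequence":[{"number":1,"word":"Yan"},{"number":2,"word":"Tyan"}]}]';

SELECT TheRegions.region, TheWords.number, TheWords.word
  FROM OPENJSON(@JSONData)
    WITH (region NVARCHAR(40),
          sequence NVARCHAR(MAX) '$.sequence' AS JSON) AS TheRegions
    CROSS APPLY OPENJSON(TheRegions.sequence)
      WITH (number INT, word NVARCHAR(30)) AS TheWords;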

I haven’t found the fact documented anywhere, but you can leave out the path elements from the column declaration of the WITH statement if the columns are exactly the same as the JSON keys, with matching case.

The ability to drill into sub-arrays by cross-applying OpenJSON function calls allows us to easily insert a large collection of documents that have embedded arrays. This is looking a lot more like something that could, for example, tackle the import of a MongoDB collection, as long as it was exported as a document array with commas between documents. I’ll include, with the download on GitHub, the JSON file that contains all the sheep-counting words that have been collected. Here is the updated stored procedure:
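The version in the download does a little more, but a minimal sketch of it, with my assumed table and key names, would be:

CREATE PROCEDURE #SaveJSONSheepCountRegions
  @JSONData NVARCHAR(MAX)
AS
MERGE dbo.SheepCount AS target
USING
  (SELECT TheRegions.region, TheWords.number, TheWords.word
     FROM OPENJSON(@JSONData)
       WITH (region NVARCHAR(40),
             sequence NVARCHAR(MAX) '$.sequence' AS JSON) AS TheRegions
       CROSS APPLY OPENJSON(TheRegions.sequence)
         WITH (number INT, word NVARCHAR(30)) AS TheWords
  ) AS source
ON target.Region = source.region AND target.Number = source.number
WHEN NOT MATCHED BY TARGET THEN
  INSERT (Region, Number, Word)
  VALUES (source.region, source.number, source.word);
GO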

We can now very quickly ingest the whole collection into our table, pulling the data in from the file included with the download on GitHub, so you can try it out. There are thirty-three different regions in the JSON file.
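Reading the file and calling the procedure is a one-liner apiece. The file path and name here are assumptions; adjust them to wherever you put the download, and use SINGLE_NCLOB instead if your copy of the file is UTF-16.

DECLARE @JSON NVARCHAR(MAX) =
  (SELECT BulkColumn
     FROM OPENROWSET (BULK 'C:\data\SheepCountingWords.json', SINGLE_CLOB) AS TheFile);

EXECUTE #SaveJSONSheepCountRegions @JSONData = @JSON;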

We can now check that it is all in and correct.
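A quick aggregation over the assumed table shows whether every region arrived with its full complement of words:

SELECT Region, COUNT(*) AS WordsStored
  FROM dbo.SheepCount
  GROUP BY Region
  ORDER BY Region;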

Giving …

Just as a side-note, this data was collected for this article from various places on the internet, but mainly from Yan Tan Tethera. Each table was pasted into Excel and tidied up. The JSON code was created by using three simple functions: one for the cell-level value, one for the row value, and a final summation. This allowed simple adding, editing and deleting of data items. The technique is only suitable where columns are of fixed length.

Importing a More Complex JSON Data Collection into a SQL Server Database

We have successfully imported the very simplest JSON files into SQL Server. Now we need to consider those cases where the JSON document or collection represents more than one table.

In any relational database, we can take two approaches to JSON data: we can accommodate it, meaning we treat it as an ‘atomic’ unit and store the JSON unprocessed, or we can assimilate it, meaning that we turn the data into a relational format that can be easily indexed and accessed.

  • To accommodate JSON, we store it as a CLOB, usually NVARCHAR(MAX), with extra columns containing the extracted values for the data fields with which you would want to index the data. This is fine where all the database has to do is to store an application object without understanding it.
  • To assimilate JSON, we need to extract all the JSON data and store it in a relational form.

Our example represents a very simple customer database with ten linked tables. We will first accommodate the JSON document by creating a table (dbo.JSONDocuments) that merely stores, in each row, the reference to the customer, along with all the information about that customer, each aspect (addresses, phones, email addresses and so on) in separate columns as CLOB JSON strings.

We then use this table to successively assimilate each JSON column into the relational database.

This means that we need to parse the full document only once.

To be clear about the contents of the JSON file, we will be cheating by using spoof data. We would never have unencrypted personal information in a database or a JSON file. Credit Card information would never be unencrypted. This data is generated entirely by SQL Data Generator, and the JSON collection contains 70,000 documents. The method of doing it is described here.

We’ll make other compromises. We’ll have no personal identifiers either. We will simply use the document order. In reality, the JSON would store the surrogate key of person_id.

The individual documents will look something like this.
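I can’t reproduce a real one here, but the general shape is roughly the following. The key names are illustrative assumptions that I’ll reuse in the sketches below; the real collection is in the GitHub download.

{
  "Name": { "Title": "Mr", "FirstName": "...", "LastName": "...", "FullName": "..." },
  "Addresses": [ { "TypeOfAddress": "Home", "Start_date": "...", "End_date": null,
                   "AddressLine1": "...", "City": "...", "PostCode": "..." } ],
  "CreditCards": [ { "CardNumber": "...", "ValidFrom": "...", "ValidTo": "...", "CVC": "..." } ],
  "EmailAddresses": [ { "EmailAddress": "..." } ],
  "Notes": [ { "Text": "...", "Date": "..." } ],
  "Phones": [ { "TypeOfPhone": "Mobile", "DiallingNumber": "...",
                "Dates": [ { "Start_date": "...", "End_date": null } ] } ]
}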

We will import this into a SQL Server database designed like this:

The build script is included with the download on GitHub.

So, all we need now is the batch to import the JSON file that contains the collection and populate the tables with the data. We will now describe the individual parts of the batch.

We start out by reading the customersUTF16.json file into a variable.
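A sketch, assuming the file is sitting in C:\data (adjust the path to suit); SINGLE_NCLOB is used because the file is UTF-16:

DECLARE @JSON NVARCHAR(MAX) =
  (SELECT BulkColumn
     FROM OPENROWSET (BULK 'C:\data\customersUTF16.json', SINGLE_NCLOB) AS TheFile);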

The next step is to create a table at the document level, with the main arrays within each document represented by columns. (In some cases, there are sub-arrays; the phone numbers, for example, have an array of dates.) This means that this initial slicing of the JSON collection needs to be done only once. In our case, the columns are as follows (a sketch of the table appears after the list):

  • The details of the Name,
  • Addresses,
  • Credit Cards,
  • Email Addresses,
  • Notes,
  • Phone numbers
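Here is a minimal sketch of such a staging table. The column names are my assumptions for illustration; the actual build script is in the GitHub download.

IF OBJECT_ID('dbo.JSONDocuments') IS NULL
  CREATE TABLE dbo.JSONDocuments
    (
    Document_id INT NOT NULL PRIMARY KEY, -- the document order, doubling as person_id
    FullName NVARCHAR(100) NULL,          -- extracted for convenience
    Name NVARCHAR(MAX) NULL,              -- JSON: the details of the name
    Addresses NVARCHAR(MAX) NULL,         -- JSON: array of addresses
    CreditCards NVARCHAR(MAX) NULL,       -- JSON: array of credit cards
    EmailAddresses NVARCHAR(MAX) NULL,    -- JSON: array of email addresses
    Notes NVARCHAR(MAX) NULL,             -- JSON: array of notes
    Phones NVARCHAR(MAX) NULL             -- JSON: array of phones, each with a dates sub-array
    );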

We fill this table via a call to OpenJSON. By doing this, we have the main details of each customer available to us when slicing up embedded arrays. The batch is designed so it can be rerun and should be idempotent. This means that there is less of a requirement to run the process in a single transaction.

Now we fill this table with a row for each document, each representing the entire data for a customer. Each item of root data, such as the id and the customer’s full name, is held as a column. All other columns hold JSON. This table will be an ‘accommodation’ to the JSON data, in that each row represents a customer, but each JSON document in the collection is shredded to provide a JSON string that represents the attributes and relations of that customer. We can now assimilate this data step by step.
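Still in the same batch as the DECLARE above, here is a sketch of the fill. It uses the default OPENJSON schema so that the array index (the [key] column) gives us the document order, and the key paths follow the document shape assumed earlier. Refilling the staging table from scratch keeps the step idempotent.

DELETE FROM dbo.JSONDocuments;  -- the staging table is simply rebuilt on each run

INSERT INTO dbo.JSONDocuments
  (Document_id, FullName, Name, Addresses, CreditCards, EmailAddresses, Notes, Phones)
SELECT CONVERT(INT, EachDocument.[key]) + 1,              -- document order as the id
       JSON_VALUE(EachDocument.value, '$.Name.FullName'),
       JSON_QUERY(EachDocument.value, '$.Name'),
       JSON_QUERY(EachDocument.value, '$.Addresses'),
       JSON_QUERY(EachDocument.value, '$.CreditCards'),
       JSON_QUERY(EachDocument.value, '$.EmailAddresses'),
       JSON_QUERY(EachDocument.value, '$.Notes'),
       JSON_QUERY(EachDocument.value, '$.Phones')
  FROM OPENJSON(@JSON) AS EachDocument;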

First we need to create an entry in the person table if it doesn’t already exist, as that has the person_id. We need to do this first because otherwise the foreign key constraints will protest.
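A sketch, assuming a dbo.Person table keyed on person_id with the usual name columns (if person_id is an IDENTITY column in your build, you would need SET IDENTITY_INSERT around this):

INSERT INTO dbo.Person (person_id, Title, FirstName, LastName)
SELECT Docs.Document_id,
       JSON_VALUE(Docs.Name, '$.Title'),
       JSON_VALUE(Docs.Name, '$.FirstName'),
       JSON_VALUE(Docs.Name, '$.LastName')
  FROM dbo.JSONDocuments AS Docs
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Person
      WHERE Person.person_id = Docs.Document_id);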

Now we do the notes. We’ll tackle these first because they are a bit awkward: there is a many-to-many relationship between notes and people, because the same standard note (an overdue invoice payment, for example) can be associated with many customers. We’ll use a table variable to allow us to guard against inserting duplicate records.
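A sketch of the approach, assuming tables called dbo.Note (the text of each distinct note) and dbo.NotePerson (the association), and the note keys from the document sketch:

DECLARE @Notes TABLE
  (person_id INT, NoteText NVARCHAR(MAX), InsertionDate DATETIME2);

INSERT INTO @Notes (person_id, NoteText, InsertionDate)
SELECT Docs.Document_id, TheNotes.[Text], TheNotes.[Date]
  FROM dbo.JSONDocuments AS Docs
    CROSS APPLY OPENJSON(Docs.Notes)
      WITH ([Text] NVARCHAR(MAX) '$.Text', [Date] DATETIME2 '$.Date') AS TheNotes;

-- any note text we haven't seen before
INSERT INTO dbo.Note (Note)
SELECT DISTINCT N.NoteText
  FROM @Notes AS N
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Note WHERE Note.Note = N.NoteText);

-- associate each person with each of their notes, once only
INSERT INTO dbo.NotePerson (person_id, Note_id, InsertionDate)
SELECT N.person_id, TheNote.Note_id, N.InsertionDate
  FROM @Notes AS N
    JOIN dbo.Note AS TheNote ON TheNote.Note = N.NoteText
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.NotePerson
      WHERE NotePerson.person_id = N.person_id
        AND NotePerson.Note_id = TheNote.Note_id);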

Addresses are complicated because they involve three tables. There is the address, which is the physical place; the abode, which records when and why the person was associated with the place; and a third table which constrains the type of abode. We create a table variable to support the various queries without any extra shredding.
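The sketch below assumes tables called dbo.Address, dbo.Abode and dbo.AddressType, with the address keys from the document sketch; the real build script on GitHub will differ in detail.

DECLARE @Addresses TABLE
  (person_id INT, TypeOfAddress NVARCHAR(40), Start_date DATE, End_date DATE,
   AddressLine1 NVARCHAR(60), City NVARCHAR(40), PostCode NVARCHAR(20));

INSERT INTO @Addresses
SELECT Docs.Document_id, A.TypeOfAddress, A.Start_date, A.End_date,
       A.AddressLine1, A.City, A.PostCode
  FROM dbo.JSONDocuments AS Docs
    CROSS APPLY OPENJSON(Docs.Addresses)
      WITH (TypeOfAddress NVARCHAR(40), Start_date DATE, End_date DATE,
            AddressLine1 NVARCHAR(60), City NVARCHAR(40), PostCode NVARCHAR(20)) AS A;

-- any new types of abode
INSERT INTO dbo.AddressType (TypeOfAddress)
SELECT DISTINCT A.TypeOfAddress FROM @Addresses AS A
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.AddressType WHERE AddressType.TypeOfAddress = A.TypeOfAddress);

-- the physical places
INSERT INTO dbo.Address (AddressLine1, City, PostCode)
SELECT DISTINCT A.AddressLine1, A.City, A.PostCode FROM @Addresses AS A
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Address
      WHERE Address.AddressLine1 = A.AddressLine1 AND Address.PostCode = A.PostCode);

-- the abode ties the person to the place for a period, with a type
INSERT INTO dbo.Abode (person_id, Address_id, TypeOfAddress, Start_date, End_date)
SELECT A.person_id, Addr.Address_id, A.TypeOfAddress, A.Start_date, A.End_date
  FROM @Addresses AS A
    JOIN dbo.Address AS Addr
      ON Addr.AddressLine1 = A.AddressLine1 AND Addr.PostCode = A.PostCode
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Abode
      WHERE Abode.person_id = A.person_id AND Abode.Address_id = Addr.Address_id);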

Credit cards are much easier since they are a simple sub-array.
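A sketch, assuming a dbo.CreditCard table and the card keys from the document sketch:

INSERT INTO dbo.CreditCard (person_id, CardNumber, ValidFrom, ValidTo, CVC)
SELECT Docs.Document_id, TheCards.CardNumber, TheCards.ValidFrom, TheCards.ValidTo, TheCards.CVC
  FROM dbo.JSONDocuments AS Docs
    CROSS APPLY OPENJSON(Docs.CreditCards)
      WITH (CardNumber NVARCHAR(20), ValidFrom DATE, ValidTo DATE, CVC CHAR(3)) AS TheCards
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.CreditCard
      WHERE CreditCard.person_id = Docs.Document_id
        AND CreditCard.CardNumber = TheCards.CardNumber);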

Email Addresses are also simple. We’re on the downhill slopes now.
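And the sketch for email addresses, assuming a dbo.EmailAddress table:

INSERT INTO dbo.EmailAddress (person_id, EmailAddress)
SELECT Docs.Document_id, TheEmails.EmailAddress
  FROM dbo.JSONDocuments AS Docs
    CROSS APPLY OPENJSON(Docs.EmailAddresses)
      WITH (EmailAddress NVARCHAR(120)) AS TheEmails
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.EmailAddress
      WHERE EmailAddress.person_id = Docs.Document_id
        AND EmailAddress.EmailAddress = TheEmails.EmailAddress);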

Now we add these customers’ phones. The various dates for the start and end of the use of the phone number are held in a sub-array within the individual phone objects. That makes things slightly more awkward.
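A sketch of the double CROSS APPLY needed, assuming a dbo.Phone table and the phone keys from the document sketch:

INSERT INTO dbo.Phone (person_id, TypeOfPhone, DiallingNumber, Start_date, End_date)
SELECT Docs.Document_id, ThePhones.TypeOfPhone, ThePhones.DiallingNumber,
       TheDates.Start_date, TheDates.End_date
  FROM dbo.JSONDocuments AS Docs
    CROSS APPLY OPENJSON(Docs.Phones)
      WITH (TypeOfPhone NVARCHAR(20), DiallingNumber NVARCHAR(20),
            Dates NVARCHAR(MAX) '$.Dates' AS JSON) AS ThePhones
    CROSS APPLY OPENJSON(ThePhones.Dates)
      WITH (Start_date DATE, End_date DATE) AS TheDates
  WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Phone
      WHERE Phone.person_id = Docs.Document_id
        AND Phone.DiallingNumber = ThePhones.DiallingNumber
        AND Phone.Start_date = TheDates.Start_date);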

Conclusion

JSON support was a long time coming to SQL Server, but now that we have it, it opens up several possibilities.

No SQL Server Developer or admin needs to rule out using JSON for ETL (Extract, Transform, Load) processes to pass data between JSON-based document databases and SQL Server. The features that SQL Server has are sufficient, and far easier to use than the SQL Server XML support.

A typical SQL Server database is far more complex than the simple example used in this article, but it is certainly not an outrageous idea that a database could have its essential static data drawn from JSON documents: these are more versatile than VALUES clauses and more efficient than individual INSERT statements.

I’m inclined to smile on the idea of transferring data between the application and database as JSON. It is usually easier for front-end application programmers, and we Database folks can, at last, do all the checks and transformations to accommodate data within the arcane relational world, rather than insist on the application programmer doing it. It will also decouple the application and database to the extent that the two no longer would need to shadow each other in terms of revisions.

JSON collections of documents represent an industry-standard way of transferring data. It is today’s CSV, and it is good to know that SQL Server can support it.