In my role as a Data Engineer, I have an ever-increasing need to query JSON documents from the command line interface (CLI) in these ways:
- Filtering the output from a cloud CLI to just the useful information
- Extracting the output from an API
- Evaluating JSON documents to ingest them into a database
- Reformatting information to submit to an API
- Software/data archaeology
For the use cases above, uploading JSON to a database is overkill or premature, so I use the jq command line utility to query JSON documents directly. The tool is written in C and can be installed on Windows, Linux, or macOS.
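For example, assuming you have the AWS CLI installed and configured, the one-liner below reduces a screenful of list-buckets JSON to just the bucket names:

# Pipe cloud CLI output straight into jq; -r prints the raw strings
aws s3api list-buckets | jq -r '.Buckets[].Name'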
Installing jq
The Windows installation is the most involved as it depends on how you intend to run it. As I use macOS and Linux, I can only offer approximate guidance on this. I have a tool called Parallels Desktop on my Mac, which allows me to run Windows 11.
To install jq I used the steps outlined below.
- Download jq-win64.exe from the most recent release at https://github.com/jqlang/jq/releases. Depending on your machine's processor you may have to choose jq-windows-amd64.exe
- Rename the downloaded exe to jq.exe
- Create a folder for your app. I used C:\program files\jq
- Copy jq.exe to your new folder
- Open up your Windows system properties/environment variables and add your new folder to the PATH environment variable
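If you just want to try jq out before editing your environment variables, you can add the folder to the PATH for the current Command Prompt session only; the setting disappears when you close the window.

:: Add the folder to the PATH for this Command Prompt session only
set PATH=%PATH%;C:\program files\jq
jq --version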
For MacBooks it is just a case of using the Homebrew package manager:
brew install jq
To check that everything is working, start a new command shell window and type:
jq --version
# This should bring back something similar to
# jq-1.7.1
Helpful resources for learning jq
- Official manual for jq
- JQ Play provides an online test pad for you to develop jq expressions
- JQ Recipes. Two years old, but it has some useful examples of the more powerful jq functions
Using jq to pretty print a JSON file
Let us suppose we extract a small fragment of JSON from an application log. As you can see, it isn't easy to absorb what it is telling you, though some of you might recognise it as part of an AWS S3 PUT event.
"s3": {"s3SchemaVersion": "1.0","configurationId": "testConfigRule","bucket": {"name": "example-bucket","ownerIdentity": {"principalId": "EXAMPLE"},"arn": "arn:aws:s3:::example-bucket"},"object": {"key": "test0.000000key","size": 1024,"eTag": "0123456789abcdef0123456789abcdef","sequencer": "0A1B2C3D4E5F678901"}}}
For the sake of these examples, rather than pipe the results of a cloud command line call into jq, we are going to store the fragment in a file called logfragment.json. On Windows you can type the following into your command window:
type logfragment.json|jq

# or as an alternative
jq '.' logfragment.json
This will pretty print your logfragment.json file. As you can see, it is much more readable.
{ "s3": { "s3SchemaVersion": "1.0", "configurationId": "testConfigRule", "bucket": { "name": "example-bucket", "ownerIdentity": { "principalId": "EXAMPLE" }, "arn": "arn:aws:s3:::example-bucket" }, "object": { "key": "test0.000000key", "size": 1024, "eTag": "0123456789abcdef0123456789abcdef", "sequencer": "0A1B2C3D4E5F678901" } } }
Using jq to strip down a JSON file
JSON documents can be verbose, yet for diagnostic purposes we may need only a handful of attributes.
Looking at our example logfragment.json file, we are only interested in three attributes:
- The bucket name
- The key within the bucket
- The size of the object referenced by the key
The jq application allows us to select just the items we are interested in. In the example below we take those three attributes and render them in a far simpler document.
jq '.s3|{"name":.bucket.name, "key":.object.key, "size":.object.size}' logfragment.json
The pipe symbol allows us to chain jq operations together.
- Extract the s3 node. Then, below this node:
- Assign a key called "name" and get the value of the relative node path .bucket.name
- Next, assign a key called "key" and get the value of the relative node path .object.key
- Finally, assign a key called "size" and get the value of the relative node path .object.size
We could have written our jq query referencing the full path of the attributes as follows.
jq '{"name":.s3.bucket.name, "key":.s3.object.key, "size":.s3.object.size}' logfragment.json
The document produced from either form is shown below.
{ "name": "example-bucket", "key": "test0.000000key", "size": 1024 }
We don't have to produce output in JSON format; we could ask for just the values.
# The default is to wrap strings in double-quotes
jq '.s3|.bucket.name, .object.key, .object.size' logfragment.json

# If we just wanted the "raw" values
jq -r '.s3|.bucket.name, .object.key, .object.size' logfragment.json
Using jq to produce a CSV file
We can ask jq to produce our output in CSV format. To use the @csv operator our input must be an array, so as we have a single record we have to wrap our extracted fields in square brackets to convert them into a one-line array.
# Just the csv data row
jq -r '[.s3|.bucket.name, .object.key, .object.size]|@csv' logfragment.json

# CSV with headers
jq -r '["name","key","size"],[.s3|.bucket.name, .object.key, .object.size]|@csv' logfragment.json
In the second example above we are creating a UNION of two arrays:
- The first array holds our column names
- The second array holds our data
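The @csv operator is one of several encoding filters. If tab-separated output suits your downstream tooling better, @tsv works in exactly the same way:

# Tab-separated variant of the query above
jq -r '["name","key","size"],[.s3|.bucket.name, .object.key, .object.size]|@tsv' logfragment.json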
Using jq to find the JSON keys at a particular level
When I have a large JSON object my starting point is to look at what keys exist. This gives me an indication of the objects contained in the JSON document. For our logfragment.json we scarcely need such a technique, but it works as follows.
jq '.s3|keys' logfragment.json

# This will produce the following
#
# [
#   "bucket",
#   "configurationId",
#   "object",
#   "s3SchemaVersion"
# ]
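Note that keys returns the key names sorted alphabetically. If you would rather see them in the order they appear in the document, use keys_unsorted instead:

# Preserve the order in which the keys appear in the document
jq '.s3|keys_unsorted' logfragment.json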
Using jq to extract from an array
So far we have simply flattened and simplified a small JSON fragment, though you can apply the same technique to a much larger document. What we will try next is to extract data from an array and its parent. For this I will use a simple document, which I will store in a file called orders.json, purely to illustrate the concept.
{ "orderId": 123, "orderDate": "2024-01-20", "orderItems": [ { "productCode": "KYB00001", "price": 102.50, "quantity": 1, "discountPercentage": 0.0 }, { "productCode": "CBL00002", "price": 2.50, "quantity": 100, "discountPercentage": 10.0 } ], "customerId":"7a27135f-4b2f-4ade-8754-2b001639418b" }
We want the order header information to repeat for each order item and to perform a simple calculation for the total order item price. We can do this with the following jq expression.
jq -r '.customerId as $custId
|.orderId as $Id
|.orderDate as $PurchaseDate
|[
   "CustomerNumber",
   "OrderNumber",
   "OrderDate",
   "ProductCode",
   "UnitPrice",
   "Quantity",
   "ItemPrice"
 ],
 (.orderItems[]|
   [
     $custId,
     $Id,
     $PurchaseDate,
     .productCode,
     .price,
     .quantity,
     (.price * .quantity * (1 - .discountPercentage / 100))
   ]
 )
 |@csv' orders.json
Our jq expression above illustrates some interesting concepts.
- A jq expression can be split over many lines to aid readability
- We can assign a value to a variable at the parent level to be used later
- We can perform calculations
Our output will appear as follows
"CustomerNumber","OrderNumber","OrderDate","ProductCode","UnitPrice","Quantity","ItemPrice" "7a27135f-4b2f-4ade-8754-2b001639418b",123,"2024-01-20","KYB00001",102.50,1,102.5 "7a27135f-4b2f-4ade-8754-2b001639418b",123,"2024-01-20","CBL00002",2.50,100,225
Building your library of jq expressions
So far the jq expressions I have shown you have been typed in longhand though, just like the useful SQL queries a DBA has at their disposal, it would be handy if these could be stored in files. Fortunately, just as sqlcmd has the -i switch, jq has -f or --from-file.
Store everything within, but excluding, the single quotes in the previous jq expression in a file called orders.jq. Now you can run the following and get the same result as before.
jq -rf orders.jq orders.json
The first file contains our jq expression, the second contains the JSON we wish to run it against.
Think of a jq expression stored in a file as being akin to a stored procedure such as the sp_blitz procs SQL Server DBAs know and love.
Again, just as you would pass arguments to SQLCMD using the -v switch, jq has the --arg and even --argjson switches.
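As a minimal sketch of what that looks like, the query below uses --argjson to pass a quantity threshold into the expression and the built-in select function to filter the order items against it. The threshold value and the $minQty variable name are purely illustrative.

# Only show product codes for order items at or above the threshold
jq -r --argjson minQty 50 '.orderItems[]|select(.quantity >= $minQty)|.productCode' orders.json
# CBL00002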
Closing thoughts
There are times when ingesting JSON into a database is absolutely the right thing to do. The jq utility is useful when what we need is a quick and dirty extraction.
I have only given you the basics of what jq can do. Believe me, this barely scratches the surface, though for the more complicated interactions I think SQL Server JSON functionality may be more appropriate.
There is a yq utility too, which can handle JSON, YAML and XML. For JSON processing it doesn't support everything that jq can, though, depending on your use case, it may fit your needs. As there are a lot of web and database tools that use large YAML files for configuration, I have found it similarly useful.
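As a hypothetical example, assuming the Go implementation of yq and a typical docker-compose.yml, listing the services defined in the file is a one-liner (the Python yq wrapper accepts the same basic path syntax):

# List the top-level service names in a compose file
yq '.services|keys' docker-compose.yml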
One use case I should mention is the use of jq/yq syntax when putting together ETL mapping specifications where a data source is a JSON source. This is a subject on which I hope to write more at a later date.
For Python fans, both the pyjq and yq libraries allow Python applications to use jq and/or yq.
Where these utilities really come into their own is when using them in shell scripting (bash, PowerShell etc.). This is especially true in resource constrained environments such as GitHub Actions, where small, efficient, but powerful tooling is a godsend.
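To close with a minimal sketch of that idea, the bash snippet below fails a CI step when any order item in our orders.json is missing a product code. The rule itself is purely illustrative.

#!/usr/bin/env bash
set -euo pipefail

# Count the order items that have no productCode attribute
missing=$(jq '[.orderItems[]|select(.productCode == null)]|length' orders.json)

if [ "$missing" -gt 0 ]; then
  echo "Found $missing order item(s) without a product code" >&2
  exit 1
fi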