Demystifying JSON with CockroachDB… Import, Index, and Computed Columns

Overview

Recently, I created and delivered an "Advanced Developer Workshop" for Cockroach Labs. One of the topics dove into how to ingest and use JSON data.

Like many databases, CockroachDB supports a JSON data type for columns within a table. Basically, you insert a JSONB object into a row and can then filter and extract the desired data with SQL. The following simple example shows how this is done:
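A minimal sketch; the table, column, and sample data here are just illustrative:

```sql
-- Hypothetical table with a JSONB column.
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    profile JSONB
);

INSERT INTO users (profile) VALUES
    ('{"first_name": "Lola", "last_name": "McDog", "city": "NYC"}');

-- Filter and extract fields with the ->> operator.
SELECT profile->>'first_name' AS first_name,
       profile->>'city'       AS city
FROM   users
WHERE  profile->>'last_name' = 'McDog';
```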

So it is pretty straightforward to use JSONB objects within tables, but how do you load those HUGE JSON files into CockroachDB?

While I did load a 300MB JSON file into a SINGLE COLUMN/ROW as a test, it is not a good idea for performance or usability. It is best to flatten your data.

Flattening JSON data

Often, I find that JSON data can contain an ENTIRE database as a single object. Consider the following example:
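Something along these lines (a made-up document, but representative of the shape):

```json
{
  "company": "BigCo",
  "employees": [
    {"id": 1, "name": "Ann", "title": "Engineer"},
    {"id": 2, "name": "Bob", "title": "Analyst"}
  ],
  "offices": [
    {"id": 10, "city": "New York"},
    {"id": 20, "city": "San Francisco"}
  ]
}
```

Here a single JSON object embeds what are really two tables: employees and offices.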

This is a simple example, but with huge files this could be problematic and not perform as desired. To flatten this JSON, two TSV files should be created and loaded directly into separate tables.

Most languages have JSON libraries these days, but I created a simple Python script to show how this is done.
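A minimal sketch, assuming the hypothetical document shape shown above; the file names and keys would change for real data:

```python
#!/usr/bin/env python3
"""Flatten one nested JSON document into TSV files for loading."""
import csv
import json

with open("company.json") as f:
    doc = json.load(f)

# Each nested array becomes its own TSV file, destined for its own table.
for key, fields in [("employees", ["id", "name", "title"]),
                    ("offices", ["id", "city"])]:
    with open(f"{key}.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for row in doc[key]:
            writer.writerow([row[field] for field in fields])
```

The resulting employees.tsv and offices.tsv can then be loaded with IMPORT, as shown in the next section.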

Importing JSON files

As I mentioned in my previous blog regarding import, there are multiple sources that can be used to import data.

CockroachDB supports the ability to access files stored in cloud storage buckets such as Amazon S3, Azure, and Google Cloud. Additionally, CockroachDB allows data to be imported from various other URL-based data services: HTTP, NFS, and S3-compatible services.

While searching the web, I found several good examples on GitHub that have already mostly flattened data sets. This companies.json dataset has various financial information regarding global companies. The following example imports the data and begins to explore it:
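A sketch of the approach, assuming the file is newline-delimited JSON hosted at a reachable URL (the URL and table name here are placeholders). Setting the delimiter to a character that never appears in the data makes the CSV importer treat each line as a single JSONB value; if your data trips over quoting, the flattening approach above is the fallback:

```sql
CREATE TABLE companies (company JSONB);

-- Abuse the CSV importer: a delimiter that never occurs in the file
-- means every line lands intact in the single JSONB column.
IMPORT INTO companies (company)
    CSV DATA ('https://example.com/data/companies.json')
    WITH delimiter = e'\x0f';

-- Start exploring.
SELECT company->>'name' AS name,
       company->>'category_code' AS category
FROM   companies
LIMIT  10;
```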

As I continue to drill down, I run into a problem when trying to calculate the top 10 companies by acquisition spend and have to CAST the values accordingly.
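Something like the following, assuming the dataset's acquisitions array and its price_amount field:

```sql
-- Top 10 companies by total acquisition spend.
-- The ->> operator returns strings, so the values must be CAST to sum them.
SELECT company->>'name' AS name,
       SUM(CAST(a->>'price_amount' AS DECIMAL)) AS total_spend
FROM   companies,
       jsonb_array_elements(company->'acquisitions') AS a
GROUP BY name
ORDER BY total_spend DESC
LIMIT  10;
```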

Dynamically casting values like this slows down the calculations. Aggregate functions are not allowed on JSONB values and cannot benefit from optimizations like vectorized query execution. Luckily, we can use computed columns, which actually store the computed values in regular columns. CockroachDB allows you to simply alter a table on the fly to add these columns. See the example below:
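For example (the column names are mine; pick whichever fields you query most):

```sql
-- Add stored computed columns that extract and cast JSONB fields once,
-- at write time, instead of on every query.
ALTER TABLE companies
    ADD COLUMN name STRING AS (company->>'name') STORED;

ALTER TABLE companies
    ADD COLUMN num_employees INT
    AS (CAST(company->>'number_of_employees' AS INT)) STORED;
```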

Using computed columns allows us to achieve the best possible performance when running calculations over JSON data. Additionally, it allows us to use traditional indexes to speed up the calculations even further. Look at the following:
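For instance, a regular index on the computed column from above:

```sql
-- Traditional secondary index on the stored computed column.
CREATE INDEX ON companies (num_employees);

-- This query can now be served from the index rather than
-- decoding the JSONB blob for every row.
SELECT name, num_employees
FROM   companies
ORDER BY num_employees DESC
LIMIT  10;
```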

Inverted Indexes

Just like GIN indexes in Postgres, CockroachDB allows you to create inverted indexes to search JSON data. This can greatly improve queries that filter on JSON contents. Consider the following example:
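A sketch against the same hypothetical companies table:

```sql
-- Inverted index over the entire JSONB column, much like a GIN index.
CREATE INVERTED INDEX companies_company_inv ON companies (company);

-- Containment filters (@>) can be served by the inverted index.
SELECT company->>'name' AS name
FROM   companies
WHERE  company @> '{"category_code": "games_video"}';
```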

Final Thoughts

JSON is a powerful data source that integrates nicely into CockroachDB. Flatten the data and STORE the known computed columns as needed, or add them on the fly as new use cases are discovered. Avoid storing huge JSON values, but if you must, consider using a separate table or at least a column family to isolate the large column.