Tools for JSON and JSON Lines

Here are some tools and programming language libraries for working with the JSON and JSON Lines formats, including the files used by NCBI data packages.

How do I convert NCBI datasets reports from JSON and JSON Lines formats into tabular formats?

In addition to datasets, NCBI provides the dataformat command line tool. Use this tool to convert NCBI datasets reports from JSON and JSON Lines formats into other formats. See our guide, Working with data reports, for some examples.

How do I reformat JSON and JSON Lines data to make the output easier to read?

A commonly used tool for processing JSON and JSON Lines is jq. By default, the tool will pretty-print its output, reformatting the result with indentation and colors to improve readability.

To pretty-print input without any other transformation, invoke jq without arguments as part of a Unix shell pipeline, or invoke it with two arguments: the filter . (the identity transformation) and an input filename. For example:

$ cat ncbi_dataset/data/data_report.jsonl | jq
$ jq < ncbi_dataset/data/data_report.jsonl
$ jq . ncbi_dataset/data/data_report.jsonl
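
Where jq is not installed, similar pretty-printing can be approximated with Python's standard json module. The following is a minimal sketch, not part of any NCBI tool: the function name pretty_print_jsonl and the sample record are hypothetical, and the output is indented but not colorized.

```python
import json


def pretty_print_jsonl(lines, indent=2):
    """Return one indented JSON string per non-blank input line."""
    return [
        json.dumps(json.loads(line), indent=indent)
        for line in lines
        if line.strip()  # skip blank lines rather than failing on them
    ]


# Hypothetical sample record; a real report would be read from a file,
# for example: open("ncbi_dataset/data/data_report.jsonl")
sample = ['{"symbol": "A2M", "tax_id": 9606}\n']
for text in pretty_print_jsonl(sample):
    print(text)
```

Because an open file is an iterable of lines, the same function accepts a file handle in place of the sample list.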

How do I process arbitrary JSON and JSON Lines data on the command line?

There are many tools for processing JSON and JSON Lines data. One commonly used tool is jq. This tool is open source, available for many platforms, often included in modern Linux distributions, and described as the JSON equivalent of grep, sed, and awk. Note that although jq is reasonably easy for simple transformations, it may be exceptionally difficult to grasp for even moderately complex operations. Additional tools can also be found by consulting lists such as the Awesome JSON curated list of JSON tools and libraries.

In addition to jq, you may use various tabular and multi-format tools for conversions and transformations. Some of these include:

  • xsv : Supports CSV/TSV.
  • csvtk : Supports CSV/TSV; limited export of JSON.
  • csvkit : Supports CSV/TSV; limited import/export of JSON.
  • Miller : Supports JSON, JSON Lines, CSV/TSV, and other formats.
  • datamash : Supports TSV.

You may also search the web for combinations of terms such as JSON, jq, grep, sed, and awk to find guidance on basic tools for processing JSON and JSON Lines.
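
For a quick one-off conversion when none of these tools is at hand, a few lines of Python can turn flat JSON Lines records into TSV. This is only a sketch under stated assumptions: the function name jsonl_to_tsv and the field names are hypothetical, only top-level fields are handled, and nested values are not flattened. For NCBI data reports, dataformat remains the intended converter.

```python
import csv
import io
import json


def jsonl_to_tsv(lines, fields):
    """Convert JSON Lines records to TSV text, keeping only the named top-level fields."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)  # header row
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        # Missing fields become empty cells.
        writer.writerow([record.get(field, "") for field in fields])
    return out.getvalue()


# Hypothetical records and field names, for illustration only.
rows = ['{"gene_id": 2, "symbol": "A2M"}\n', '{"gene_id": 7157}\n']
print(jsonl_to_tsv(rows, ["gene_id", "symbol"]), end="")
```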

How do I view JSON and JSON Lines data interactively?

If you prefer a more visual and interactive way of viewing JSON and JSON Lines data, you may want to use other tools such as:

  • Dadroit JSON Viewer: a private-party graphical JSON exploration tool. The vendor offers a free license for non-commercial use.
  • VisiData: an open-source text-based interactive data exploration tool.
  • A programmer’s code editor, several of which support JSON and JSON Lines pretty-printing with code folding to collapse/expand nested structures.
  • One of the many other open source or commercial JSON viewers. Note that some tools do not support the JSON Lines format, despite its very close similarity to JSON.

The examples illustrated below assume the downloaded reports from Working with data reports.

Using Dadroit JSON Viewer, open a JSON Lines data report to show the report as a collapsed tree. Click on the + plus symbol to expand any of the nodes and view the contents.

In this example, you can see the genomic range (location on the genome) for the human alpha-2-macroglobulin gene on one of the assemblies where it is placed.

[Figure: Dadroit JSON Viewer showing gene data report]

Using VisiData, open a JSON Lines data report to see a tabular view. The example shows a gene data report with a set of genes. Move your cursor down to the row for the human alpha-2-macroglobulin gene, and then use the drill-down and record expansion features to view the nested content of the annotations and genomic locations columns.

[Figure: VisiData showing the initial view on opening the gene data report and selecting the second row]
[Figure: VisiData showing a drill-down into the annotations column]
[Figure: VisiData showing genomic locations after drill-down]

How do I process JSON and JSON Lines data in programming languages such as Python?

Since JSON is a widely adopted format, most modern programming environments include support, often as part of native language libraries.

JSON Lines is less widely adopted. However, the format consists of one JSON value per line and is therefore easily parsed with standard JSON libraries, one line at a time. For example, in Python:

import json

# Parse a JSON Lines file one record at a time.
with open('some_file.jsonl', encoding='utf-8') as my_file:
    for line in my_file:
        my_value = json.loads(line)  # one JSON value per line
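
The loop above can be extended slightly for real-world files. The following sketch uses only the standard library; the helper names iter_jsonl and to_jsonl are hypothetical. It skips blank lines, reports the line number of any malformed record, and shows the reverse direction, serializing records back to JSON Lines.

```python
import json


def iter_jsonl(lines):
    """Yield one parsed JSON value per non-blank line of an iterable of strings."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, such as a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Include the line number so a bad record is easy to locate.
            raise ValueError(f"line {line_number}: invalid JSON ({err})") from err


def to_jsonl(records):
    """Serialize records as JSON Lines text: one compact JSON value per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# An open file works too, e.g. iter_jsonl(open('some_file.jsonl', encoding='utf-8')).
records = list(iter_jsonl(['{"a": 1}\n', '\n', '[1, 2]\n']))
print(to_jsonl(records), end="")
```

Using a generator keeps memory use low for large reports, because only one record is held at a time unless the caller collects them into a list.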

Some third-party Python libraries that support JSON Lines include:

  • jsonlines: provides basic JSON Lines input/output.
  • Pandas: a data science tool. Use pandas.read_json() and pandas.DataFrame.to_json(), specifying lines=True for JSON Lines.
  • Apache Spark: another data science tool. By default, read.json() expects JSON Lines rather than JSON. Set the multiLine option to true when reading regular JSON, which may be pretty-printed across multiple lines.
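
As an illustration of the Pandas calls mentioned above, the following minimal sketch reads JSON Lines text into a DataFrame and writes it back out. It assumes pandas is installed; the field names gene_id and symbol are hypothetical and are not taken from an actual NCBI report schema. Nested fields would need further handling that is not shown here.

```python
import io

import pandas as pd

# Two hypothetical flat records in JSON Lines form.
jsonl_text = (
    '{"gene_id": 2, "symbol": "A2M"}\n'
    '{"gene_id": 7157, "symbol": "TP53"}\n'
)

# lines=True tells pandas to parse one JSON value per line.
# A path such as 'some_file.jsonl' can be passed instead of the StringIO buffer.
df = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(df)

# Writing JSON Lines requires orient="records" together with lines=True.
round_trip = df.to_json(orient="records", lines=True)
print(round_trip)
```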
Generated May 16, 2024