Comma Separated JSON

Created 18th June, 2016 07:32 (UTC), last edited 21st June, 2016 13:05 (UTC)

Comma Separated JSON (CSJ) is a CSV like file format designed for stream processing where each cell is valid JSON. This makes it very similar to CSV, but without the problems that CSV has.

The problem with JSON is that to produce it you need to build a memory structure of everything you want to dump out, and to parse it you have to build everything in one go back into memory. This is fine for small JSON blobs, but isn't really ideal when the data consists of many mega bytes, or more.

XML solves this by having event based parsers that allow you to read sub-sections of the structure as they stream past. Kind of great, but who really wants to go back to XML?

CSV solves this in a different way. By having each line of data pretty much independent we can both generate and parse it one line at a time. This makes streaming it out and streaming it into things pretty painless. But the problem is that CSV isn't really a well defined file format with a well defined syntax.

  • What is the correct way to embed a new line? Do you expect \n or quoted text where the cell data actually goes into the next line?
  • What about embedded double quotes? Are they doubled up "" or slash escaped \"?
  • Are you going to parse the phone number +6686555123456 as 6.6865551e12?
  • What about dates and times? Actually, let's not even think about that.
  • If you have a single column of data, is the blank line at the end an empty cell and part of the data or not? What about blank lines in the middle?

All of these cases are solved in CSJ by using a JSON base to produce something that looks almost the same as CSV but without the parsing difficulties.

Below is a tiny CSJ file:

"name", "age", "job"
"Kirit Sælensminde", 45, "Minister Without  Portfolio"
"Freyja Sælensminde", 5, null

This looks almost exactly like the same CSV file would, and that's no accident. However we now know a few more things about the file. Following JSON's encoding rules the file is UTF-8. The strings are escaped using JSON rules so Unicode is simple to deal with. As are embedded new lines and double quotes.

Empty cells can now also be properly explicit making use of null. We would also have access to proper booleans. Never again get telephone numbers and actual numbers confused in processing.

Because it's a new format we can also be a bit stricter on specifying how we want a last tricky aspect to be dealt with:

  • All dates and times are ISO formatted
  • Times always have a time zone offset associated with them

Notes on consumption and production

Semantically a CSJ file is an array of JSON objects which share a common set of keys.

The media type for CSJ should be application/csj (following on from JSON's media type).

A single line from a CSJ file can be prepended with [ and suffixed by ] and run through a standard JSON parser.

Because it is JSON a cell doesn't have to be a JSON atom. You can embed JSON objects and arrays into a line and everything will work exactly as you'd expect with no ambiguity of how to produce or consume the data.

Like CSV, it can be produced with very low overhead. Our code that turns Postgres SQL statements into CSJ is able to stream the data over a HTTP connection at twice the speed that psql is able to stream the same SQL statement data into /dev/null. For my 32GB desktop producing JSON output for 10 million rows of half-a-dozen columns requires more RAM than I have. The RAM overhead to produce the CSJ data stream barely registers.

The example below includes embedded arrays:

"slug", "title", "released", "length_minutes", "created","tags", "watched__last", "watched__times"
"t1","Terminator","1984-10-26",null,"2016-06-11 08:54:09.006744+00",["adventure","action","dystopian","robots","time-travel","sci-fi"],"2016-06-11 08:55:31.54614+00",6
"t2","Terminator 2: Judgement Day","1991-07-01",94,"2016-06-11 08:54:12.895416+00",["adventure","action","dystopian","robots","time-travel","sci-fi"],null,null

It would be really cool if spreadsheets had a “CSJ export” option where we wouldn't suffer the sort of data corruption that all too often occurs

Now we'd never have to worry about a user loading a file into a spreadsheet, doing some work, and then corrupting half the data we need to process when it's saved.

Onward and upward

We've been introducing this in a few APIs that we've been building recently and are now in the process of building larger systems that use CSJ. Probably at some point in the near future we'll have a little JavaScript library that can be used to process server requests that return data in CSJ.

This is really a small part of a larger ecosystem that we're in the process of building out and you should be hearing more about in the coming months.


Categories: