Streaming Large JSON Deserialisation

22 October 2015, Rhodri Pugh

JSON is the current de facto data transport format for the web, and we use it internally and externally for various APIs we run. Options for libraries to encode and decode JSON are plentiful in any language, and generally the JSON being handled is reasonably small, meaning the usage pattern looks much like this…

<?php

$data = json_decode($jsonString); // the whole document, decoded in one go

i.e. the whole thing is realised in memory at once, and you can get right to work with it.

Object vs Stream

The above approach can be described as object-based, both for serialising and deserialising. It’s simple, and the obvious choice for the majority of use cases.

Some drawbacks though…

  • The data needs to be able to fit into memory all at once.
  • You need to spend the time deserialising the data before you can begin work (or someone else needs to wait for you to finish encoding it before they can begin).

An alternative model supported by some tools is to offer an API for interacting with the JSON as a stream. The benefits are really the inverse of the drawbacks above…

  • Only a subset of the data needs to be in memory at a time (infinite streams!).
  • You (or other producers/consumers in the chain) can begin processing right away.

One of those libraries being…

Gson

Google’s Gson is a Java library that allows mapping to/from POJOs, and it supports encoding/decoding JSON as a stream.
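
In the object-based style that mapping is a one-liner each way. A minimal sketch (the Upload class here is just an illustrative POJO, not part of our service)…

import com.google.gson.Gson;

class Upload {
    String name;
    String type;
}

Gson gson = new Gson();

// Deserialise a whole document into a POJO in one go...
Upload upload = gson.fromJson("{\"name\": \"a.pdf\", \"type\": \"application/pdf\"}", Upload.class);

// ...and serialise it back out again.
String json = gson.toJson(upload);

The streaming API is pretty straightforward too. Given some JSON, for example an array like this…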

[
  {"foo": 123},
  {"bar": 456}
]

You can iterate over these elements using…

import com.google.gson.stream.JsonReader;
import java.io.InputStreamReader;

JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));

reader.beginArray();

while (reader.hasNext()) {
    reader.beginObject();

    String key = reader.nextName(); // e.g. "foo"
    int value = reader.nextInt();   // e.g. 123

    reader.endObject();
}

reader.endArray();

It also provides the ability to specify custom type adapters and a bunch of other useful stuff.
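
To give a flavour, a custom adapter is just a read/write pair built on the same streaming primitives. A rough sketch for the illustrative Upload POJO from above (not code from our service)…

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;

class UploadAdapter extends TypeAdapter<Upload> {
    @Override
    public Upload read(JsonReader reader) throws IOException {
        Upload upload = new Upload();
        reader.beginObject();
        while (reader.hasNext()) {
            switch (reader.nextName()) {
                case "name": upload.name = reader.nextString(); break;
                case "type": upload.type = reader.nextString(); break;
                default: reader.skipValue(); // ignore unknown keys
            }
        }
        reader.endObject();
        return upload;
    }

    @Override
    public void write(JsonWriter writer, Upload upload) throws IOException {
        writer.beginObject();
        writer.name("name").value(upload.name);
        writer.name("type").value(upload.type);
        writer.endObject();
    }
}

// Registered when building the Gson instance.
Gson gson = new GsonBuilder()
    .registerTypeAdapter(Upload.class, new UploadAdapter())
    .create();

But back to our problem.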

The Problem

As nice as the streaming API in Gson is, we have one service that consumes file uploads, described in JSON like this…

[
  {
    "name": "Name of the file.pdf",
    "type": "application/pdf",
    "content": "base64 encoded content of the file"
  }
]

As you can imagine, the base64-encoded content of the file can be pretty big. Not big data big, of course, but all memory is finite, and when you’re trying to run a service on as little of it as possible you can’t load arbitrary blobs like this…

reader.nextName(); // "content"
reader.nextString(); // Out of memory! Do not pass go, do not collect £200...

The Solution

Our solution for handling these large values was to offload the data for these keys to disk, via a new method on the reader called nextStream. Instead of returning the value of the key as a particular type like the other methods do, it takes an OutputStream and writes the content of the key into it.

File file = File.createTempFile("some-name", ".ext"); // suffix needs the dot
FileOutputStream fos = new FileOutputStream(file);

reader.nextStream(fos); // Success!

This stream could of course write to disk, to a network share, to an S3 bucket, whatever… The important part being that only the relatively small names and content types ever need to be held in memory.
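
Putting it all together, the whole read loop ends up looking something like this. A sketch, assuming the proposed nextStream method and the upload shape above…

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

reader.beginArray();

while (reader.hasNext()) {
    String name = null;
    String type = null;
    File content = File.createTempFile("upload-", ".bin");

    reader.beginObject();
    while (reader.hasNext()) {
        switch (reader.nextName()) {
            case "name":
                name = reader.nextString(); // small, safe to hold in memory
                break;
            case "type":
                type = reader.nextString();
                break;
            case "content":
                // The big value goes straight to disk, never into memory.
                try (OutputStream out = new FileOutputStream(content)) {
                    reader.nextStream(out); // proposed method
                }
                break;
            default:
                reader.skipValue();
        }
    }
    reader.endObject();

    // name, type and the content file are now ready for the next step.
}

reader.endArray();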

We also use commons-codec to decode the base64 data as it comes in. Java streams really work nicely sometimes!

import org.apache.commons.codec.binary.Base64OutputStream;

FileOutputStream fos = new FileOutputStream(file);
Base64OutputStream bos = new Base64OutputStream(fos, false); // false = decode, not encode

reader.nextStream(bos);

This leaves the data decoded on the filesystem, ready to be streamed into the next step of data processing. We’ve proposed this as an addition to Gson…

https://github.com/google/gson/pull/718
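
For completeness, the hand-off into that next step can stay streaming too. A minimal sketch, assuming the downstream consumer exposes an OutputStream (the downstream variable here is purely illustrative)…

import java.io.FileInputStream;
import java.io.InputStream;

byte[] buffer = new byte[8192];

try (InputStream in = new FileInputStream(file)) {
    int read;
    while ((read = in.read(buffer)) != -1) {
        downstream.write(buffer, 0, read); // e.g. an HTTP response or S3 upload
    }
}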

Streaming FTW

I’ve written about handling data as streams before. It’s a powerful technique that fits well with the bite-sized, functional approach I like to take when decomposing problems (so that they can fit into my tiny little brain!).

Hope this was interesting.