Processing Huge JSON Files with Jackson

Weird Al's not the only one who knows how to go large with Jackson

A couple of months ago I was asked to build a processor that would take a JSON file, perform a few elementary checks and transformations, and upload the resulting records into CouchDB. I hadn’t worked with JSON or Couch before and, naturally, the whole thing had to be done yesterday – but at first glance it didn’t look like much of a challenge.

However, on closer inspection, my processor was going to be part of a replication process across two databases housed in separate enterprises. The replication was going to be based on a daily snapshot – each JSON file would be a copy of the entire database – and a few back-of-the-envelope calculations suggested that the files might become rather large (40+GB) over time.

There was also a fairly tight processing window within which the upload had to be complete. Repressed memories of chasing down and remediating an old DOM-based XML project plagued with OutOfMemoryErrors and very slow processing times warned that some thought was required.

Obviously I wasn’t going to be able to read the whole file into RAM to transform it and, fortunately, I didn’t need to – the file was a collection of a couple of million simple objects with no cross-references. So to minimize the amount of memory required by the processor, I simply needed to parse the large file into a stream of objects and transform them either individually or in small collections.

Jackson

Key “JSON Parsing” into your favourite search engine and you find about thirty different technologies competing for your attention, along with about a hundred different opinions as to which is “the best”. After a brief review, Jackson recommended itself through its tutorial, which indicated that it included a Streaming API as well as a DOM-like Tree Model and JAXB-style object bindings.

Given the memory and time concerns above, my first idea was to use the Streaming API to identify object boundaries within the input file, break it down into small chunks, and then use a JAXB-style object binding to do the simple transformations required (mostly field name changes). The object binding piece was beautifully trivial:

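// BriefView and FullView (not shown) are plain marker classes that name the views.
public class Business {
    // Included in both the brief and the full view.
    @JsonView({ BriefView.class, FullView.class })
    private String businessId;

    @JsonView({ BriefView.class, FullView.class })
    private String name;

    // Only included in the full view.
    @JsonView({ FullView.class })
    private String address;

    ...
}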

Using the JsonView concept, I was able to use the same value class to render different JSON requests for various external services supporting the checking and transformations required.

The Streaming API piece wasn’t going so smoothly. Although it was straightforward to identify the object boundaries in the JSON token stream, there didn’t seem to be a natural (i.e. easy) way of accessing the underlying character stream in order to create the smaller work unit for the object bindings to use. A re-think was necessary.

The object bindings were giving me an easy way of creating a collection of value objects in memory from a JSON stream – and conversely of creating a JSON stream from a collection of value objects. If I could get the Jackson object binding to create a stream of value objects, then I could dispense with the streaming API altogether. Some trawling through the Jackson API soon showed this wasn’t only possible, but easy:

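// MappingJsonFactory attaches an ObjectMapper to the parsers it creates,
// which is what lets readValueAs bind JSON to value objects.
JsonFactory f = new MappingJsonFactory();
// Note: Jackson 2.x deprecates createJsonParser in favour of createParser.
JsonParser jp = f.createJsonParser(reader);

do {
    // Reads just enough of the stream to build one Business object.
    Business business = jp.readValueAs(Business.class);

    if (business == null) {
        // No more objects: let the handler finish its work.
        handler.complete(context);
        break;
    }

    numberOfBusinessesReceived++;
    handler.handleBusiness(business, context);

} while (true);
jp.close();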

The object handler here applies some cheap validation rules and aggregates a bunch of objects into a file for later (in memory) processing and upload to Couch. The context object carries some extra information – like the name of the file being processed.
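The handler itself isn't shown in this post, so what follows is only a minimal sketch of the shape such a handler might take, using nothing beyond the JDK. The method names handleBusiness and complete mirror the calls in the loop above; everything else is an assumption made for illustration: the HandlerSketch and ChunkingHandler names, the "must have an id" validation rule, the chunk size, the use of a plain String as the context, and the cut-down Business stand-in. The sink is where a real handler would write each chunk to a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class HandlerSketch {

    /** Stand-in for the article's Business value class; only two fields are modelled. */
    static final class Business {
        final String businessId;
        final String name;

        Business(String businessId, String name) {
            this.businessId = businessId;
            this.name = name;
        }
    }

    /**
     * Hypothetical handler. handleBusiness and complete mirror the calls made by the
     * parsing loop; the validation rule, the chunk size and the sink are assumptions.
     */
    static final class ChunkingHandler {
        private final int chunkSize;
        private final Consumer<List<Business>> sink; // e.g. "write this chunk to a file"
        private List<Business> chunk = new ArrayList<>();
        private int rejected;

        ChunkingHandler(int chunkSize, Consumer<List<Business>> sink) {
            if (chunkSize <= 0) {
                throw new IllegalArgumentException("chunkSize must be positive");
            }
            this.chunkSize = chunkSize;
            this.sink = sink;
        }

        /** Cheap validation first; valid records accumulate until a chunk is full. */
        void handleBusiness(Business business, String context) {
            if (business.businessId == null || business.businessId.isEmpty()) {
                rejected++; // assumed rule: a record must carry an id
                return;
            }
            chunk.add(business);
            if (chunk.size() == chunkSize) {
                flush();
            }
        }

        /** Called once at end of input so the final partial chunk is not lost. */
        void complete(String context) {
            flush();
        }

        int rejectedCount() {
            return rejected;
        }

        private void flush() {
            if (!chunk.isEmpty()) {
                sink.accept(chunk);
                chunk = new ArrayList<>(); // fresh list: the sink may keep the old one
            }
        }
    }

    public static void main(String[] args) {
        ChunkingHandler handler = new ChunkingHandler(2,
                chunk -> System.out.println("chunk of " + chunk.size()));
        handler.handleBusiness(new Business("b1", "Acme"), "snapshot.json");
        handler.handleBusiness(new Business(null, "No id"), "snapshot.json");
        handler.handleBusiness(new Business("b2", "Bolt"), "snapshot.json");
        handler.handleBusiness(new Business("b3", "Cog"), "snapshot.json");
        handler.complete("snapshot.json");
        System.out.println("rejected: " + handler.rejectedCount());
    }
}
```

The point of the design is that the handler never holds more than one chunk at a time, so memory use depends on the chunk size rather than on the size of the input file.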

The magic provided by Jackson is the readValueAs method, which reads enough of the stream to create an object. Pushing a 10GB file of two million records through the processor, running on an ordinary desktop with 256MB of heap space, yielded a processing time of 6 minutes – well within the required processing time window.
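To put those figures in context, here is a small back-of-the-envelope sketch in plain Java. It takes the measured run (10GB, two million records, 6 minutes) and extrapolates to the 40GB file size mentioned earlier. Two assumptions are baked in and are not claims from the measurement: a gigabyte is treated as 10^9 bytes, and processing time is assumed to scale linearly with file size. The class and method names are made up for the example.

```java
final class ThroughputSketch {

    /** Bytes processed per second for a run of the given size and duration. */
    static double bytesPerSecond(double bytes, double seconds) {
        return bytes / seconds;
    }

    /** Minutes needed for a file of the given size at a fixed throughput. */
    static double minutesFor(double bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond / 60.0;
    }

    public static void main(String[] args) {
        double gb = 1e9;             // assumption: decimal gigabytes
        double runBytes = 10 * gb;   // measured run: 10GB
        double runSeconds = 6 * 60;  // measured run: 6 minutes
        double records = 2_000_000;  // measured run: two million records

        double rate = bytesPerSecond(runBytes, runSeconds);
        System.out.printf("throughput: %.1f MB/s%n", rate / 1e6);
        System.out.printf("records per second: %.0f%n", records / runSeconds);
        System.out.printf("average record size: %.0f bytes%n", runBytes / records);
        // Assumes time grows linearly with file size, which was not measured.
        System.out.printf("projected time for 40GB: %.0f minutes%n",
                minutesFor(40 * gb, rate));
    }
}
```

Under those assumptions the run works out to roughly 28 MB/s, about 5,500 records per second and an average record of about 5 KB, and a 40GB snapshot would take in the region of 24 minutes.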

Conclusion

The creators of Jackson have learnt the lessons from XML parsing well. Jackson offers a range of parsing approaches that allows developers to choose the right approach to the task at hand. In short, Jackson eats massive JSON files – without raising a sweat.
