Wrangling large JSON files

Dear All,

I’m handling a large JSON file (in my case, about 100GB of ndjson). I need to query it in various, still-TBD ways. Luckily, I can rely on a fixed schema, which is nested (although not deep, it only goes three levels down).

I could transform this into a database or a csv, e.g., via jq. But I’d prefer to stay native and explore it in Julia.

One of the goals at hand would be to build a graph out of the data.

How would you do this?

I believe the standard JSON.jl package will not work unless you have huge amounts of RAM, since it creates a complete in-memory dictionary of the JSON object. I haven’t tried the other JSON packages (JSON2.jl and Json2.jl on GitHub).

It might be worthwhile to have a look at LazyJSON.jl: its description of parsing only the parts that you need seems to match your requirement of being able to operate on 100GB files.
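
Something along these lines might already cover a lot (a minimal sketch, assuming each ndjson line is one JSON object and using a made-up "id" field; LazyJSON.value gives a lazy view, so only the fields you actually index into get materialised):

using LazyJSON

ids = String[]
open("my_file.ndjson", "r") do io
    for line in eachline(io)
        v = LazyJSON.value(line)       # lazy view over the line; nothing is parsed up front
        push!(ids, String(v["id"]))    # only the (hypothetical) "id" field gets extracted
    end
end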

I’m assuming there are many entries on each level of the hierarchy. I would think that what you need, conceptually, is to maintain pointers to the start (and maybe end) of each level of the hierarchy you enter, in a depth-first manner; then you can try to minimise what you need to reprocess when you want to jump to a different part. That might be a reusable library built from parts of the different JSON libraries, if that isn’t what LazyJSON already provides.
If the number of elements at the top level is not large, it might be worthwhile to split it into a few separate JSON files, each of which then only contains two levels of the hierarchy, making the parts easier to manage.
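
For ndjson, the analogue of that top-level split is simply chunking by lines; a rough sketch (the chunk size and output file names are arbitrary):

nlines_per_chunk = 1_000_000   # arbitrary chunk size
open("my_file.ndjson", "r") do io
    chunk = 0
    while !eof(io)
        chunk += 1
        open("my_file_part$(chunk).ndjson", "w") do out
            for _ in 1:nlines_per_chunk
                eof(io) && break
                println(out, readline(io))   # copy one record to the current chunk file
            end
        end
    end
end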


I don’t think any of Julia’s JSON packages currently support ndJSON—it would probably be a simple change, but worth keeping in mind that they probably won’t work for you out of the box.


Thank you both!

So far the newline delimitation has not seemed like a huge problem. Something like the following behaves as expected, returning the next line of the file each time JSON.parse is called on the stream:

using JSON
using DataFrames

file_json = open("my_file.ndjson", "r")

file_json |>
  JSON.parse |>
  DataFrame

close(file_json)

Quite nicely, the DataFrame does not get upset by the presence of nested JSON in some of the columns.

So, one solution that would replicate the jq workflow would be to save each line to disk as I go through the lines of the ndjson, e.g. using CSV.write(...; append = true), and then feed that to JuliaDB with loadtable() (it should work, right?).
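
Something like this is what I have in mind (just a sketch; the file names are placeholders, and the nested columns would probably need to be flattened or stringified before CSV.write is happy with them):

using JSON
using DataFrames
using CSV
using JuliaDB

csv_path = "my_file.csv"

open("my_file.ndjson", "r") do io
    for (i, line) in enumerate(eachline(io))
        row = DataFrame(JSON.parse(line))          # one-row DataFrame for this record
        CSV.write(csv_path, row; append = i > 1)   # write the header only for the first row
    end
end

tbl = loadtable(csv_path)   # JuliaDB table, queryable without reparsing the JSON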

P.S. @lwabeke I’m not sure I understood correctly what you meant by “many entries on each level of the hierarchy”. The data is almost rectangular (fixed schema, 29 fields for each row, a few million rows).

Remember that standard JSON is a tree hierarchy, with objects inside objects.
What I meant was that if you look at the JSON as a tree structure, the way to optimise might differ based on the branching factor at each level. It sounds like you have a big branching factor at the root and then a low factor at the final leaf level.
That is different from a struct with 29 fields where each field is an array of millions of values, in which case it might be very useful to store pointers to the start of each array, allowing you to jump between parts of the JSON file and reparse, without keeping everything in RAM.
In your case, keeping a pointer to each row is still going to cost a lot of memory. Depending on the types of queries and whether the rows are sorted in any way, keeping the pointers might still be quicker than reparsing everything again.
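
A minimal sketch of that per-row pointer idea for the ndjson case (one pass to record where each row starts, then seek and reparse only the rows you need; the row index at the end is an arbitrary example):

using JSON

offsets = Int64[]
io = open("my_file.ndjson", "r")
while !eof(io)
    push!(offsets, position(io))   # byte offset where this row starts
    readline(io)                   # skip over the row without parsing it
end

seek(io, offsets[12_345])          # jump straight to an arbitrary row
row = JSON.parse(readline(io))     # reparse just that one line
close(io)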

I recently wrapped the YAJL JSON library here, which seems like it might be useful for your use case (YAJL supports multiple root objects and also streams its input). However, you’d have to write the parser yourself.
