Recommended way to save and read DataFrames in JSON format

I need to save some tabular data in JSON format (from a game engine) to read it in Julia as a DataFrame (or IndexedTable). I’m trying to understand the best way to save this kind of data in JSON. The following two options make sense to me:

  • As a dict of columns:
{"x":[1,2,3],"y":[4,5,6]}
  • As a list of dicts separated by newlines to be read as named tuples:
{"x":1,"y":4}
{"x":2,"y":5}
{"x":3,"y":6}

I believe the second option can be loaded very efficiently on Julia 0.7 with JSON2 (though I’m not sure), and it seems more amenable to implementing the DataStreams interface.

My question is: how should one save tabular data in JSON and what would be the most efficient/recommended way to load it in Julia (as a named tuple of vectors, or a DataFrame or an IndexedTable)?

I’d be inclined to go with the first option (dict of columns), since that would deserialize (with JSON2, at least) as a NamedTuple of vectors. I also have some ideas for getting better typing when parsing arrays (currently, if the vector types aren’t provided, they’re assumed to be Any).
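For example, here’s a rough sketch of what I mean (untested output, but this is the gist of the untyped vs. typed reads):

julia> using JSON2

julia> JSON2.read("""{"x":[1,2,3],"y":[4,5,6]}""")
(x = Any[1, 2, 3], y = Any[4, 5, 6])

julia> JSON2.read("""{"x":[1,2,3],"y":[4,5,6]}""",
                  NamedTuple{(:x, :y), Tuple{Vector{Int}, Vector{Int}}})
(x = [1, 2, 3], y = [4, 5, 6])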

I can add a DataStreams interface implementation to JSON2 pretty easily. (issue created: https://github.com/quinnj/JSON2.jl/issues/4)

Thanks a lot! That’d be very useful.

Regarding JSON parsing, my first intuition was the dict of columns as well, but I was a bit surprised to see that it loads as a NamedTuple of Vector{Any}. It’d be really nice if JSON2 could somehow detect the element type (at least for simple cases: Float64, Int, String, etc.).

Slightly off-topic: what’s the difference between the JSON and JSON2 packages, and when would we use one over the other?

My understanding is that JSON2 is more optimized, but since it relies heavily on NamedTuples it is only available on Julia 0.7.

… but I was a bit surprised to see that it loads as a NamedTuple of Vector{Any}

You might find LazyJSON.jl helpful. It takes a different approach to parsing: instead of allocating new Julia objects (NamedTuple, Vector, Dict, etc.), it implements abstract interfaces that allow the JSON values to be accessed in place.

Note, LazyJSON.jl is new and not widely tested, so it may not be suitable for your purpose. But, please try it out and report any issues you find.

e.g.

julia> using LazyJSON

julia> j = LazyJSON.value("""{"x": [1,2,3,4,5], "y": [2,4,6,8,10]}""")
LazyJSON.Object

julia> j["x"]
LazyJSON.Array

The LazyJSON.Object and LazyJSON.Array objects don’t store any data; they are just interface shims that know the index of the first byte of the JSON value. No parsing has been done yet. From here you can convert to whatever type you like; the values are parsed as they are pulled in by the constructor for the new type.

julia> convert(Vector{Int}, j["x"])
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

With DataFrames you could do something like this:

julia> using DataFrames

julia> d = DataFrame(x = Int[], y = Int[])
0×2 DataFrame


julia> for v in zip(j["x"], j["y"])
           push!(d, v)
       end

julia> d
5×2 DataFrame
│ Row │ x │ y  │
├─────┼───┼────┤
│ 1   │ 1 │ 2  │
│ 2   │ 2 │ 4  │
│ 3   │ 3 │ 6  │
│ 4   │ 4 │ 8  │
│ 5   │ 5 │ 10 │

This way the values are parsed directly from the JSON text into the DataFrame without needing an intermediate representation.

I know that jsonlite in R defaults to the row-wise format. I prefer the row-dictionaries like this, though bulkier, because that’s what I almost always see coming from clients’ producer systems: they are filtering and outputting rows.

It’s easier to read, and easier to edit or detect errors in, than trying to line up large columns.

Ideally a fromJSON function would detect the orientation, and a toJSON function would have an option to choose the orientation.
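Detecting the orientation wouldn’t take much code. Here’s a rough sketch of the idea (detect_orientation is a made-up name, using JSON.jl’s parse; the newline-delimited flavor would need eachline instead):

using JSON

# Guess whether a JSON table is column-oriented (one object of arrays)
# or row-oriented (an array of objects).
function detect_orientation(str::AbstractString)
    parsed = JSON.parse(str)
    if parsed isa AbstractDict && all(v -> v isa AbstractVector, values(parsed))
        return :columns
    elseif parsed isa AbstractVector && all(r -> r isa AbstractDict, parsed)
        return :rows
    else
        error("unrecognized table orientation")
    end
end

detect_orientation("""{"x":[1,2,3],"y":[4,5,6]}""")     # :columns
detect_orientation("""[{"x":1,"y":4},{"x":2,"y":5}]""") # :rows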

Lastly, the rows approach is the only one that is amenable to nested content, which is, for many, the whole point of using JSON instead of a CSV.
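For example, a row with nested content might look like this:

{"x":1,"meta":{"tag":"a","score":0.5}}

There’s no natural dict-of-columns equivalent for that without flattening the nested object.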

The reason I asked in the first place is that a Google Cloud Platform service I considered using requires the row-by-row format to upload data: Loading JSON data from Cloud Storage | BigQuery | Google Cloud

OTOH with JSON2 I guess one should be able to give explicitly the NamedTuple type and read row by row. It’d be ideal if this could be integrated in say IterableTables or some other mechanism to turn NamedTuple iterators in DataFrames.
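Something like this untested sketch, say ("table.jsonl" is a made-up file name, and I’m assuming JSON2’s typed read works per line):

using JSON2, DataFrames

const Row = NamedTuple{(:x, :y), Tuple{Int, Int}}

df = DataFrame(x = Int[], y = Int[])
for line in eachline("table.jsonl")
    # an explicit type gives a fully typed NamedTuple per row
    push!(df, JSON2.read(line, Row))
end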