Recommended way to save and read DataFrames in JSON format

I need to save some tabular data in JSON format (from a game engine) to read it in Julia as a DataFrame (or IndexedTable). I’m trying to understand the best way to save this kind of data in JSON. The following two options make sense to me:

  • As a dict of columns:
{"x":[1,2,3],"y":[4,5,6]}
  • As a list of dicts separated by newlines to be read as named tuples:
{"x":1,"y":4}
{"x":2,"y":5}
{"x":3,"y":6}

I believe the second option can be loaded very efficiently on Julia 0.7 with JSON2 (though I’m not sure), and it seems more amenable to implementing the DataStreams interface.

My question is: how should one save tabular data in JSON and what would be the most efficient/recommended way to load it in Julia (as a named tuple of vectors, or a DataFrame or an IndexedTable)?

I’d be inclined to go with the first option (dict of columns), since that would deserialize (with JSON2, at least) as a NamedTuple of vectors. I also have some ideas for getting better typing when parsing arrays (currently, if the vector types aren’t provided, they’re assumed to be Any).
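For example, here’s a rough sketch of what I mean (untested output, but this is the gist of the untyped vs. typed reads):

julia> using JSON2

julia> JSON2.read("""{"x":[1,2,3],"y":[4,5,6]}""")
(x = Any[1, 2, 3], y = Any[4, 5, 6])

julia> JSON2.read("""{"x":[1,2,3],"y":[4,5,6]}""",
                  NamedTuple{(:x, :y), Tuple{Vector{Int}, Vector{Int}}})
(x = [1, 2, 3], y = [4, 5, 6])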

I can add a DataStreams interface implementation to JSON2 pretty easily. (issue created: https://github.com/quinnj/JSON2.jl/issues/4)

Thanks a lot! That’d be very useful.

Regarding JSON parsing, my first intuition was the dict of columns as well, but I was a bit surprised to see that it loads as a NamedTuple of Vector{Any}. It’d be really nice if JSON2 could somehow detect the element type (at least for simple cases: Float64, Int, String, etc.).

Slightly off-topic: what’s the difference between the JSON and JSON2 packages, and when would we use one over the other?

My understanding is that JSON2 is more optimized, but since it relies heavily on NamedTuples it is only available on Julia 0.7.

… but I was a bit surprised to see that it loads as a NamedTuple of Vector{Any}

You might find LazyJSON.jl helpful. It takes a different approach to parsing: instead of allocating new Julia objects (NamedTuple, Vector, Dict, etc.), it implements abstract interfaces that allow the JSON values to be accessed in place.

Note, LazyJSON.jl is new and not widely tested, so it may not be suitable for your purpose. But, please try it out and report any issues you find.

e.g.

julia> using LazyJSON

julia> j = LazyJSON.value("""{"x": [1,2,3,4,5], "y": [2,4,6,8,10]}""")
LazyJSON.Object

julia> j["x"]
LazyJSON.Array

The LazyJSON.Object and LazyJSON.Array objects don’t store any data; they are just interface shims that know the index of the first byte of the JSON value. No parsing has been done yet. From here you can convert to whatever type you like; the values are parsed as they are pulled in by the constructor for the new type.

julia> convert(Vector{Int}, j["x"])
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

With DataFrames you could do something like this:

julia> using DataFrames

julia> d = DataFrame(x = Int[], y = Int[])
0×2 DataFrame


julia> for v in zip(j["x"], j["y"])
           push!(d, v)
       end

julia> d
5×2 DataFrame
│ Row │ x │ y  │
├─────┼───┼────┤
│ 1   │ 1 │ 2  │
│ 2   │ 2 │ 4  │
│ 3   │ 3 │ 6  │
│ 4   │ 4 │ 8  │
│ 5   │ 5 │ 10 │

This way the values are parsed directly from the JSON text into the DataFrame without needing an intermediate representation.

I know that jsonlite in R defaults to the row-wise format. I prefer the row-dictionaries like this, though bulkier, because that’s what I almost always see coming from clients’ producer systems: they are filtering and outputting rows.

It’s easier to read, and easier to edit or detect errors in, than trying to line up large columns.

Ideally a fromJSON function would detect the orientation, and a toJSON function would have an option to choose the orientation.
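Detecting the orientation wouldn’t take much code. Here’s a rough sketch of the idea (detect_orientation is a made-up name, using JSON.jl’s parse; the newline-delimited flavor would need eachline instead):

using JSON

# Guess whether a JSON table is column-oriented (one object of arrays)
# or row-oriented (an array of objects).
function detect_orientation(str::AbstractString)
    parsed = JSON.parse(str)
    if parsed isa AbstractDict && all(v -> v isa AbstractVector, values(parsed))
        return :columns
    elseif parsed isa AbstractVector && all(r -> r isa AbstractDict, parsed)
        return :rows
    else
        error("unrecognized table orientation")
    end
end

detect_orientation("""{"x":[1,2,3],"y":[4,5,6]}""")     # :columns
detect_orientation("""[{"x":1,"y":4},{"x":2,"y":5}]""") # :rows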

Lastly, the rows approach is the only one that is amenable to nested content, which is, for many, the whole point of using JSON instead of a CSV.
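For example, a row with nested content might look like this:

{"x":1,"meta":{"tag":"a","score":0.5}}

There’s no natural dict-of-columns equivalent for that without flattening the nested object.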

The reason I asked in the first place is that a Google Cloud Platform service I considered using requires the row-by-row format to upload data: Loading JSON data from Cloud Storage | BigQuery | Google Cloud

OTOH with JSON2 I guess one should be able to give explicitly the NamedTuple type and read row by row. It’d be ideal if this could be integrated in say IterableTables or some other mechanism to turn NamedTuple iterators in DataFrames.
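Something like this untested sketch, say ("table.jsonl" is a made-up file name, and I’m assuming JSON2’s typed read works per line):

using JSON2, DataFrames

const Row = NamedTuple{(:x, :y), Tuple{Int, Int}}

df = DataFrame(x = Int[], y = Int[])
for line in eachline("table.jsonl")
    # an explicit type gives a fully typed NamedTuple per row
    push!(df, JSON2.read(line, Row))
end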