Efficiently Read JSON and Create DataFrame

I am taking in a simulation from another process written in existing software. It outputs a JSON package of a few tables (stored as arrays of objects). I am running an optimization in JuMP based on the simulation. The optimization runs in a few seconds, but reading the JSON and converting it into a DataFrame takes a long time, specifically converting the large simulation table into a DataFrame.

I’m hoping that I am doing something inefficiently that can be improved.

function read_json(file)
    open(file,"r") do f
        global inDict
        inDict = JSON.parse(f)
    end
    return inDict
end
inDict = read_json(file)

println("Creating Data Frames")
simstates = vcat(DataFrame.(inDict["simstates"])...)

After calling the function, I ran a benchmark on the read:

julia> @benchmark inDict = read_json(file)
BenchmarkTools.Trial:
memory estimate: 1.79 GiB
allocs estimate: 29547484

minimum time: 13.738 s (8.85% GC)
median time: 13.738 s (8.85% GC)
mean time: 13.738 s (8.85% GC)
maximum time: 13.738 s (8.85% GC)

samples: 1
evals/sample: 1

Then a benchmark on the conversion to a DataFrame:

@benchmark simstates = vcat(DataFrame.(inDict["simstates"])...)
BenchmarkTools.Trial:
memory estimate: 3.95 GiB
allocs estimate: 65015493

minimum time: 26.722 s (8.33% GC)
median time: 26.722 s (8.33% GC)
mean time: 26.722 s (8.33% GC)
maximum time: 26.722 s (8.33% GC)

samples: 1
evals/sample: 1

So it takes about 40 seconds to read in the simulation data. After that, the optimization takes about 5 seconds.

simstates is a 765000×6 DataFrame

This is actually a small-ish simulation. I expect a production run to be a multiple of this.

What happens to the times if you replace read_json with

function read_json(file)
    open(file,"r") do f
        return JSON.parse(f)
    end
end

You also might want to try JSON3 (GitHub - quinnj/JSON3.jl).


Same, about 13 seconds.

I’ll take a look at JSON3.

If it helps, JSON is structured like so:

{ 
    "var1": 12.34,
    "var2": 12.34,
    "current": [
        {
            "ID": 1,
            "var3": 12.34,
            "var4": 12.34
        },
        ...
    ],
    "potential": [
        {
            "ID": 10,
            "var3": 12.34,
            "var4": 12.34
        },
        ...
    ],
    "simstates": [
        {
            "simulation": 512,
            "date": "2020-12-31",
            "simvar1": -0.013495534501,
            "simvar2": -0.013495534501,
            "simvar3": 0.013495534501,
            "ID": 1
        },
        ...
    ]
}

Hi

There are a few JSON packages available: JSON.jl, JSON2.jl, JSON3.jl, LazyJSON.jl are ones I’m aware of.

See [ANN] JSON3.jl - Yet another JSON package for Julia for some details.

My summary:

  • JSON.jl - The original. It does proper handling, but is slow and very memory intensive. Uses parse to parse into a dict.
  • LazyJSON.jl - Memory efficient, which makes it relatively fast. Parse also gives a "dict"-type interface.
  • JSON2.jl & JSON3.jl - Claim to be fast. They parse into a provided type. At some point I looked at them and got the impression one (or both?) was cheating a bit: it assumed the JSON fields would match the sequence of fields defined in the structure (to speed up, it didn't check whether the labels matched). This works if the JSON string was created by the same package, but could break if it was created by another encoder. I'm not sure if this is still the case and/or if I just misunderstood the code.
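For concreteness, the "parse into a provided type" approach looks roughly like this. This is only a sketch, assuming the JSON3 and StructTypes packages are installed; `SimState` is a hypothetical struct matching the "simstates" records posted in this thread:

```julia
using JSON3, StructTypes

# Hypothetical struct mirroring one "simstates" record (subset of fields)
struct SimState
    simulation::Int
    date::String
    simvar1::Float64
    ID::Int
end

# Tell JSON3 how to map JSON objects onto the struct
StructTypes.StructType(::Type{SimState}) = StructTypes.Struct()

json = """{"simulation":512,"date":"2020-12-31","simvar1":-0.0135,"ID":1}"""
s = JSON3.read(json, SimState)

# An array of objects can be read into a Vector of structs in one call
states = JSON3.read("[" * json * "]", Vector{SimState})
```

Parsing straight into concrete types avoids the intermediate `Dict{String,Any}` and its many small allocations.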

I would assume LazyJSON shines if you only need to access a subset of the JSON, but as part of testing the Unmarshal.jl package I got the impression that even when unmarshaling the whole object, LazyJSON could still outperform JSON; it depends on the size and complexity of the structure.

The Unmarshal.jl package can be used to convert from the JSON.jl and LazyJSON.jl dict interface to a Julia type object, which might be an alternative to what you're doing in:

simstates = vcat(DataFrame.(inDict["simstates"])...)

It is, however, focused on functionality rather than performance; in particular, when working with the original JSON.parse, the parse itself seemed to dominate the timing compared to the Unmarshal step.
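A minimal sketch of that Unmarshal.jl route, using the same hypothetical `SimState` struct as above (not code from the thread, just an illustration of the API shape):

```julia
using JSON, Unmarshal

# Hypothetical struct mirroring one "simstates" record (subset of fields)
struct SimState
    simulation::Int
    date::String
    simvar1::Float64
    ID::Int
end

json = """{"simulation":512,"date":"2020-12-31","simvar1":-0.0135,"ID":1}"""
# Parse to a Dict with JSON.jl, then unmarshal the dict into the typed struct
s = unmarshal(SimState, JSON.parse(json))
```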

Your descriptions of the various JSON packages are mostly correct, but I can add some additional color:

  • JSON.jl: you're correct, it's the old standard, but doesn't try to do anything fancy with performance
  • LazyJSON.jl: completely lazy, but has a surprisingly high fixed cost just to create the LazyJSON.Object object, which I found could be prohibitive for really small objects. The biggest disadvantage is that the package is basically unmaintained; no real commits in years, and I suspect the code is getting pretty stale now
  • JSON2/JSON3.jl: JSON2.jl is effectively archived at this point and shouldn't be used. JSON3.jl is the successor and is indeed fast (this benchmarks the JSON data the OP posted in this thread):
julia> @btime JSON3.read(json);
  897.000 ns (2 allocations: 4.34 KiB)

julia> @btime JSON.parse(json);
  2.920 ΞΌs (61 allocations: 3.66 KiB)

It takes a hybrid lazy approach to avoid allocating when it doesn’t need to, making it very efficient when you only need a subset of the object.
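That access pattern can be sketched as follows (assuming JSON3 is installed; the field names mirror the OP's layout):

```julia
using JSON3

json = """{"var1":12.34,"simstates":[{"simulation":512,"ID":1}]}"""
obj = JSON3.read(json)      # a JSON3.Object view over the raw bytes
v = obj.var1                # only this scalar is materialized
id = obj.simstates[1].ID    # nested arrays/objects are drilled into on demand
```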

There's also JSONTables.jl, which makes the tables-to-JSON-and-back experience more convenient. That is, you can do df = DataFrame(jsontable(json_string)) and it just works.
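A minimal sketch of that round trip, assuming the JSON string is a single array of objects and the JSONTables and DataFrames packages are installed:

```julia
using JSONTables, DataFrames

json_str = """[{"ID":1,"var3":12.34},{"ID":2,"var3":56.78}]"""
df = DataFrame(jsontable(json_str))   # columns :ID and :var3, two rows
```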


There's also JSONTables.jl, which makes the tables-to-JSON-and-back experience more convenient. That is, you can do df = DataFrame(jsontable(json_string)) and it just works.

I was looking at this the other day. How would that work given that I have multiple tables inside the JSON string? That is, how could I easily use JSONTables.jl to pull out the "current", "potential", and "simstates" tables? Is that even possible with that library?

Ah, I didn’t realize each of those variables was a table. JSONTables.jl is more geared towards when your entire file/json is an array of objects or object of arrays. In your case, you can just do:

julia> x = JSON3.read(json)
JSON3.Object{Base.CodeUnits{UInt8,String},Array{UInt64,1}} with 5 entries:
  :var1      => 12.34
  :var2      => 12.34
  :current   => JSON3.Object[{…
  :potential => JSON3.Object[{…
  :simstates => JSON3.Object[{…

julia> cur = DataFrame(x.current)
1×3 DataFrame
│ Row │ ID    │ var3    │ var4    │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼──────────
│ 1   │ 1     │ 12.34   │ 12.34   │

julia> pot = DataFrame(x.potential)
1×3 DataFrame
│ Row │ ID    │ var3    │ var4    │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼──────────
│ 1   │ 10    │ 12.34   │ 12.34   │

julia> sim = DataFrame(x.simstates)
1×6 DataFrame
│ Row │ simulation │ date       │ simvar1    │ simvar2    │ simvar3   │ ID    │
│     │ Int64      │ String     │ Float64    │ Float64    │ Float64   │ Int64 │
├─────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────
│ 1   │ 512        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 1     │

And that should be pretty efficient.

Thanks for the help. Unfortunately DataFrame(x.simstates) throws an out-of-memory error and then all hell breaks loose. This is on a medium amount of data; like I said above, I expect a production run to be a multiple of this.

12GB memory on this machine and about 70% free before I start Julia.

What’s the reason for the less efficient memory usage?

It's not documented, and it has no users other than the dev team at my company. But maybe the JSONPointer part of XLSXasJSON will suit your needs?

We don't construct a DataFrame; we use the nested dictionary from the JSON as it is.

This is kind of clunky because the package is intended to read data from XLSX tables.

using XLSXasJSON

colname = ["var1" "var2" "current/1/ID" "current/1/var3" "current/1/var4" "potential/1/ID" "potential/1/var3" "potential/1/var4" "simstates/1/simulation" "simstates/1/date" "simstates/1/simvar1" "simstates/1/simvar2" "simstates/1/simvar3" "simstates/1/ID"]
row1 = [12.34 12.34 1 12.34 12.34 10 12.34 12.34 512 "2020-12-31" -0.0133 -0.0133 -0.0133 1]

jws = JSONWorksheet("a.xlsx", "sheet1", [colname; row1])

# now accessing data with JSONPointer
jws[1, j"/current"]
1-element Array{Any,1}:
 OrderedCollections.OrderedDict{String,Any}("ID" => 1.0,"var3" => 12.34,"var4" => 12.34)

jws[1, j"/current/1"]
OrderedCollections.OrderedDict{String,Any} with 3 entries:
  "ID"   => 1.0
  "var3" => 12.34
  "var4" => 12.34

jws[1, j"/current/1/var3"]
12.34

Reduced the data to 100,000 records. DataFrame(x.simstates) throws a StackOverflowError

ERROR: StackOverflowError:
Stacktrace:
[1] Array at .\boot.jl:404 [inlined]
[2] allocatecolumn at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:107 [inlined]
[3] add_or_widen!(::Int64, ::Int64, ::Symbol, ::Array{Float64,1}, ::Int64, ::Base.RefValue{Any}, ::Base.HasShape{1}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:142
[4] __buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\utils.jl:187
[5] __buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:163 (repeats 6523 times)
[6] _buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::JSON3.Object{Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::NTuple{6,Tables.EmptyVector}, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:180
[7] buildcolumns at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:192 [inlined]
[8] columns at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:228 [inlined]
[9] #DataFrame#453(::Bool, ::Type{DataFrame}, ::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\DataFrames\S3ZFo\src\other\tables.jl:40
[10] DataFrame(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\DataFrames\S3ZFo\src\other\tables.jl:31
[11] top-level scope at none:0

With JSONTables I need about 10 times less memory:


function getDf1(inDict)
    df=vcat(DataFrame.(inDict["simstates"])...)
    return df
end

function getDf2(inDict)
    df=DataFrame()
    for t in inDict["simstates"]
       d = DataFrame(t)
       append!(df,d)
    end
    return df
end

function getDf3(inDict)
    jt=jsontable(inDict["simstates"])
    df=DataFrame(jt)
    return df
end

@btime getDf1($inDict);
@btime getDf2($inDict);
@btime getDf3($inDict);


julia> @btime getDf1($inDict);
  154.891 ms (527550 allocations: 23.72 MiB)

julia> @btime getDf2($inDict);
  117.732 ms (432769 allocations: 20.44 MiB)

julia> @btime getDf3($inDict);
  43.288 ms (101468 allocations: 2.21 MiB)

For the order of magnitude (100k records) you mention, I get these results (1.6 seconds):


function create_dict(js)
    f = open(js, "r")
    inDict = JSON3.read(f)
    close(f)

    return inDict
end

julia> @btime inDict = create_dict(js);
  74.655 ms (27 allocations: 28.93 MiB)

julia> @btime simstates = getDf3(inDict);
  1.616 s (3755468 allocations: 81.04 MiB)

julia> @show size(simstates)
size(simstates) = (129601, 6)
(129601, 6)

julia> @show simstates[1:3,:]
simstates[1:3, :] = 3×6 DataFrame
│ Row │ simulation │ date       │ simvar1    │ simvar2    │ simvar3   │ ID    │
│     │ Int64      │ String     │ Float64    │ Float64    │ Float64   │ Int64 │
├─────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────
│ 1   │ 512        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 1     │
│ 2   │ 513        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 11    │
│ 3   │ 514        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 12    │

I’m getting the same stack overflow error as above:

julia> states = getDf3(inDict)
ERROR: StackOverflowError:
Stacktrace:
[1] Array at .\boot.jl:404 [inlined]
[2] allocatecolumn at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:107 [inlined]
[3] add_or_widen!(::Int64, ::Int64, ::Symbol, ::Array{Float64,1}, ::Int64, ::Base.RefValue{Any}, ::Base.HasLength) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:142
[4] __buildcolumns(::JSONTables.Table{false,JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\utils.jl:187

I’m going to switch over to a linux box I have running Julia1.4.1.

Well, maybe your JSON is structured differently than mine…

Furthermore, I wonder if things would speed up if you specify the date type explicitly. The JSON3 or JSONTables authors (or the readmes/docs) can probably help you with that.

Decided to try something completely different (still on Julia 1.3.1, Windows box)

function get_states2(statesArr)
    first = statesArr[1]
    out = DataFrame()
    for k in keys(first)
        out[!,Symbol(k)] = [s[k] for s in statesArr]
    end
    return(out)
end

julia> @btime vcat(DataFrame.(inDict["simstates"])...)
  2.939 s (8490496 allocations: 529.05 MiB)

julia> @btime get_states2(inDict["simstates"])
  136.951 ms (64 allocations: 6.87 MiB)

Your get_states2 function has the same performance as my getDf3 function (for me).

You may want to parse the date as such (see below).
Also, I personally would not declare a variable `first`, as there is a function with that name.


function get_states3(statesArr)
    #=
    statesArr=inDict["simstates"]
    =#
    fir = statesArr[1]
    out = DataFrame()
    dfmt=DateFormat("yyyy-mm-dd")
    for k in keys(fir)
        if k==:date
            out[!,Symbol(k)] = Date[Date(s[k],dfmt) for s in statesArr]
        else 
            out[!,Symbol(k)] = [s[k] for s in statesArr]
        end
    end
    return(out)
end


Thought about that. I am not doing any date manipulations, it is really just a key in the simulation and is used for grouping and aggregating results during the optimization. I figured a string was good enough.

I did find that I needed to convert back to the number types.

[s[k] for s in statesArr]

was making all numbers Real. That had performance implications later when using Query.jl for left joins based on the Int64 ID variable. I ended up with:

convert.(typeof(fir[k]),[s[k] for s in arr])
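For reference (a base-Julia idiom, not from this thread): broadcasting `identity` achieves the same narrowing without naming the target type, because broadcast computes the output eltype from the actual values rather than from the input container:

```julia
v = Real[1, 2, 3]   # abstractly typed, like the columns built from the JSON dicts
w = identity.(v)    # broadcast narrows the eltype to Int64
```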

Also, I personally would not declare a variable `first`, as there is a function with that name.

Yeah, I learned that the hard way yesterday and had already fixed it!

I know this is a tad dated at this point, but JSON3 recently had a new release that should drastically reduce memory usage when parsing default json (i.e. JSON3.read(input)). Which should translate to JSONTables.jl using much less memory.
