Efficiently Read JSON and Create DataFrame

Dominic_Pazzula · June 9, 2020, 6:25pm

I am taking in a simulation from another process written in existing software. It outputs a JSON package of a few tables (each stored as an array of objects). I am running an optimization in JuMP based on the simulation. The optimization runs in a few seconds, but reading the JSON and converting it into a DataFrame takes a long time, especially converting the large simulation table.

I’m hoping that I am doing something inefficiently that can be improved.

function read_json(file)
    open(file,"r") do f
        global inDict
        inDict = JSON.parse(f)
    end
    return inDict
end
inDict = read_json(file)

println("Creating Data Frames")
simstates = vcat(DataFrame.(inDict["simstates"])...)

After calling the function, I ran a benchmark on the read:

julia> @benchmark inDict = read_json()
BenchmarkTools.Trial:
memory estimate: 1.79 GiB
allocs estimate: 29547484

minimum time: 13.738 s (8.85% GC)
median time: 13.738 s (8.85% GC)
mean time: 13.738 s (8.85% GC)
maximum time: 13.738 s (8.85% GC)

samples: 1
evals/sample: 1

Then a benchmark on the conversion to DataFrame

@benchmark simstates = vcat(DataFrame.(inDict["simstates"])...)
BenchmarkTools.Trial:
memory estimate: 3.95 GiB
allocs estimate: 65015493

minimum time: 26.722 s (8.33% GC)
median time: 26.722 s (8.33% GC)
mean time: 26.722 s (8.33% GC)
maximum time: 26.722 s (8.33% GC)

samples: 1
evals/sample: 1

So about 40 seconds to read in Simulation data. After that it takes about 5 seconds to do the optimization.

simstates is 765000×6 DataFrame

This is actually a small-ish simulation. I expect a production run to be a multiple of this.

Oscar_Smith · June 9, 2020, 6:30pm

What happens to the times if you replace read_json with

function read_json(file)
    open(file,"r") do f
        return JSON.parse(f)
    end
end

You also might want to try JSON3 (quinnj/JSON3.jl on GitHub).
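For readers wondering why the `global` is unnecessary: `open(...) do ... end` returns whatever the do-block evaluates to, so the parsed value can be returned directly. A minimal Base-only sketch of that behavior, using a temporary text file and a hypothetical `read_text` helper in place of the JSON file and `JSON.parse`:

```julia
# Write a small temporary file so the sketch is self-contained.
path, io = mktemp()
write(io, "hello")
close(io)

# `open(file) do f ... end` returns the value of the do-block and
# closes the file afterwards, so no global is needed to get the result out.
function read_text(file)
    open(file, "r") do f
        read(f, String)
    end
end

contents = read_text(path)
```

As reported in the thread, this tidies the code but should not be expected to change the parse time.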

Dominic_Pazzula · June 9, 2020, 6:44pm

Same, about 13 seconds.

I’ll take a look at JSON3.

Dominic_Pazzula · June 9, 2020, 7:00pm

If it helps, JSON is structured like so:

{ 
    "var1": 12.34,
    "var2": 12.34,
    "current": [
        {
            "ID": 1,
            "var3": 12.34,
            "var4": 12.34
        },
        ...
    ],
    "potential": [
        {
            "ID": 10,
            "var3": 12.34,
            "var4": 12.34
        },
        ...
    ],
    "simstates": [
        {
            "simulation": 512,
            "date": "2020-12-31",
            "simvar1": -0.013495534501,
            "simvar2": -0.013495534501,
            "simvar3": 0.013495534501,
            "ID": 1
        },
        ...
    ]
}

lwabeke · June 10, 2020, 11:44am

Hi

There are a few JSON packages available: JSON.jl, JSON2.jl, JSON3.jl, LazyJSON.jl are ones I’m aware of.

See "[ANN] JSON3.jl - Yet another JSON package for Julia" for some details.

My summary:

JSON.jl - The original and does proper handling, but slow and very memory intensive. Uses parse to parse into a dict
LazyJSON.jl - Memory efficient, which makes it relatively fast. Parse also gives a “dict” type interface.
JSON2.jl & JSON3.jl - Claim to be fast. Parse into a provided type. At some point I looked at them and got the impression one (or both?) was cheating a bit: it assumed the JSON fields would match the sequence of fields defined in the structure (to speed things up, it didn’t check whether the labels matched). This works if the JSON string is created by the same package, but could break if it was produced by another encoder. I’m not sure if this is still the case and/or if I just misunderstood the code.

I would assume LazyJSON would shine if you only need to access a subset of the JSON, but as part of the testing of the Unmarshal.jl package I got the impression that even if I unmarshal the whole object LazyJSON could still outperform JSON, but it depends on the size and complexity of the structure.

The Unmarshal.jl package can be used to convert from the JSON.jl and LazyJSON.jl dict interface to a Julia type object, which might be an alternative to what you're doing in:

simstates = vcat(DataFrame.(inDict["simstates"])...)

It is however focused on functionality and not really performance, in particular since working with the original JSON.parse, it seemed the JSON.parse dominated timing compared to the Unmarshal step.

quinnj · June 10, 2020, 1:01pm

Your descriptions of the various JSON packages are mostly correct, but I can add some additional color:

JSON.jl: you’re correct, it’s the old standard, but doesn’t try to do anything fancy w/ performance
LazyJSON.jl: completely lazy, but has a surprisingly high fixed cost just to create the LazyJSON.Object object, which I found could be prohibitive for really small objects. The biggest disadvantage is the package is basically unmaintained; no real commits in years and I suspect the code is getting pretty stale now
JSON2/JSON3.jl: JSON2.jl is effectively archived at this point and shouldn’t be used. JSON3.jl is the successor there and is indeed fast (this benchmarks the OP’s json data they posted in this thread):

julia> @btime JSON3.read(json);
  897.000 ns (2 allocations: 4.34 KiB)

julia> @btime JSON.parse(json);
  2.920 μs (61 allocations: 3.66 KiB)

It takes a hybrid lazy approach to avoid allocating when it doesn’t need to, making it very efficient when you only need a subset of the object.

There’s also the JSONTables.jl that makes the tables-to-json and back experience more convenient. That is, you can do df = DataFrame(jsontable(json_string)) and it just works.

Dominic_Pazzula · June 10, 2020, 1:22pm

There’s also the JSONTables.jl that makes the tables-to-json and back experience more convenient. That is, you can do df = DataFrame(jsontable(json_string)) and it just works.

I was looking at this the other day. How would that work given that I have multiple tables inside the JSON string? That is, how could I easily use JSONTables.jl to pull out the “current,” “potential,” and “simstates” tables? Is that even possible with that library?

quinnj · June 10, 2020, 1:37pm

Ah, I didn’t realize each of those variables was a table. JSONTables.jl is more geared towards when your entire file/json is an array of objects or object of arrays. In your case, you can just do:

julia> x = JSON3.read(json)
JSON3.Object{Base.CodeUnits{UInt8,String},Array{UInt64,1}} with 5 entries:
  :var1      => 12.34
  :var2      => 12.34
  :current   => JSON3.Object[{…
  :potential => JSON3.Object[{…
  :simstates => JSON3.Object[{…

julia> cur = DataFrame(x.current)
1×3 DataFrame
│ Row │ ID    │ var3    │ var4    │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ 12.34   │ 12.34   │

julia> pot = DataFrame(x.potential)
1×3 DataFrame
│ Row │ ID    │ var3    │ var4    │
│     │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 10    │ 12.34   │ 12.34   │

julia> sim = DataFrame(x.simstates)
1×6 DataFrame
│ Row │ simulation │ date       │ simvar1    │ simvar2    │ simvar3   │ ID    │
│     │ Int64      │ String     │ Float64    │ Float64    │ Float64   │ Int64 │
├─────┼────────────┼────────────┼────────────┼────────────┼───────────┼───────┤
│ 1   │ 512        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 1     │

And that should be pretty efficient.

Dominic_Pazzula · June 10, 2020, 2:09pm

Thanks for the help. Unfortunately DataFrame(x.simstates) throws an out of memory error and then all hell breaks loose. This is on a medium amount of data – like I said above I expect a production run to be a multiple of this.

12GB memory on this machine and about 70% free before I start Julia.

What’s the reason for the less efficient memory usage?

YongHee-Kim · June 10, 2020, 2:39pm

It’s not documented, and I have no users other than the dev team in my company. But maybe the JSONPointer part of XLSXasJSON will suit your needs?

We don’t construct a DataFrame; we use the nested dictionary of the JSON as it is.

This is kind of clunky because the package is intended to read data from an XLSX table.

using XLSXasJSON
import XLSXasJSON.

colname = ["var1" "var2" "current/1/ID" "current/1/var3" "current/1/var4" "potential/1/ID" "potential/1/var3" "potential/1/var4" "simstates/1/simulation" "simstates/1/date" "simstates/1/simvar1" "simstates/1/simvar2" "simstates/1/simvar3" "simstates/1/ID"] 
row1 = [12.34 12.34 1 12.34 12.34 10 12.34 12.34 512 "2020-12-31" -0.0133 -0.0133 -0.0133 1]

jws =  JSONWorksheet("a.xlsx", "sheet1", [colname;row1])

# now accessing data with JSONPointer
jws[1, j"/current"]
1-element Array{Any,1}:
 OrderedCollections.OrderedDict{String,Any}("ID" => 1.0,"var3" => 12.34,"var4" => 12.34)

jws[1, j"/current/1"]
OrderedCollections.OrderedDict{String,Any} with 3 entries:
  "ID"   => 1.0
  "var3" => 12.34
  "var4" => 12.34

jws[1, j"/current/1/var3"]
12.34

Dominic_Pazzula · June 10, 2020, 2:43pm

Reduced the data to 100,000 records. DataFrame(x.simstates) throws a StackOverflowError

ERROR: StackOverflowError:
Stacktrace:
[1] Array at .\boot.jl:404 [inlined]
[2] allocatecolumn at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:107 [inlined]
[3] add_or_widen!(::Int64, ::Int64, ::Symbol, ::Array{Float64,1}, ::Int64, ::Base.RefValue{Any}, ::Base.HasShape{1}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:142
[4] __buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\utils.jl:187
[5] __buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:163 (repeats 6523 times)
[6] _buildcolumns(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::JSON3.Object{Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::NTuple{6,Tables.EmptyVector}, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:180
[7] buildcolumns at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:192 [inlined]
[8] columns at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:228 [inlined]
[9] #DataFrame#453(::Bool, ::Type{DataFrame}, ::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\DataFrames\S3ZFo\src\other\tables.jl:40
[10] DataFrame(::JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\DataFrames\S3ZFo\src\other\tables.jl:31
[11] top-level scope at none:0

bernhard · June 10, 2020, 2:43pm

with JSONTables I need about 10 times less memory


function getDf1(inDict)
    df=vcat(DataFrame.(inDict["simstates"])...)
    return df
end

@btime getDf1($inDict)

function getDf2(inDict)
    df=DataFrame()
    for t in inDict["simstates"]
       d = DataFrame(t)
       append!(df,d)
    end
    return df
end

function getDf3(inDict)
    jt=jsontable(inDict["simstates"])
    df=DataFrame(jt)
    return df
end

@btime getDf1($inDict);
@btime getDf2($inDict);
@btime getDf3($inDict);


julia> @btime getDf1($inDict);
  154.891 ms (527550 allocations: 23.72 MiB)

julia> @btime getDf2($inDict);
  117.732 ms (432769 allocations: 20.44 MiB)

julia> @btime getDf3($inDict);
  43.288 ms (101468 allocations: 2.21 MiB)

For the order of magnitude (100k) you mention, I get these results (1.6 seconds):


function create_dict(js)
    # the do-block closes the file even if parsing throws
    open(js, "r") do f
        JSON3.read(f)
    end
end

julia> @btime inDict=create_dict(js);
  74.655 ms (27 allocations: 28.93 MiB)

julia> @btime simstates = getDf3(inDict);
  1.616 s (3755468 allocations: 81.04 MiB)

julia> @show size(simstates)
size(simstates) = (129601, 6)
(129601, 6)

julia> @show simstates[1:3,:]
simstates[1:3, :] = 3×6 DataFrame
│ Row │ simulation │ date       │ simvar1    │ simvar2    │ simvar3   │ ID    │
│     │ Int64      │ String     │ Float64    │ Float64    │ Float64   │ Int64 │
├─────┼────────────┼────────────┼────────────┼────────────┼───────────┼───────┤
│ 1   │ 512        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 1     │
│ 2   │ 513        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 11    │
│ 3   │ 514        │ 2020-12-31 │ -0.0134955 │ -0.0134955 │ 0.0134955 │ 12    │

Dominic_Pazzula · June 10, 2020, 2:59pm

I’m getting the same stack overflow error as above:

julia> states = getDf3(inDict)
ERROR: StackOverflowError:
Stacktrace:
[1] Array at .\boot.jl:404 [inlined]
[2] allocatecolumn at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:107 [inlined]
[3] add_or_widen!(::Int64, ::Int64, ::Symbol, ::Array{Float64,1}, ::Int64, ::Base.RefValue{Any}, ::Base.HasLength) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\fallbacks.jl:142
[4] __buildcolumns(::JSONTables.Table{false,JSON3.Array{JSON3.Object,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}}, ::Tuple{Int64,Int64}, ::Tables.Schema{(:simulation, :date, :net_cf, :net_cf_t, :value, :fundID),nothing}, ::Tuple{Array{Int64,1},Array{String,1},Array{Float64,1},Array{Float64,1},Array{Float64,1},Array{Int64,1}}, ::Int64, ::Base.RefValue{Any}) at C:\Users\dpazzula.juliapro\JuliaPro_v1.3.1-2\packages\Tables\okt7x\src\utils.jl:187

I’m going to switch over to a Linux box I have running Julia 1.4.1.

bernhard · June 10, 2020, 3:19pm

Well, maybe your JSON is structured differently than mine…

Furthermore, I wonder if things speed up, if you specify the date type explicitly. the JSON3 or JSONTables authors (or the readmes/docs) can probably help you with that.

Dominic_Pazzula · June 10, 2020, 3:27pm

Decided to try something completely different (still on Julia 1.3.1, Windows box)

function get_states2(statesArr)
    first = statesArr[1]
    out = DataFrame()
    for k in keys(first)
        out[!,Symbol(k)] = [s[k] for s in statesArr]
    end
    return(out)
end

julia> @btime vcat(DataFrame.(inDict["simstates"])...)
2.939 s (8490496 allocations: 529.05 MiB)
julia> @btime get_states2(inDict["simstates"])
136.951 ms (64 allocations: 6.87 MiB)
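One plausible reading of the gap between the two timings: the row-wise version builds one DataFrame per record before concatenating, while the column-wise version allocates one vector per column. A minimal Base-only sketch of the column-wise idea, assuming the rows are dictionaries with string keys (the sample rows and the `columntable_from_rows` name are made up for illustration):

```julia
# Hypothetical stand-in for inDict["simstates"]: a vector of row dictionaries.
rows = [
    Dict{String,Any}("simulation" => 512, "date" => "2020-12-31", "ID" => 1),
    Dict{String,Any}("simulation" => 513, "date" => "2020-12-31", "ID" => 11),
]

# Build one vector per key (keys taken from the first row) and collect
# the vectors into a NamedTuple of columns.
function columntable_from_rows(rows)
    ks = collect(keys(first(rows)))
    return (; (Symbol(k) => [r[k] for r in rows] for k in ks)...)
end

cols = columntable_from_rows(rows)
```

A NamedTuple of equal-length vectors can be passed to the `DataFrame` constructor, so `DataFrame(cols)` should produce the same kind of table as `get_states2`. Like `get_states2`, this assumes every row has the same keys as the first one.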

bernhard · June 11, 2020, 10:45am

Your get_states2 function has the same performance as my getDf3 function (for me).

You may want to parse the date as a Date (see below).
Also I personally would not declare a variable ‘first’ as there is a function with that name.


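using Dates, DataFrames

function get_states3(statesArr)
    #=
    statesArr=inDict["simstates"]
    =#
    fir = statesArr[1]
    out = DataFrame()
    # build the date format once, outside the loop
    dfmt = DateFormat("yyyy-mm-dd")
    for k in keys(fir)
        # keys are Symbols for a JSON3.Object; with JSON.jl they are Strings,
        # so compare against "date" there instead
        if k == :date
            out[!,Symbol(k)] = Date[Date(s[k],dfmt) for s in statesArr]
        else
            out[!,Symbol(k)] = [s[k] for s in statesArr]
        end
    end
    return out
end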

Dominic_Pazzula · June 11, 2020, 12:18pm

Thought about that. I am not doing any date manipulations, it is really just a key in the simulation and is used for grouping and aggregating results during the optimization. I figured a string was good enough.

I did find that I needed to convert back to the number types.

[s[k] for s in statesArr]

was making all numbers Real. That had performance implications later when using Query.jl for left joins based on the Int64 ID variable. I ended up with:

convert.(typeof(fir[k]),[s[k] for s in arr])
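The `Real` element type most likely comes from how comprehensions choose an element type: when a column holds both integer-looking and floating-point values (for example `0` next to `0.5`), Julia widens to their common supertype. A small Base-only sketch with made-up rows, showing the effect and a typed comprehension as one alternative to `convert.`:

```julia
# Made-up rows where one float happens to be written as an integer in the JSON.
rows = [Dict{String,Any}("value" => 0), Dict{String,Any}("value" => 0.5)]

# Untyped comprehension: the element type widens to the join of Int64 and Float64.
raw = [r["value"] for r in rows]

# Typed comprehension: every element is converted to Float64 on insertion.
typed = Float64[r["value"] for r in rows]
```

Note that `convert.(typeof(fir[k]), ...)` relies on the first row having the intended type. With data like the sketch above, the first value is an `Int`, so converting `0.5` to that type would throw an `InexactError`; naming the column type explicitly avoids that.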

Also I personally would not declare a variable ‘first’ as there is a function with that name.

Yeah, I learned that the hard way yesterday and had already fixed it!

quinnj · July 27, 2020, 12:48pm

I know this is a tad dated at this point, but JSON3 recently had a new release that should drastically reduce memory usage when parsing default json (i.e. JSON3.read(input)), which should translate to JSONTables.jl using much less memory.

stene · February 19, 2025, 10:57am

Posting this so that other people can find it. This method works on large-scale, real-world files:

using JSON
using DataFrames
using TidierData

dataSet=JSON.parsefile("testfile.json")
df=DataFrame(dataSet)
friends=@unnest_wider(df, friends)
hobbies_long=@unnest_longer(friends, hobbies)
hobbies=@unnest_wider(hobbies_long,address)

{
    "name": "Chris",
    "age": 23,
    "address": {
        "city": "New York",
        "country": "America"
    },
    "friends": [
        {
            "name": "Emily",
            "hobbies": [ "biking", "music", "gaming" ]
        },
        {
            "name": "John",
            "hobbies": [ "soccer", "gaming" ]
        }
    ]
}

rocco_sprmnt21 · February 19, 2025, 1:05pm

could you try with this file and show how it works?

using YFinance, JSON, DataFrames

aapl_json=get_quoteSummary("AAPL")
