JSON list of lists to Julia matrix, preferably fast and with low memory overhead

I have JSON files that contain a list of square matrices. Let's say each JSON has a single key "mats" whose value is a list of lists of lists. Each element of the outer list is one matrix in row-major order; that is, each inner list is a row of that matrix.

(Is this a good way to pass matrices around? Probably not. But it’s out of my control for now.)

Typically there are ~100 matrices of size ~1000x1000. Each matrix is square and all are the same size. The elements are real-valued: either 0 or a float.
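
For concreteness, here is the layout scaled down to two 2x2 matrices, written as a Julia string (the values are invented purely for illustration):

example_json = """
{"mats": [[[0.27, 0.0],
           [0.0,  0.23]],
          [[0.0,  0.09],
           [0.11, 0.0 ]]]}
"""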

I’d like to parse the JSON file and transform to a list of regular Julia matrices, and I’m concerned with performance both in time and memory usage. Some strategies I’ve tried and associated problems:

  • Use JSON.jl to parse. Problem: parsing is slower than I'd like, and there seems to be some issue with JSON not freeing memory. Also, the parsed arrays have type Vector{Any}, which doesn't seem ideal for performance.

  • Use JSON3.jl to parse. JSON3 parsing is fast and memory efficient, but the natural patterns for accessing the parsed, nested JSON3.Array elements can be prohibitively slow: hundreds of times slower than accessing the Array{Any} returned by JSON.jl (see the GitHub issue). I'm not sure whether this is a bug or expected behavior, but either way this strategy ends up quite slow. For example, compare converting a list of lists parsed with JSON versus JSON3 (a sketch of how these parsed objects were obtained follows this list):

mat(arrs) = [arrs[i][j] for i in 1:length(arrs), j in 1:length(arrs)]

julia> arrs_json[1:2]
2-element Array{Any,1}:
 Any[0.2727272727272727, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.09090909090909091, 0.0  …  0.0, 0.09090909090909091, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 Any[0.0, 0.23529411764705882, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

julia> typeof(arrs_json)
Array{Any,1}

julia> @belapsed mat($arrs_json)
0.054457167

julia> arrs_j3[1:2]
2-element Array{JSON3.Array,1}:
 [0.2727272727272727, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.09090909090909091, 0.0  …  0.0, 0.09090909090909091, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 Union{Float64, Int64}[0, 0.23529411764705882, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

julia> typeof(arrs_j3)
JSON3.Array{JSON3.Array,Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}

julia> @belapsed mat($arrs_j3)
9.723495698  # 180x slower

  • Use JSON3.jl to parse, then convert to standard Julia arrays with copy. Problem: extra time and memory overhead from the extra copy. Also, the standard Julia arrays returned by copying the JSON3-parsed object perform far better than JSON3.Array, but still worse than the Vector{Any} from JSON.jl. I'm not sure why the extra type information hurts performance, but here we are:
julia> arrs_j3_copy = copy(arrs_j3);

julia> arrs_j3_copy[1:2]
2-element Array{Array{T,1} where T,1}:
 [0.2727272727272727, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.09090909090909091, 0.0  …  0.0, 0.09090909090909091, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 Real[0, 0.23529411764705882, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

julia> typeof(arrs_j3_copy)
Array{Array{T,1} where T,1}

julia> @belapsed mat($arrs_j3_copy)
0.081806462  # 1.5x slower than arrs_json
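
For reference, here is roughly how the parsed objects in the bullets above were obtained; the filename "data.json" and the "mats" key are assumptions based on the layout described at the top, and only the first matrix's rows are pulled out:

using JSON, JSON3
data = read("data.json", String)
arrs_json = JSON.parse(data)["mats"][1]  # rows of the first matrix, as a Vector{Any}
arrs_j3   = JSON3.read(data)[:mats][1]   # the same rows, as a nested JSON3.Array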

So any tips would be much appreciated - thanks!


I also commented on your GitHub issue, but I would suggest trying to pass the type to JSON3, e.g. for a nested array:

julia> JSON3.read("[[1.1, 2.2], [3.3, 4.4]]", Vector{Vector{Float64}})
2-element Array{Array{Float64,1},1}:
 [1.1, 2.2]
 [3.3, 4.4]

Thanks for the tip. How does this work when the JSON blob to be parsed has multiple keys, only one of which is the list of lists of lists defining these matrices? Then the type I want the JSON to parse into seems to be something like Dict{Symbol, Any}, but passing that type as the second argument to JSON3.read makes parsing slower than JSON.parse.
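
For clarity, the variant alluded to above is roughly the following, where data is the raw JSON string:

# Parse the whole blob into a plain Dict rather than a typed struct.
# In my tests this ends up slower than JSON.parse on the same input.
d = JSON3.read(data, Dict{Symbol, Any})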

In case anyone has a similar problem in the future, here’s the best I’ve been able to come up with so far, both using JSON and JSON3. These solutions are both a tad convoluted, but seem to benchmark better than the more obvious choices.

using JSON, JSON3, BenchmarkTools
n = 500
rand_square(n) = [ifelse(rand() < .1, rand(), 0) for i in 1:n, j in 1:n]
data_str = JSON.json(Dict("a" => "hi", "mat" => rand_square(n)))

"Convert a list of lists representing matrix rows to a dense Float64 matrix."
function to_mat(arrs::JSON3.Array)
    return Matrix(hcat(typed_copy.(arrs)...)')
end
function to_mat(arrs) # for lists-of-lists parsed by JSON
    return Matrix(hcat(Vector{Float64}.(arrs)...)')
end

"Copy a JSON3.Array containing real valued elements into a regular Vector{Float64}."
function typed_copy(j3arr)
    n = length(j3arr)
    x = Vector{Float64}(undef, n)
    for i in 1:n
        x[i] = j3arr[i]
    end
    return x
end

"Use JSON to parse a JSON string and then conver the mat key to a Matrix."
function json_build_mat(data)
    d = JSON.parse(data)
    arrs = d["mat"]
    return to_mat(arrs)
end

"Use JSON3 to parse a JSON string and then conver the mat key to a Matrix."
function json3_build_mat(data)
    d = JSON3.read(data)
    arrs = d[:mat]
    return to_mat(arrs)
end

println("JSON parse and convert time: ", @belapsed json_build_mat($data_str))
println("JSON3 parse and convert time: ", @belapsed json3_build_mat($data_str))
---
JSON parse and convert time: 0.018888301
JSON3 parse and convert time: 0.0174836
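
Applied to the real files described at the top (a "mats" key holding a list of matrices), the same helpers would be used along these lines; "data.json" is an assumed filename:

d = JSON3.read(read("data.json", String))
mats = [to_mat(arrs) for arrs in d[:mats]]  # Vector of dense Float64 matrices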

So ultimately there’s no meaningful time saving from using JSON3 over JSON, but it still seems better because of lower memory usage during parsing and the aforementioned failure of JSON to release memory.

Given that the JSON3 docs warn against using getindex on JSON3.Array:

PLEASE NOTE that iterating a JSON3.Array will be much more performant than calling getindex on each index due to the internal “view” nature of the array.

I was super surprised that this typed_copy seemed as fast as less-readable versions that iterated with for val in j3arr (one such variant is sketched after the timings below). And similarly, a straight call to convert or a broadcasted Float64 is comparatively glacial:

julia> arr_to_convert = JSON3.read(data_str)[:mat][1];

julia> typeof(arr_to_convert)
JSON3.Array{Union{Float64, Int64},Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}

julia> @belapsed convert(Vector{Float64}, $arr_to_convert)
0.0001057

julia> @belapsed Float64.($arr_to_convert)
0.000104999

julia> @belapsed typed_copy($arr_to_convert)
6.5998e-6 # 15x faster than the above
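
For completeness, an iteration-based variant of typed_copy would look something like this (a sketch; it follows the docs' advice to iterate rather than call getindex, yet it didn't benchmark faster than the getindex loop above):

function typed_copy_iter(j3arr)
    x = Vector{Float64}(undef, length(j3arr))
    for (i, val) in enumerate(j3arr)  # iterate the JSON3.Array instead of indexing it
        x[i] = val
    end
    return x
end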

Lastly, on a lighter and more flippant note: I really goofed. Given that NumPy has row-major arrays, this task would be easy and fast in Python, so I should have titled this topic "converting JSON to matrix is much slower than in Python" and watched the performance sharks dive in :slight_smile:


I was able to get about a factor-of-2 speedup by using the StructTypes interface, assuming you know the schema of your JSON:

julia> println("JSON3 parse and convert time: ", @belapsed json3_build_mat($data_str))
JSON3 parse and convert time: 0.0247677

julia> println("JSON3 types parse and convert time: ", @belapsed json3_typed($data_str))
JSON3 types parse and convert time: 0.0101308

Here's the definition, added on top of your setup:

using StructTypes
struct DataSchema
    mat::Vector{Vector{Float64}}
    a::String
end
StructTypes.StructType(::Type{DataSchema}) = StructTypes.Struct()

function json3_typed(data)
    d = JSON3.read(data, DataSchema)
    return to_mat(d.mat)
end

Ah that’s really nice, thank you for the clarification. I think in my production use case I can’t use StructTypes.Struct() because of the docs warning about order of JSON fields; my Julia code can’t guarantee the order. But it’s a good trick to know, and in this case I might be able to leverage StructTypes.Mutable().

(For a little context: we’re running a Julia microservice. Args are passed as JSON, and the idea is to just pass the args directly to the “actually do the work” functions as kwargs. It sounds like getting a proper speedup here would require defining a mutable struct type that knows the JSON keys it might receive, which in turn blurs the separation between the “parse args and pass as keywords” part of the code and the “do work” part. Maybe that’s inevitable, or worth the cost.)
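
For the record, here is a minimal sketch of the StructTypes.Mutable() approach mentioned above, assuming the same toy field names as DataSchema; the no-argument constructor plus Mutable() lets JSON3 fill the fields in whatever order the keys arrive:

using JSON3, StructTypes

mutable struct DataSchemaMut
    mat::Vector{Vector{Float64}}
    a::String
    DataSchemaMut() = new()  # fields are set by JSON3 as keys are encountered
end
StructTypes.StructType(::Type{DataSchemaMut}) = StructTypes.Mutable()

# Usage, mirroring json3_typed above:
# d = JSON3.read(data_str, DataSchemaMut)
# to_mat(d.mat)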