JSON.jl (v1.0) vs JSON3.jl for intensive struct materialization (20,000+ files)

Hello everyone

I have an ETL process that reads ~20,000 JSON files (avg. 60 KB each), materializes them into nested structs, performs some operations, and saves them back to disk.

I recently tried transitioning from JSON3 to the new JSON (v1.0) because I find the @tags and @defaults syntax in the new version very convenient for managing my schema. However, I’ve hit a significant performance wall.

  • JSON3: ~5 seconds to process the batch.
  • JSON v1.0: ~50 seconds (an order of magnitude slower).

I suspect my bottleneck is either my specific use of the JSON.parse API or the inherent architectural difference between JSON3’s “Tape” approach and JSON’s “Truly Lazy” approach for deeply nested structs.
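For context, here is my understanding of the difference in a sketch (the lazy-navigation API names, `JSON.lazy` and the `[]` materialization syntax, are taken from my reading of the JSON v1.0 docs, so treat them as assumptions rather than gospel): JSON3 parses the whole input into a compact tape in one pass, while JSON's lazy layer only walks the buffer when a value is actually accessed.

```julia
using JSON  # v1.x

str = """{"name":"MegaCorp","departments":[{"name":"Dept 1","employees":[]}]}"""

# Eager: materialize everything into generic Dicts/Vectors
# (or into a concrete type via JSON.parse(str, T)).
obj = JSON.parse(str)

# Lazy: JSON.lazy returns a LazyValue that scans the buffer on demand;
# nothing is materialized until explicitly requested.
lazyval = JSON.lazy(str)
name = lazyval.name[]   # materialize just this one field (assumed `[]` syntax)
```

If that mental model is right, the lazy machinery buys nothing when every field ends up materialized anyway, which is exactly my "full tree" workload.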

Environment

  • Julia v1.12.6
  • JSON v1.5.2
  • JSON3 v1.14.3

MWE

In my actual code, I have a Company struct that contains a Vector of Department structs, which in turn contain a Vector of Employee structs. Here is a simplified version, built with the help of AI:

using JSON, JSON3, StructTypes, BenchmarkTools

# Defining structs
struct Employee
    name::String
    salary::Float64
    id::Int
end

struct Department
    name::String
    employees::Vector{Employee}
end

struct Company
    name::String
    departments::Vector{Department}
end

StructTypes.StructType(::Type{Employee}) = StructTypes.Struct()
StructTypes.StructType(::Type{Department}) = StructTypes.Struct()
StructTypes.StructType(::Type{Company}) = StructTypes.Struct()

# Generate a massive nested dataset (100 depts x 100 employees = 10,000 sub-objects)
function generate_json()
    emp_json = join(["""{"name":"John Doe","salary":50000.0,"id":$i}""" for i in 1:100], ",")
    dept_json = join(["""{"name":"Dept $i","employees":[$emp_json]}""" for i in 1:100], ",")
    return """{"name":"MegaCorp","departments":[$dept_json]}"""
end

const json_string = generate_json()

# Benchmarks (in-memory)
println("Benchmarking JSON3.read (String):")
@btime JSON3.read(json_string, Company);

println("\nBenchmarking JSON.parse (String):")
@btime JSON.parse(json_string, Company);

# Disk benchmark
const filename = "test_data.json"
write(filename, json_string)

println("\nBenchmarking JSON3.read (File):")
@btime JSON3.read(read(filename, String), Company);

println("\nBenchmarking JSON.parsefile (File):")
@btime JSON.parsefile(filename, Company);

In these benchmarks I was not able to reproduce the order-of-magnitude difference from my real use case, but the gap between the two packages still seems telling: on my machine JSON3 comes out roughly twice as fast as JSON.

Since I am materializing the entire struct (no fields are skipped), is JSON3’s “Tape” simply more efficient for this kind of “full tree” materialization? I’ve read about semi-lazy parsing in JSON3, which does not seem to be present in JSON 1.0. Could that be why I see better speeds from JSON3 than from JSON on files this large?

or

Am I doing something wrong when calling JSON.jl?

Any insights would be greatly appreciated!

If possible, find a big representative file and try @btime-ing the file read, the typed parse, and the file save separately, without the in-between operations. Several performance discrepancies (and non-discrepancies) may be getting lumped together in your real usage that the typed-parse benchmark alone doesn’t capture. I’m seeing the same, if smaller, gap in the benchmark, though, and the file read at least seems to be a negligible fraction.
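Concretely, splitting the phases could look something like this (a sketch that assumes the MWE above has already been run, so the struct definitions and test_data.json exist; `JSON.json` returning a String is the v1.x API as I understand it):

```julia
using JSON, BenchmarkTools

fname = "test_data.json"

# Phase 1: raw file read only, no parsing
@btime read($fname, String);

# Phase 2: typed parse only, on an already-read string
str = read(fname, String)
@btime JSON.parse($str, Company);

# Phase 3: file save only, serializing an already-materialized object
company = JSON.parse(str, Company)
@btime write("roundtrip_tmp.json", JSON.json($company));
```

If the parse phase alone doesn’t show the 10x gap, the real-world slowdown is likely hiding in one of the other phases (or in the in-between operations).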

Benchmarking JSON3.read (String):
  2.604 ms (40909 allocations: 1.67 MiB)

Benchmarking JSON.parse (String):
  3.499 ms (33427 allocations: 1.47 MiB)

Benchmarking JSON3.read (File):
  2.723 ms (40919 allocations: 2.10 MiB)

Benchmarking JSON.parsefile (File):
  3.519 ms (33437 allocations: 1.90 MiB)

As an aside, remember to interpolate with $json_string so @btime treats the value as an externally evaluated argument, if that is what you intend in practice. It was negligible in this benchmark, if it happened at all, but the compiler sometimes optimizes away simple operations on constants:

julia> const x = 2.0
2.0

julia> @btime x^100 # x and ^ are constant variables, 100 is constant literal
  1.000 ns (0 allocations: 0 bytes)
1.2676506002282294e30

julia> @btime $x^100 # x treated as argument, ^ and 100 are not
  9.510 ns (0 allocations: 0 bytes)
1.2676506002282294e30