Hello everyone,
I have an ETL process that reads ~20,000 JSON files (averaging about 60 KB each), materializes them into nested structs, performs some operations, and saves them back to disk.
I recently tried transitioning from JSON3 to the new JSON (v1.0) because I find the @tags and @defaults syntax in the new version very convenient for managing my schema. However, I’ve hit a significant performance wall.
- JSON3: ~5 seconds to process the batch.
- JSON v1.0: ~50 seconds (an order of magnitude slower).
I suspect my bottleneck is either my specific use of the JSON.parse API or the inherent architectural difference between JSON3’s “Tape” approach and JSON’s “Truly Lazy” approach for deeply nested structs.
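Since the real job both reads and parses every file, one way to check whether parsing is really where the time goes is to time the two stages separately. Below is a minimal sketch using only Base. The name `time_stages` and the `"data"` directory are hypothetical, and `parse_fn` stands in for either library's typed parse call.

```julia
# Hypothetical helper: accumulates read time and parse time separately for
# every .json file in `dir`. `parse_fn` is any function taking a String.
function time_stages(parse_fn, dir::AbstractString)
    t_read = 0.0
    t_parse = 0.0
    nfiles = 0
    for path in readdir(dir; join=true)
        endswith(path, ".json") || continue
        t0 = time_ns()
        text = read(path, String)   # disk I/O only
        t1 = time_ns()
        parse_fn(text)              # parsing / materialization only
        t2 = time_ns()
        t_read += (t1 - t0) / 1e9
        t_parse += (t2 - t1) / 1e9
        nfiles += 1
    end
    return (; nfiles, t_read, t_parse)   # times are in seconds
end
```

With the structs from the MWE below, this could be called as `time_stages(s -> JSON3.read(s, Company), "data")` and `time_stages(s -> JSON.parse(s, Company), "data")`. The first call includes compilation time, so each variant should be run twice and only the second result compared.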
Environment
- Julia v1.12.6
- JSON v1.5.2
- JSON3 v1.14.3
MWE
In my actual code, I have a Company struct that contains a Vector of Department structs, each of which contains a Vector of Employee structs. Here is a simplified version, built with the help of AI:
using JSON, JSON3, StructTypes, BenchmarkTools
# Defining structs
struct Employee
    name::String
    salary::Float64
    id::Int
end
struct Department
    name::String
    employees::Vector{Employee}
end
struct Company
    name::String
    departments::Vector{Department}
end
StructTypes.StructType(::Type{Employee}) = StructTypes.Struct()
StructTypes.StructType(::Type{Department}) = StructTypes.Struct()
StructTypes.StructType(::Type{Company}) = StructTypes.Struct()
# Generate a massive nested dataset (100 depts x 100 employees = 10,000 sub-objects)
function generate_json()
    emp_json = join(["""{"name":"John Doe","salary":50000.0,"id":$i}""" for i in 1:100], ",")
    dept_json = join(["""{"name":"Dept $i","employees":[$emp_json]}""" for i in 1:100], ",")
    return """{"name":"MegaCorp","departments":[$dept_json]}"""
end
const json_string = generate_json()
# Benchmarks (in-memory)
println("Benchmarking JSON3.read (String):")
@btime JSON3.read(json_string, Company);
println("\nBenchmarking JSON.parse (String):")
@btime JSON.parse(json_string, Company);
# Benchmarks (on disk)
const filename = "test_data.json"
write(filename, json_string)
println("\nBenchmarking JSON3.read (File):")
@btime JSON3.read(read(filename,String), Company);
println("\nBenchmarking JSON.parsefile (File):")
@btime JSON.parsefile(filename, Company);
In these benchmarks I was not able to reproduce the order-of-magnitude difference I see in my real use case. Nevertheless, I think the gap is telling: JSON3 comes out about twice as fast as JSON.
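One stage the MWE leaves out is saving back to disk, which is part of the real pipeline. Whether serialization contributes to the gap is not known at this point; the sketch below only shows how the write side could be measured in the same way. The `Point` struct and the `points` vector are hypothetical stand-ins for the real schema.

```julia
using JSON, JSON3, StructTypes, BenchmarkTools

# Hypothetical stand-in for the real schema
struct Point
    x::Float64
    y::Float64
end
StructTypes.StructType(::Type{Point}) = StructTypes.Struct()

points = [Point(i, 2i) for i in 1:10_000]

# Interpolate with `$` so the benchmark does not measure global-variable access
println("Benchmarking JSON3.write:")
@btime JSON3.write($points);
println("\nBenchmarking JSON.json:")
@btime JSON.json($points);
```

If the two write timings are close, the serialization step can be ruled out and the remaining difference has to come from reading or from the operations in between.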
Since I am materializing the entire struct (no fields are skipped), is JSON3’s “Tape” simply more efficient for this kind of “full tree” materialization? I’ve read about the semi-lazy parsing in JSON3, which does not seem to be present in JSON 1.0. Maybe that is why I see better speeds with JSON3 than with JSON for this kind of big file?
Or am I doing something wrong when calling JSON.jl?
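To help answer that, a sampling profile of the parse call might show whether the time is spent inside the parser itself or in something my structs trigger. This is a minimal sketch using the Profile standard library; `profile_call` is a hypothetical helper, and the `mincount` threshold is an arbitrary choice to keep the output short.

```julia
using Profile

# Hypothetical helper: profiles `n` repeated calls of the zero-argument
# function `f` and prints a flat report sorted by sample count.
function profile_call(f; n::Int=50, mincount::Int=10)
    f()               # warm up so compilation is not part of the profile
    Profile.clear()   # drop samples left over from earlier runs
    @profile for _ in 1:n
        f()
    end
    Profile.print(format=:flat, sortedby=:count, mincount=mincount)
    return nothing
end
```

With the MWE above, this would be called as `profile_call(() -> JSON.parse(json_string, Company))` and, for comparison, `profile_call(() -> JSON3.read(json_string, Company))`.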
Any insights would be greatly appreciated!