LazyJSON.jl implements yet another way of reading JSON data in Julia. I wrote this as a proof of concept and it is probably not production ready, but if you work with JSON data in a performance-sensitive application, this approach might be beneficial. Documentation is in README.md.
LazyJSON.jl provides direct access to values stored in a JSON text through standard Julia interfaces: Number, AbstractString, AbstractVector and AbstractDict.
LazyJSON is lazy in the sense that it does not process any part of the JSON text until values are requested through the AbstractVector and AbstractDict interfaces.
For example, j = LazyJSON.value(jsontext) does no parsing and immediately returns a thin wrapper object.
j["foo"] calls get(::AbstractDict, "foo"), which parses just enough to find the "foo" field.
j["foo"][4] calls getindex(::AbstractArray, 4), which continues parsing up to the fourth item in the array.
This results in much less memory allocation compared to non-lazy parsers:
JSON.jl:
j = String(read("ec2-2016-11-15.normal.json"))
julia> function f(json)
           v = JSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
0.066773 seconds (66.43 k allocations: 7.087 MiB)
"Availability Zone"
LazyJSON.jl:
julia> function f(json)
           v = LazyJSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end
julia> @time f(j)
0.001392 seconds (12 allocations: 384 bytes)
"Availability Zone"
LazyJSON’s AbstractString and Number implementations are lazy too.
The text of a LazyJSON.Number is not parsed to Int64 or Float64 form until it is needed for a numeric operation. If the number is only used in a textual context, it need never be parsed at all, e.g.
j = LazyJSON.value(jsontext)
html = """<img width=$(j["width"]), height=$(j["height"])>"""
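The idea of a number that stays as text until arithmetic needs it can be sketched in a few lines. This is a toy illustration only, not LazyJSON's implementation: ToyNumber, text and value are hypothetical names, and the byte ranges are supplied by hand rather than found by a parser.

```julia
# Toy sketch (hypothetical names, not LazyJSON's API):
# a number is just a byte range inside the original JSON text.
struct ToyNumber
    s::String   # complete JSON text
    i::Int      # byte index of the first digit
    j::Int      # byte index of the last digit
end

# The raw text of the number; no numeric parsing happens here.
text(n::ToyNumber) = SubString(n.s, n.i, n.j)

# Printing (and therefore string interpolation) copies the raw text as-is.
Base.print(io::IO, n::ToyNumber) = print(io, text(n))

# Numeric parsing is deferred until a value is explicitly requested.
function value(n::ToyNumber)
    t = text(n)
    x = tryparse(Int64, t)
    return x === nothing ? parse(Float64, t) : x
end

doc = "{\"width\": 640, \"height\": 4.8e2}"
w = ToyNumber(doc, 11, 13)   # byte range of 640, located by hand for this sketch
h = ToyNumber(doc, 26, 30)   # byte range of 4.8e2

html = "<img width=$w, height=$h>"   # textual use: nothing is parsed
area = value(w) * value(h)           # numeric use: parsing happens here
```

In the interpolation the digits are copied straight from the JSON text, so the second number appears exactly as it was written in the source.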
Likewise, the content of a LazyJSON.String is not interpreted until it is accessed. If a LazyJSON.String containing complex UTF-16 escape sequences is compared to a UTF-8 Base.String, and the two strings differ in the first few characters, then the comparison will terminate before any unescaping work needs to be done.
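A minimal sketch of such an early-terminating comparison is shown here. It is not LazyJSON's code: lazy_equal and next_char are hypothetical names, and the input is assumed to be the well-formed raw text of a JSON string, including its surrounding quotes. Escapes are decoded one character at a time, so a mismatch in the first character returns before any later escape is touched.

```julia
# Toy sketch (hypothetical names); assumes well-formed JSON string text.

# Decode the single character starting at index i of the raw JSON string
# text and return it together with the index of the next character.
function next_char(json, i)
    c = json[i]
    if c != '\\'
        return c, nextind(json, i)
    end
    e = json[i + 1]
    if e == 'u'
        hi = parse(UInt16, SubString(json, i + 2, i + 5); base = 16)
        if 0xd800 <= hi <= 0xdbff
            # High surrogate: the low surrogate follows as a second \uXXXX.
            lo = parse(UInt16, SubString(json, i + 8, i + 11); base = 16)
            cp = 0x10000 + ((UInt32(hi) - 0xd800) << 10) + (UInt32(lo) - 0xdc00)
            return Char(cp), i + 12
        end
        return Char(hi), i + 6
    end
    simple = e == 'n' ? '\n' : e == 't' ? '\t' : e == 'r' ? '\r' :
             e == 'b' ? '\b' : e == 'f' ? '\f' : e   # e covers \" \\ \/
    return simple, i + 2
end

# Compare raw JSON string text (with quotes) against a plain Julia string,
# stopping at the first character that differs.
function lazy_equal(json::AbstractString, plain::AbstractString)
    i = nextind(json, firstindex(json))   # skip the opening quote
    j = firstindex(plain)
    while json[i] != '"'
        j > lastindex(plain) && return false   # plain string ended first
        c, i = next_char(json, i)
        c == plain[j] || return false          # first difference: stop here
        j = nextind(plain, j)
    end
    return j > lastindex(plain)                # both must end together
end

same  = lazy_equal("\"caf\\u00e9\"", "café")
emoji = lazy_equal("\"\\ud83d\\ude00\"", "\U1f600")
early = lazy_equal("\"xyz\\ud83d\\ude00\"", "abc")   # fails on 'x' vs 'a'
```

In the last call the surrogate pair is never decoded, because the comparison has already failed on the first character.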
The values returned by LazyJSON consist of a reference to the complete JSON text String and the byte index of the value text. The LazyJSON.value(jsontext) function simply returns a LazyJSON.Value object with s = jsontext and i = 1.
String: {"foo": 1,    "bar": [1, 2, 3, "four"]}
        ▲                    ▲      ▲  ▲
        │                    │      │  │
        ├─────────────────┐  │      │  │
        │ LazyJSON.Array( s, i=22)  │  │   == Any[1, 2, 3, "four"]
        │                           │  │
        ├─────────────────┐  ┌──────┘  │
        │ LazyJSON.Number(s, i=29)     │   == 3
        │                              │
        ├─────────────────┐  ┌─────────┘
        │ LazyJSON.String(s, i=32)         == "four"
        │
        └─────────────────┬──┐
          LazyJSON.Object(s, i=1)
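The "string reference plus byte index" representation can be illustrated with a small self-contained sketch. This is a toy under stated assumptions, not LazyJSON's implementation: ToyValue, skip_ws, skip_value, field, element and rawtext are hypothetical names, the input is assumed to be well-formed JSON, keys are compared without unescaping, and out-of-range array indices are not checked.

```julia
# Toy sketch (hypothetical names); assumes well-formed JSON input.
struct ToyValue
    s::String   # complete JSON text, shared by every value
    i::Int      # byte index where this value's text starts
end

const WS = (0x20, 0x09, 0x0a, 0x0d)               # space, tab, LF, CR
const ENDS = (UInt8(','), UInt8(']'), UInt8('}'))  # bytes that end a scalar

# Advance past whitespace starting at byte index i.
function skip_ws(s, i)
    while i <= ncodeunits(s) && codeunit(s, i) in WS
        i += 1
    end
    return i
end

# Return the byte index just past the value whose text starts at i.
function skip_value(s, i)
    c = codeunit(s, i)
    if c == UInt8('"')                     # string: scan to the closing quote
        i += 1
        while codeunit(s, i) != UInt8('"')
            i += codeunit(s, i) == UInt8('\\') ? 2 : 1
        end
        return i + 1
    elseif c == UInt8('[') || c == UInt8('{')   # container: track nesting depth
        depth = 0
        while true
            c = codeunit(s, i)
            if c == UInt8('"')
                i = skip_value(s, i)       # brackets inside strings do not count
                continue
            elseif c == UInt8('[') || c == UInt8('{')
                depth += 1
            elseif c == UInt8(']') || c == UInt8('}')
                depth -= 1
                depth == 0 && return i + 1
            end
            i += 1
        end
    else                                   # number, true, false or null
        while i <= ncodeunits(s) && !(codeunit(s, i) in WS) && !(codeunit(s, i) in ENDS)
            i += 1
        end
        return i
    end
end

# Look up a field of the object at v.i, scanning only as far as needed.
function field(v::ToyValue, key)
    s = v.s
    i = skip_ws(s, v.i + 1)                        # step past '{'
    while codeunit(s, i) != UInt8('}')
        kend = skip_value(s, i)                    # index just past the key string
        k = SubString(s, i + 1, prevind(s, kend - 1))   # key text without quotes
        i = skip_ws(s, skip_ws(s, kend) + 1)       # step past ':'
        k == key && return ToyValue(s, i)
        i = skip_ws(s, skip_value(s, i))           # skip the unwanted value
        codeunit(s, i) == UInt8(',') && (i = skip_ws(s, i + 1))
    end
    throw(KeyError(key))
end

# Return the n-th element of the array at v.i (no bounds checking).
function element(v::ToyValue, n)
    s = v.s
    i = skip_ws(s, v.i + 1)                        # step past '['
    for _ in 1:n-1
        i = skip_ws(s, skip_value(s, i))           # skip one element
        i = skip_ws(s, i + 1)                      # step past ','
    end
    return ToyValue(s, i)
end

# The raw, uninterpreted text of a value.
rawtext(v::ToyValue) = SubString(v.s, v.i, prevind(v.s, skip_value(v.s, v.i)))

doc   = "{\"name\": \"x\", \"list\": [10, 20, \"thirty\"]}"
root  = ToyValue(doc, 1)            # no parsing has happened yet
list  = field(root, "list")         # scans only up to the "list" field
third = element(list, 3)            # continues scanning inside the array
```

Every value returned here is just a pair of the shared string and an integer, so no arrays or dictionaries are allocated while navigating.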
Mirco Zeiss’ 180MB citylots.json file provides a nice demonstration of the potential performance benefits:
julia> using LazyJSON, DataFrames
julia> j = String(read("citylots.json"));
julia> const J = LazyJSON
julia> function load_coords(j, n)
           d = DataFrame(x = Float64[], y = Float64[], z = Float64[])
           for x in J.parse(j)["features"][n]["geometry"]["coordinates"]
               for v in x
                   push!(d, v)
               end
           end
           return d
       end
Accessing data near the start of the file is really fast:
julia> @time load_coords(j, 1)
0.000080 seconds (128 allocations: 5.438 KiB)
5×3 DataFrame
│ Row │ x │ y │ z │
├─────┼──────────┼─────────┼─────┤
│ 1 │ -122.422 │ 37.8085 │ 0.0 │
...
│ 5 │ -122.422 │ 37.8085 │ 0.0 │
Accessing the last record in the file is slower, but memory use stays low:
julia> @time load_coords(j, 206560)
0.236713 seconds (217 allocations: 8.422 KiB)
11×3 DataFrame
│ Row │ x │ y │ z │
├─────┼──────────┼─────────┼─────┤
│ 1 │ -122.424 │ 37.7829 │ 0.0 │
...
│ 11 │ -122.424 │ 37.7829 │ 0.0 │
The non-lazy standard JSON.jl parser uses more time and more memory irrespective of the location of the data (here load_coords calls JSON.parse instead of LazyJSON.parse):
julia> @time load_coords(j, 1)
6.097472 seconds (52.32 M allocations: 1.738 GiB, 36.45% gc time)
5×3 DataFrame
However, the standard parser produces a tree of Array and Dict objects that provide fast access to all the values in the JSON text. So the overhead of parsing the whole file may be worth it if you need random access to multiple values. More information on the performance tradeoffs of lazy parsing is here: LazyJSON Performance Considerations.