JSON Performance Tests

samoconnor · November 4, 2018, 7:52am

I noticed that @kristoffer.carlsson has been working on parsing performance in JSON.jl: A few performance improvements by KristofferC · Pull Request #263 · JuliaIO/JSON.jl · GitHub … and it got me thinking.

When I first created LazyJSON.jl I targeted Julia 0.7-dev at a time when there were a lot of string changes going on. At the time performance comparisons with JSON.jl and JSON2.jl would vary wildly between Julia 0.7-dev nightly builds as various deprecation penalties came and went. Now that Julia 1.0 is out and JSON.jl and JSON2.jl have been updated for Julia 1.0, it seems like a good time to compare performance again.

These tests compare LazyJSON.jl to @kristoffer.carlsson’s kc/opt branch and JSON2.jl v0.2.3.

Test code is here: LazyJSON.jl/benchmark.jl at master · JuliaCloud/LazyJSON.jl · GitHub

LazyJSON.jl seems to take about the same time as JSON.jl to do flat non-lazy parsing to Julia Dict/Array etc (sometimes a bit faster, sometimes a bit slower).
In lazy mode LazyJSON.jl is orders of magnitude faster when only part of the input data is used.
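
To illustrate why lazy access can be so much cheaper, here is a toy sketch of the general idea. It is not LazyJSON.jl's implementation, and the names `skip_ws`, `skip_string`, `skip_value` and `lazy_get` are hypothetical. It scans a top-level JSON object for one key and steps over every other value without building a Dict or Array for it. It assumes well-formed input.

```julia
# Toy lazy lookup (hypothetical helpers, not the LazyJSON.jl API).
# Assumes well-formed JSON whose top-level value is an object.

# Advance past any whitespace starting at index i.
function skip_ws(s, i)
    while i <= ncodeunits(s) && isspace(s[i])
        i = nextind(s, i)
    end
    return i
end

# s[i] is an opening '"'; return the index just past the closing '"'.
function skip_string(s, i)
    i = nextind(s, i)
    while s[i] != '"'
        s[i] == '\\' && (i = nextind(s, i))   # step over the escaped character
        i = nextind(s, i)
    end
    return nextind(s, i)
end

# Return the index just past the JSON value that starts at s[i].
function skip_value(s, i)
    c = s[i]
    if c == '"'
        return skip_string(s, i)
    elseif c == '{' || c == '['
        depth = 0
        while true
            c = s[i]
            if c == '"'
                i = skip_string(s, i)         # brackets inside strings do not count
                continue
            end
            depth += (c == '{' || c == '[') - (c == '}' || c == ']')
            i = nextind(s, i)
            depth == 0 && return i
        end
    else
        # number, true, false or null: runs until a delimiter
        while i <= ncodeunits(s) && !(s[i] in (',', '}', ']', ' ', '\n', '\t', '\r'))
            i = nextind(s, i)
        end
        return i
    end
end

# Return the raw JSON text of the value stored under `key`, or `nothing`.
function lazy_get(s, key)
    i = skip_ws(s, firstindex(s))
    @assert s[i] == '{'
    i = skip_ws(s, nextind(s, i))
    while s[i] != '}'
        k_end = skip_string(s, i)
        k = SubString(s, nextind(s, i), prevind(s, k_end, 2))
        i = skip_ws(s, k_end)                 # now at ':'
        i = skip_ws(s, nextind(s, i))         # start of the value
        v_end = skip_value(s, i)
        k == key && return SubString(s, i, prevind(s, v_end))
        i = skip_ws(s, v_end)
        s[i] == ',' && (i = skip_ws(s, nextind(s, i)))
    end
    return nothing
end

doc = """{"a": [1, {"b": "x}"}], "c": 42, "d": "tail"}"""
println(lazy_get(doc, "c"))
```

Two properties of this sketch mirror the results in this post: a key near the end of the input still requires scanning everything before it, and skipped values are never checked for syntax errors.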

@fengyang.wang what are your thoughts on either adding lazy parsing as an option in JSON.jl or replacing the current parser with a lazy parser? (One thing that would have to be done is to add a validation option to do strict JSON syntax checking for use cases where that is required. As it stands the lazy parser will ignore some JSON syntax errors due to laziness.)

@quinnj Can you comment on the JSON2 results? I know that JSON2 is optimised for marshalling/unmarshalling, so perhaps my tests that result in JSON2 returning NamedTuples are a degenerate case. In test6 I tried to test JSON2’s direct-to-struct parsing in a way that seems to me to be “how JSON2 is intended to be used”, but it still seems a bit slow. Perhaps you can suggest a test case that would best demonstrate JSON2’s strengths.

@Nosferican I believe that you have been using LazyJSON.jl in NCEI.jl. Do you have any feedback from your use of the package?

test1

Reads ec2-2016-11-15.normal.json and extracts a single value:
operations.AcceptReservedInstancesExchangeQuote.input.shape.
This value is close to the start of the input data.

Variants:

Lazy: LazyJSON.jl AbstractDict interface.
Lazy (B): LazyJSON.jl getproperty interface.
Lazy (C): LazyJSON.jl lazy=false (parse whole input to Dicts etc like JSON.jl does)
JSON: JSON.jl parse interface.
JSON2: JSON2.jl read -> NamedTuple interface.

results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test1 │ Lazy     │ 54     │ 5184      │ 269       │ 0        │
│ 2   │ test1 │ Lazy (B) │ 51     │ 5408      │ 277       │ 0        │
│ 3   │ test1 │ Lazy (C) │ 105628 │ 51409504  │ 980424    │ 300      │
│ 4   │ test1 │ JSON     │ 103870 │ 50429936  │ 491747    │ 510      │
│ 5   │ test1 │ JSON2    │ 609448 │ 147471280 │ 4162257   │ 890      │

Note: LazyJSON.jl is similar to JSON.jl in speed and memory use in non-lazy mode.
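
For readers unfamiliar with the table columns: the names `poolalloc` and `bigalloc` match fields of the `Base.GC_Diff` value that Julia's `@timed` macro returns. The sketch below shows one way such a row could be collected. It is an assumption about methodology and is not taken from benchmark.jl; the helper name `measure` and the sample workload are hypothetical.

```julia
# Hypothetical helper: run `f` once to compile it, then time a second run.
function measure(f)
    f()                    # warm-up so compilation is not included in the timing
    t = @timed f()         # (value, time, bytes, gctime, gcstats, ...)
    gc = t[5]              # Base.GC_Diff with allocation counters
    return (μs = round(Int, t[2] * 1e6),
            bytes = t[3],
            poolalloc = gc.poolalloc,
            bigalloc = gc.bigalloc)
end

# Sample workload: allocates a one-million-element Vector{Int}.
r = measure(() -> sum(collect(1:1_000_000)))
println(r)
```

Positional indexing (`t[2]`, `t[3]`, `t[5]`) is used because `@timed` returned a plain tuple in Julia 1.0 and a NamedTuple in later releases.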

test2

Reads ec2-2016-11-15.normal.json and extracts an array value:
shapes.scope.enum
This value is close to the end of the input data.

Variants:

Lazy: LazyJSON.jl AbstractDict interface.
Lazy (B): LazyJSON.jl getproperty interface.
Lazy (C): LazyJSON.jl lazy=false (parse whole input to Dicts etc)
JSON: JSON.jl parse interface.
JSON2: JSON2.jl read -> NamedTuple interface.

results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test2 │ Lazy     │ 11035  │ 3296      │ 162       │ 0        │
│ 2   │ test2 │ Lazy (B) │ 11028  │ 3440      │ 168       │ 0        │
│ 3   │ test2 │ Lazy (C) │ 115045 │ 51409600  │ 980426    │ 300      │
│ 4   │ test2 │ JSON     │ 91334  │ 50429936  │ 491747    │ 510      │
│ 5   │ test2 │ JSON2    │ 605269 │ 147471280 │ 4162257   │ 890      │

Note: It takes LazyJSON.jl longer to access values near the end of
the input (about 11 ms here, versus about 0.05 ms in test1).

test3

Modifies ec2-2016-11-15.normal.json by replacing a value near the
start of the file and two values near the end.

Variants:

Lazy: LazyJSON.jl getproperty interface finds values and
LazyJSON.splice modifies the JSON data in-place.
JSON: JSON.jl parse to Dict, modify, then write new JSON text.
JSON2: Parses to immutable NamedTuples. Modification not supported.

results = 2×6 DataFrame
│ Row │ Test  │ Variant │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test3 │ Lazy    │ 235024 │ 880768    │ 33622     │ 0        │
│ 2   │ test3 │ JSON    │ 671735 │ 126950528 │ 1407838   │ 1021     │

test4

Reads a 1.2MB GeoJSON file and extracts a country name near the middle
of the file.

Variants:

Lazy: LazyJSON.parse(j)["features"][15]["properties"]["formal_en"]
Lazy (B): LazyJSON.parse(j; getproperty=true).features[15].properties.formal_en
Lazy (C): LazyJSON.parse(j; lazy=false)["features"][15]["properties"]["formal_en"]
JSON: JSON.parse(j)["features"][15]["properties"]["formal_en"]
JSON2: JSON2.read(j).features[15].properties.formal_en

results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs    │ bytes    │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼───────┼──────────┼───────────┼──────────┤
│ 1   │ test4 │ Lazy     │ 310   │ 2288     │ 115       │ 0        │
│ 2   │ test4 │ Lazy (B) │ 312   │ 2432     │ 121       │ 0        │
│ 3   │ test4 │ Lazy (C) │ 40696 │ 13134624 │ 462247    │ 48       │
│ 4   │ test4 │ JSON     │ 41609 │ 6336752  │ 135146    │ 100      │
│ 5   │ test4 │ JSON2    │ 84167 │ 22868160 │ 477011    │ 48       │

Note: LazyJSON.jl in non-lazy mode is a bit faster than JSON.jl for this
input.

test5

Reads a 1.2MB GeoJSON file and checks that the outline polygon for
a single country is within an expected lat/lon range.

r = r["features"][15]["geometry"]["coordinates"][6][1]
@assert r[1][1] == 134.41651451900023
for (x, y) in r
   @assert 134.2 < x < 134.5
   @assert 7.21 < y < 7.32
end

results = 3×6 DataFrame
│ Row │ Test  │ Variant │ μs    │ bytes    │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼───────┼──────────┼───────────┼──────────┤
│ 1   │ test5 │ Lazy    │ 399   │ 22992    │ 967       │ 0        │
│ 2   │ test5 │ JSON    │ 40635 │ 6340592  │ 135296    │ 100      │
│ 3   │ test5 │ JSON2   │ 81213 │ 22872000 │ 477161    │ 48       │

test6

Defines struct Operation, struct IOType and struct HTTP with
fields that match the API operations data in ec2-2016-11-15.normal.json.
It then does JSON2-style direct-to-struct parsing to read the JSON data
into a Julia object Dict{String,Operation}
(LazyJSON provides @generated Base.convert methods for this).

Variants:

Lazy: LazyJSON.jl AbstractDict interface.
convert(Dict{String,Operation}, LazyJSON.parse(j))
JSON2: JSON2.jl read with a target type (direct-to-struct) interface.
JSON2.read(j, Dict{String,Operation})

results = 2×6 DataFrame
│ Row │ Test  │ Variant │ μs    │ bytes   │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼───────┼─────────┼───────────┼──────────┤
│ 1   │ test6 │ Lazy    │ 6866  │ 1125600 │ 39538     │ 16       │
│ 2   │ test6 │ JSON2   │ 13096 │ 3427888 │ 135789    │ 60       │

Note:
For all of the above tests, the content of ec2-2016-11-15.normal.json has been
duplicated 10 times into a top level JSON array “[ , , , …]”. This
results in an overall input data size of ~10MB.
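
As a rough illustration of that duplication step, the sketch below wraps several copies of a JSON document in a top-level array. The helper name `duplicate_json` and the tiny sample document are hypothetical; the real benchmark input is the ec2 file named above.

```julia
# Hypothetical helper: wrap `n` copies of a JSON document in a top-level array.
duplicate_json(doc::AbstractString, n::Integer) =
    "[" * join(fill(doc, n), ",") * "]"

doc = """{"a": 1}"""          # stand-in for the contents of the real file
big = duplicate_json(doc, 10)
println(sizeof(big), " bytes")
```

The result is roughly `n` times the size of the original document, plus the brackets and separating commas.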

fengyang.wang · November 4, 2018, 4:45pm

Thanks for running these tests. It is not surprising to me that the namedtuples incur a high cost, as JSON is often used for relatively “unstructured” data, so specializations need to be compiled for a large number of types. It is also not surprising that the lazy parser does much better on tests that do not require reading the entire input.

My feeling is that it is most sensible to have both a greedy and a lazy parser, as both have their use cases. The confusion for the end user could potentially be alleviated by abstracting the JSON (which could be renamed SimpleJSON) and LazyJSON packages, and providing a uniform interface with appropriate keyword arguments lazy=false, strict=false, etc., somewhat similar to how the DifferentialEquations.jl ecosystem is structured. I believe that this would be better than having distinct JSON and LazyJSON packages both provide duplicated functionality.

On a related note, there are some outstanding changes proposed by @ScottPJones that could potentially improve the performance of JSON.jl, but we would need some updating and review on those old pull requests.

kristoffer.carlsson · November 4, 2018, 5:13pm

It would also be interesting to measure some other state of the art JSON parsers to give us an idea where we are right now compared to the rest of the world.

samoconnor · November 5, 2018, 2:26am

other state of the art JSON parsers

These would probably be a good starting point…

samoconnor · November 5, 2018, 4:03am

I’ve also just tried running some of the LazyJSON.jl correctness tests against JSON.jl and JSON2.jl.
For the most part, JSON.jl only fails tests that the NST JSONTestSuite classes as optional:

https://github.com/JuliaIO/JSON.jl/issues/267

https://github.com/JuliaIO/JSON.jl/issues/266

https://github.com/quinnj/JSON2.jl/issues/19

https://github.com/quinnj/JSON2.jl/issues/18

yakir12 · November 5, 2018, 9:26am

Just to chime in: I find lazy reading of JSON data very useful, but like @fengyang.wang mentioned, it’s kind of hard to juggle slightly different APIs (that of JSON, JSON2, and LazyJSON) for slightly different tasks (explicitly/lazily reading/writing simple/complex types). So I just default to JSON and make do.

samoconnor · November 6, 2018, 2:40am

@yakir12, @fengyang.wang,

I agree that the multiple packages are confusing. I’ve made some notes about a proposed unified API here:

github.com/JuliaIO/JSON.jl

Proposal: Unified JSON API

opened 02:28AM - 06 Nov 18 UTC

samoconnor

This issue follows [Discourse comments about unifying the APIs of JSON.jl, LazyJSON.jl and JSON2.jl](https://discourse.julialang.org/t/json-performance-tests/17133/6).

### 1. Define julia types for JSON Values

The current API uses `Base.String` to represent encoded JSON and `Base.Dict` etc to represent decoded JSON. The functions `JSON.parse` and `JSON.json` are used to convert between the two representations. This API restricts the implementation to be non-lazy. It also precludes the possibility of implementing short-cut methods for JSON derived values.

[JavaScript Object Notation](https://www.json.org) defines 6 value types. Defining Julia types to represent these JSON value types will enable us to hide the implementation details (lazy vs eager parsing, encoded string representation vs decoded AST representation, etc). Treating JSON values as first class types (rather than as something that must be converted to a `Base` type) allows dispatch on these types and transparent implementation of short-cut methods as needed for efficiency.

```julia
const JSON.Value = Union{
    JSON.Object,
    JSON.Array,
    JSON.Number,
    JSON.String,
    JSON.Bool,
    JSON.Null
}
```

### 2. Construct JSON Value objects from strings

The implementation might immediately parse encoded strings into Julia collection types, or it might parse to intermediate AST types, or it might just lazily wrap the encoded string. That implementation detail would be hidden from the user (unless there are compelling use cases where user-supplied implementation hints are a big performance win, e.g. a `lazy=false` option). It seems likely that a combination of sensible defaults and heuristics can achieve good performance in most cases without any need for the user to fiddle with options.

```
"""
    JSON.Value(::AbstractString)::JSON.Value

Create a JSON object from a JSON formatted string.
"""
```

```julia
julia> x = JSON.Value("""{
           "object": {"field": "value"},
           "array": [1,2,3],
           "number": 43,
           "bool": true,
           "null": null
       }""")

julia> x.object.field
"value"

julia> x.array[1]
1
```

### 2. Construct JSON Value objects from julia objects.

The implementation might immediately encode the julia objects to a JSON string, or it might just wrap them and do nothing, or it might convert them to an intermediate representation. That detail is hidden from the user.

```
"""
    JSON.Value(o)::JSON.Value

Create a JSON object from a Julia object.
"""
```

```julia
julia> x = JSON.Value(Dict(
           "object" => Dict("field" => "value"),
           "array" => [1,2,3],
           "number" => 43,
           "bool" => true,
           "null" => nothing
       ))

julia> x.object.field
"value"

julia> x.array[1]
1
```

### 3. Use `Base.string` to produce JSON encoded strings.

Rather than using `JSON.json` to produce encoded strings, just use `Base.string`. Depending on the `JSON.Value` implementation, `string` might just return a preexisting encoded string, or it might have to produce an encoded string from an internal representation.

```
    Base.string(o::JSON.Value)::AbstractString

JSON formatted string representation of a JSON object.
```

```julia
julia> x = JSON.Value(Dict(
           "object" => Dict("field" => "value"),
           "array" => [1,2,3],
           "number" => 43,
           "bool" => true,
           "null" => nothing
       ))

julia> string(x)
"{\"object\":{\"field\":\"value\"},\"array\":[1,2,3],\"number\":43,\"bool\":true,\"null\":null}"
```

### 4. Use `Base.convert` to do direct-to-struct parsing.

e.g. like the direct-to-struct parsing feature first implemented in JSON2:

```julia
julia> struct MyType
           field
       end

julia> convert(MyType, JSON.Value("""{"field": "value"}"""))
MyType("value")
```

The `convert` methods would be [`@generated`](https://github.com/samoconnor/LazyJSON.jl/blob/master/src/AbstractDict.jl#L76-L89).

### 5. Use `Base.convert` in cases when specific `Base` types are needed.

Most of the time `JSON.Value` types that implement `AbstractDict`, `AbstractArray`, `Base.Real`, `AbstractString` etc are all that the user will need. In cases where the user wants a specific type, they can use `convert`:

```julia
julia> convert(Vector{Float64}, JSON.Value("[0.25, 0.5, 1, 2, 4, 8]"))
6-element Array{Float64,1}:
 0.25
 0.5
 1.0
 2.0
 4.0
 8.0
```

### 6. Backwards compatibility

The existing API could be maintained as follows:

```julia
JSON.parse(x; kw...) = JSON.Value(x; kw...)
JSON.json(x) = string(JSON.Value(x))
```

We could implement `parse` so that it produces lazy value objects by default and produces non-lazy values only when the `dicttype=` or `inttype=` options are supplied. Or we could disable laziness entirely for the `JSON.parse` interface and say "if you want the new lazy thing, use `JSON.Value`". If returning an `AbstractDict` from `parse` instead of a `Dict` causes breakage (or performance regression) in existing code, then we should start out by returning `Dict`.

### 7. Implementation

We can cherry-pick implementation detail from the various existing JSON codebases. i.e. Use the fast float decoder from over here, but use the more robust UTF-16 decoder from there. It might turn out that using the lazy parser is just as fast as the non-lazy one for doing a full non-lazy parse. In that case we may only need one parser. Or, if there are cases where the existing non-lazy parser has big wins, we can keep both. The user should not be able to tell the difference.

My feeling is that it is most sensible to have both a greedy and a lazy parser, as both have their use cases.

I am hopeful that with sufficient tweaking the lazy parser can perform as well as a greedy parser in most use cases. However, that can only be proven or disproven by testing. The first step is to decide on an API that can support both implementations. We might end up with a lazy=true/false flag, or we might end up with an implementation that combines laziness with caching to handle degenerate access patterns.

yakir12 · November 6, 2018, 6:09am

I think

is totally fine, at least to begin with. We could have a rule of thumb that if you need to access only 10% of the available keys then you should use lazy=true or some such.
