JSON Performance Tests

I noticed that @kristoffer.carlsson has been working on parsing performance in JSON.jl (A few performance improvements by KristofferC, Pull Request #263, JuliaIO/JSON.jl), and it got me thinking.

When I first created LazyJSON.jl I targeted Julia 0.7-dev at a time when there were a lot of string changes going on. At the time performance comparisons with JSON.jl and JSON2.jl would vary wildly between Julia 0.7-dev nightly builds as various deprecation penalties came and went. Now that Julia 1.0 is out and JSON.jl and JSON2.jl have been updated for Julia 1.0, it seems like a good time to compare performance again.

These tests compare LazyJSON.jl to @kristoffer.carlsson’s kc/opt branch and JSON2.jl v0.2.3.

Test code is here: LazyJSON.jl/benchmark.jl at master · JuliaCloud/LazyJSON.jl · GitHub

LazyJSON.jl seems to take about the same time as JSON.jl to do flat non-lazy parsing to Julia Dict/Array etc (sometimes a bit faster, sometimes a bit slower).
In lazy mode LazyJSON.jl is orders of magnitude faster when only part of the input data is used.

@fengyang.wang what are your thoughts on either adding lazy parsing as an option in JSON.jl or replacing the current parser with a lazy parser? (One thing that would have to be done is to add a validation option to do strict JSON syntax checking for use cases where that is required. As it stands, the lazy parser will ignore some JSON syntax errors due to laziness.)
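To illustrate why a lazy parser can be fast when only part of the input is used, and why it can miss syntax errors, here is a toy sketch. It is not how LazyJSON.jl is implemented; `skip_value` and `lazy_get` are hypothetical names, and the scanner only handles a top-level object. It finds one key by skipping over the values in front of it (counting bracket depth, never building or validating them) and never looks at anything after the match.

```julia
# Toy sketch only: NOT LazyJSON.jl's implementation. All names are hypothetical.
is_ws(c::UInt8) = c in (0x20, 0x0a, 0x0d, 0x09)
is_delim(c::UInt8) = is_ws(c) || c in (UInt8(','), UInt8('}'), UInt8(']'))

function skip_ws(b, i)
    while i <= length(b) && is_ws(b[i])
        i += 1
    end
    return i
end

# Return the index just past the value that starts at b[i], without building it.
function skip_value(b, i)
    c = b[i]
    if c == UInt8('"')                        # string: scan to the closing quote
        i += 1
        while b[i] != UInt8('"')
            i += b[i] == UInt8('\\') ? 2 : 1  # step over escaped characters
        end
        return i + 1
    elseif c == UInt8('{') || c == UInt8('[') # container: track nesting depth only
        depth = 0
        while true
            c = b[i]
            if c == UInt8('"')
                i = skip_value(b, i)          # brackets inside strings do not count
                continue
            end
            depth += (c == UInt8('{') || c == UInt8('[')) - (c == UInt8('}') || c == UInt8(']'))
            i += 1
            depth == 0 && return i
        end
    else                                      # number, true, false or null
        while i <= length(b) && !is_delim(b[i])
            i += 1
        end
        return i
    end
end

# Return the raw text of the value stored under `key` in a top-level JSON object.
function lazy_get(s::String, key::String)
    b = codeunits(s)
    i = findfirst(==(UInt8('{')), b) + 1
    while true
        i = skip_ws(b, i)
        b[i] == UInt8('}') && throw(KeyError(key))
        kend = skip_value(b, i)               # the key string spans i:kend-1
        k = String(b[i+1:kend-2])
        i = skip_ws(b, skip_ws(b, kend) + 1)  # step over ':' to the value
        vend = skip_value(b, i)
        k == key && return String(b[i:vend-1])
        i = skip_ws(b, vend)
        b[i] == UInt8(',') && (i += 1)
    end
end

# The value of "c" is not valid JSON, but a lookup of "b" never reads that far.
s = """{"a": {"x": [1, 2, {"y": "}"}]}, "b": 42, "c": [1 2 oops]}"""
lazy_get(s, "b")
```

Even `lazy_get(s, "c")` returns the malformed text without complaint, which is the kind of gap a strict validation option would have to close.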

@quinnj Can you comment on the JSON2 results? I know that JSON2 is optimised for marshalling/unmarshalling, so perhaps my tests that result in JSON2 returning NamedTuples are a degenerate case. In test6 I tried to test JSON2’s direct-to-struct parsing in a way that seems to me to be “how JSON2 is intended to be used”, but it still seems a bit slow. Perhaps you can suggest a test case that would best demonstrate JSON2’s strengths.

@Nosferican I believe that you have been using LazyJSON.jl in NCEI.jl. Do you have any feedback from your use of the package?

test1

Reads ec2-2016-11-15.normal.json and extracts a single value:
operations.AcceptReservedInstancesExchangeQuote.input.shape.
This value is close to the start of the input data.

Variants:

  • Lazy: LazyJSON.jl AbstractDict interface.
  • Lazy (B): LazyJSON.jl getproperty interface.
  • Lazy (C): LazyJSON.jl lazy=false (parse whole input to Dicts etc like JSON.jl does)
  • JSON: JSON.jl parse interface.
  • JSON2: JSON2.jl read -> NamedTuple interface.
results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test1 │ Lazy     │ 54     │ 5184      │ 269       │ 0        │
│ 2   │ test1 │ Lazy (B) │ 51     │ 5408      │ 277       │ 0        │
│ 3   │ test1 │ Lazy (C) │ 105628 │ 51409504  │ 980424    │ 300      │
│ 4   │ test1 │ JSON     │ 103870 │ 50429936  │ 491747    │ 510      │
│ 5   │ test1 │ JSON2    │ 609448 │ 147471280 │ 4162257   │ 890      │

Note: LazyJSON.jl is similar to JSON.jl in speed and memory use in non-lazy mode.
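For readers who want to collect similar numbers themselves, here is a minimal sketch. It assumes the μs, bytes, poolalloc and bigalloc columns correspond to what Base's `@timed` reports; the actual benchmark.jl may gather them differently, and `measure` is a hypothetical helper.

```julia
# Sketch: collect timing and allocation figures for one call of f.
# Assumption: the table columns map onto the output of Base's @timed.
function measure(f)
    f()                                        # warm-up run so compilation is not timed
    val, t, bytes, gctime, stats = @timed f()  # stats is a Base.GC_Diff
    return (μs = round(Int, t * 1e6),
            bytes = bytes,
            poolalloc = stats.poolalloc,       # number of small (pooled) allocations
            bigalloc = stats.bigalloc)         # number of large allocations
end

m = measure(() -> collect(1:1_000_000))
```

A single timed run is noisy, so repeating the measurement and keeping the minimum is a reasonable refinement.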

test2

Reads ec2-2016-11-15.normal.json and extracts an array value:
shapes.scope.enum
This value is close to the end of the input data.

Variants:

  • Lazy: LazyJSON.jl AbstractDict interface.
  • Lazy (B): LazyJSON.jl getproperty interface.
  • Lazy (C): LazyJSON.jl lazy=false (parse whole input to Dicts etc)
  • JSON: JSON.jl parse interface.
  • JSON2: JSON2.jl read -> NamedTuple interface.
results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test2 │ Lazy     │ 11035  │ 3296      │ 162       │ 0        │
│ 2   │ test2 │ Lazy (B) │ 11028  │ 3440      │ 168       │ 0        │
│ 3   │ test2 │ Lazy (C) │ 115045 │ 51409600  │ 980426    │ 300      │
│ 4   │ test2 │ JSON     │ 91334  │ 50429936  │ 491747    │ 510      │
│ 5   │ test2 │ JSON2    │ 605269 │ 147471280 │ 4162257   │ 890      │

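Note: It takes LazyJSON.jl considerably longer to access values near the end of
the input (about 11 ms here versus about 0.05 ms in test1), but this is still
roughly 8 times faster than parsing the whole input with JSON.jl.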

test3

Modifies ec2-2016-11-15.normal.json by replacing a value near the
start of the file and two values near the end.

Variants:

  • Lazy: LazyJSON.jl getproperty interface finds values and
    LazyJSON.splice modifies the JSON data in-place.
  • JSON: JSON.jl parse to Dict, modify, then write new JSON text.
  • JSON2: Parses to immutable NamedTuples. Modification not supported.
results = 2×6 DataFrame
│ Row │ Test  │ Variant │ μs     │ bytes     │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼────────┼───────────┼───────────┼──────────┤
│ 1   │ test3 │ Lazy    │ 235024 │ 880768    │ 33622     │ 0        │
│ 2   │ test3 │ JSON    │ 671735 │ 126950528 │ 1407838   │ 1021     │

test4

Reads a 1.2MB GeoJSON file and extracts a country name near the middle
of the file.

Variants:

  • Lazy: LazyJSON.parse(j)["features"][15]["properties"]["formal_en"]
  • Lazy (B): LazyJSON.parse(j; getproperty=true).features[15].properties.formal_en
  • Lazy (C): LazyJSON.parse(j; lazy=false)["features"][15]["properties"]["formal_en"]
  • JSON: JSON.parse(j)["features"][15]["properties"]["formal_en"]
  • JSON2: JSON2.read(j).features[15].properties.formal_en
results = 5×6 DataFrame
│ Row │ Test  │ Variant  │ μs    │ bytes    │ poolalloc │ bigalloc │
├─────┼───────┼──────────┼───────┼──────────┼───────────┼──────────┤
│ 1   │ test4 │ Lazy     │ 310   │ 2288     │ 115       │ 0        │
│ 2   │ test4 │ Lazy (B) │ 312   │ 2432     │ 121       │ 0        │
│ 3   │ test4 │ Lazy (C) │ 40696 │ 13134624 │ 462247    │ 48       │
│ 4   │ test4 │ JSON     │ 41609 │ 6336752  │ 135146    │ 100      │
│ 5   │ test4 │ JSON2    │ 84167 │ 22868160 │ 477011    │ 48       │

Note: LazyJSON.jl in non-lazy mode is a bit faster than JSON.jl for this
input.

test5

Reads a 1.2MB GeoJSON file and checks that the outline polygon for
a single country is within an expected lat/lon range.

r = r["features"][15]["geometry"]["coordinates"][6][1]
@assert r[1][1] == 134.41651451900023
for (x, y) in r
   @assert 134.2 < x < 134.5
   @assert 7.21 < y < 7.32
end
results = 3×6 DataFrame
│ Row │ Test  │ Variant │ μs    │ bytes    │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼───────┼──────────┼───────────┼──────────┤
│ 1   │ test5 │ Lazy    │ 399   │ 22992    │ 967       │ 0        │
│ 2   │ test5 │ JSON    │ 40635 │ 6340592  │ 135296    │ 100      │
│ 3   │ test5 │ JSON2   │ 81213 │ 22872000 │ 477161    │ 48       │

test6

Defines struct Operation, struct IOType and struct HTTP with
fields that match the API operations data in ec2-2016-11-15.normal.json.
It then does JSON2-style direct-to-struct parsing to read the JSON data
into a Julia object Dict{String,Operation}
(LazyJSON provides @generated Base.convert methods for this).

Variants:

  • Lazy: LazyJSON.jl AbstractDict interface.
    convert(Dict{String,Operation}, LazyJSON.parse(j))
  • JSON2: JSON2.jl read -> struct (direct-to-struct) interface.
    JSON2.read(j, Dict{String,Operation})
results = 2×6 DataFrame
│ Row │ Test  │ Variant │ μs    │ bytes   │ poolalloc │ bigalloc │
├─────┼───────┼─────────┼───────┼─────────┼───────────┼──────────┤
│ 1   │ test6 │ Lazy    │ 6866  │ 1125600 │ 39538     │ 16       │
│ 2   │ test6 │ JSON2   │ 13096 │ 3427888 │ 135789    │ 60       │
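As a rough illustration of what direct-to-struct conversion involves, here is a minimal sketch. LazyJSON uses `@generated` `Base.convert` methods; this sketch swaps in plain runtime reflection instead, and `Point` and `from_dict` are hypothetical names that do not match the `Operation` types in the benchmark.

```julia
# Hypothetical struct for illustration; not one of the types used in test6.
struct Point
    x::Float64
    y::Float64
    label::String
end

# Build a T by looking up each field name in the parsed object and
# converting the value to the field's declared type.
function from_dict(::Type{T}, d::AbstractDict) where T
    args = (convert(fieldtype(T, n), d[String(n)]) for n in fieldnames(T))
    return T(args...)
end

# A Dict{String,T} target converts every value of the parsed object to T.
from_dict(::Type{Dict{String,T}}, d::AbstractDict) where T =
    Dict{String,T}(k => from_dict(T, v) for (k, v) in d)

parsed = Dict("a" => Dict("x" => 1, "y" => 2.5, "label" => "A"))
points = from_dict(Dict{String,Point}, parsed)
```

A generated function can resolve the field names and types once per struct type at compile time, which is presumably why that approach is attractive for parsing.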

Note:
For all of the above tests, the content of ec2-2016-11-15.normal.json has been
duplicated 10 times into a top-level JSON array “[ , , , …]”. This
results in an overall input data size of ~10MB.
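A minimal sketch of how such an input could be built, assuming the file contents are held in a string (`doc` below is a small stand-in, not the real file):

```julia
# Stand-in for the contents of ec2-2016-11-15.normal.json.
doc = """{"operations": {}, "shapes": {}}"""

# Wrap ten copies of the document in one top-level JSON array.
j = "[" * join(fill(doc, 10), ",") * "]"
```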


Thanks for running these tests. It is not surprising to me that the NamedTuples incur a high cost, as JSON is often used for relatively “unstructured” data, so specializations need to be compiled for a large number of types. It is also not surprising that the lazy parser does much better on tests that do not require reading the entire input.

My feeling is that it is most sensible to have both a greedy and a lazy parser, as both have their use cases. The confusion for the end user could potentially be alleviated by abstracting the JSON (which could be renamed SimpleJSON) and LazyJSON packages, and providing a uniform interface with appropriate keyword arguments lazy=false, strict=false, etc., somewhat similar to how the DifferentialEquations.jl ecosystem is structured. I believe that this would be better than having distinct JSON and LazyJSON packages both provide duplicated functionality.
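A purely hypothetical sketch of what such a front end could look like. None of these names exist in JSON.jl or LazyJSON.jl, and the two backends are stubs that only report which path was taken:

```julia
# Hypothetical unified front end; the backends below are stand-ins, not real parsers.
greedy_backend(s; strict) = (:greedy, strict, s)   # would eagerly build Dicts and Arrays
lazy_backend(s; strict)   = (:lazy, strict, s)     # would return lazy views of the input

# One entry point; keyword arguments select the implementation and strictness.
parse_json(s::AbstractString; lazy::Bool=false, strict::Bool=false) =
    lazy ? lazy_backend(s; strict=strict) : greedy_backend(s; strict=strict)

parse_json("{}")                          # greedy by default
parse_json("{}"; lazy=true, strict=true)  # lazy parsing with strict validation
```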

On a related note, there are some outstanding changes proposed by @ScottPJones that could potentially improve the performance of JSON.jl, but we would need some updating and review on those old pull requests.


It would also be interesting to measure some other state-of-the-art JSON parsers to give us an idea of where we are right now compared to the rest of the world.


other state of the art JSON parsers

These would probably be a good starting point…

I’ve also just tried running some of the LazyJSON.jl correctness tests against JSON.jl and JSON2.jl.
For the most part, JSON.jl only fails tests that the NST JSONTestSuite classifies as optional:

https://github.com/JuliaIO/JSON.jl/issues/267

https://github.com/JuliaIO/JSON.jl/issues/266

https://github.com/quinnj/JSON2.jl/issues/19

https://github.com/quinnj/JSON2.jl/issues/18


Just to chime in: I find lazy reading of JSON data very useful, but like @fengyang.wang mentioned, it’s kind of hard to juggle slightly different APIs (that of JSON, JSON2, and LazyJSON) for slightly different tasks (explicitly/lazily reading/writing simple/complex types). So I just default to JSON and make do.

@yakir12, @fengyang.wang,

I agree that the multiple packages are confusing. I’ve made some notes about a proposed unified API here:

My feeling is that it is most sensible to have both a greedy and a lazy parser, as both have their use cases.

I am hopeful that with sufficient tweaking the lazy parser can perform as well as a greedy parser in most use cases. However, that can only be proven or disproven by testing. The first step is to decide on an API that can support both implementations. We might end up with a lazy=true/false flag, or we might end up with an implementation that combines laziness with caching to handle degenerate access patterns.


I think

is totally fine, at least to begin with. We could have a rule of thumb that if you need to access only 10% of the available keys then you should use lazy=true or some such.