Announce: A different way to read JSON data, LazyJSON.jl


#1

LazyJSON.jl implements yet another way of reading JSON data in Julia. I wrote this as a proof of concept and it is probably not production ready, but if you work with JSON data in a performance-sensitive application, this approach might be beneficial. Documentation is in README.md.

LazyJSON.jl provides direct access to values stored in a JSON text through standard Julia interfaces: Number, AbstractString, AbstractVector and AbstractDict.

LazyJSON is lazy in the sense that it does not process any part of the JSON text until values are requested through the AbstractVector and AbstractDict interfaces.

i.e. j = LazyJSON.value(jsontext) does no parsing and immediately
returns a thin wrapper object.

j["foo"] calls get(::AbstractDict, "foo"), which parses just enough to find
the "foo" field.

j["foo"][4] calls getindex(::AbstractArray, 4), which continues parsing up to
the fourth item in the array.

This results in much less memory allocation compared to non-lazy parsers:

JSON.jl:

julia> j = String(read("ec2-2016-11-15.normal.json"))
julia> function f(json)
           v = JSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end

julia> @time f(j)
  0.066773 seconds (66.43 k allocations: 7.087 MiB)
"Availability Zone"

LazyJSON.jl:

julia> function f(json)
           v = LazyJSON.parse(json)
           v["shapes"]["scope"]["enum"][1]
       end

julia> @time f(j)
  0.001392 seconds (12 allocations: 384 bytes)
"Availability Zone"

LazyJSON’s AbstractString and Number implementations are lazy too.

The text of a LazyJSON.Number is not parsed to Int64 or Float64 form until it is needed for a numeric operation. If the number is only used in a textual context, it need never be parsed at all. e.g.

j = LazyJSON.value(jsontext)
html = """<img width=$(j["width"]), height=$(j["height"])>"""

Likewise, the content of a LazyJSON.String is not interpreted until it is accessed. If a LazyJSON.String containing complex UTF-16 escape sequences is compared to a UTF-8 Base.String, and the two strings differ in the first few characters, then the comparison will terminate before any unescaping work needs to be done.
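To illustrate the early-exit idea, here is a minimal sketch in plain Julia. It is not LazyJSON's implementation: json_string_equals and SIMPLE_ESCAPES are hypothetical names, the input is assumed to be the well-formed text between the quotes of a JSON string, and \u escapes are decoded one at a time without combining surrogate pairs. The function also returns how many escapes it had to decode, so the saving is visible.

```julia
# Hypothetical helper table: JSON escapes that map to a different character.
# Escapes such as \" \\ and \/ map to themselves via the `get` default below.
const SIMPLE_ESCAPES = Dict('n' => '\n', 't' => '\t', 'r' => '\r',
                            'b' => '\b', 'f' => '\f')

# Compare the escaped body of a JSON string with a plain Julia string,
# decoding escapes only when the comparison actually reaches them.
# Returns (is_equal, number_of_escapes_decoded).
function json_string_equals(escaped::AbstractString, plain::AbstractString)
    i = firstindex(escaped)
    j = firstindex(plain)
    unescaped = 0
    while i <= lastindex(escaped) && j <= lastindex(plain)
        c = escaped[i]
        i = nextind(escaped, i)
        if c == '\\'
            e = escaped[i]
            i = nextind(escaped, i)
            if e == 'u'
                # Four ASCII hex digits follow; surrogate pairs are NOT combined.
                c = Char(parse(UInt16, SubString(escaped, i, i + 3), base = 16))
                i += 4
            else
                c = get(SIMPLE_ESCAPES, e, e)
            end
            unescaped += 1
        end
        # Stop at the first mismatch; later escapes are never touched.
        c == plain[j] || return (false, unescaped)
        j = nextind(plain, j)
    end
    return (i > lastindex(escaped) && j > lastindex(plain), unescaped)
end
```

For example, json_string_equals("x\\u00e9\\u00e9", "y") gives up on the first character and reports zero decoded escapes.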

The values returned by LazyJSON consist of a reference to the complete JSON text String
and the byte index of the value text. The LazyJSON.value(jsontext) function
simply returns a LazyJSON.Value object with s = jsontext and i = 1.

    String: {"foo": 1,    "bar": [1, 2, 3, "four"]}
            ▲                    ▲      ▲  ▲
            │                    │      │  │
            ├─────────────────┐  │      │  │
            │ LazyJSON.Array( s, i=9)   │  │   == Any[1, 2, 3, "four"]
            │                           │  │
            ├─────────────────┐  ┌──────┘  │
            │ LazyJSON.Number(s, i=16)     │   == 3
            │                              │
            ├─────────────────┐  ┌─────────┘
            │ LazyJSON.String(s, i=19)         == "four"
            │
            └─────────────────┬──┐
              LazyJSON.Object(s, i=1)
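
As a rough sketch of what "a reference to the text plus a byte index" means in practice, here is a toy number type in plain Julia. The names LazyNumber, number_text, number_value and NUMBER_BYTES are made up for this illustration and are not LazyJSON's actual types; the sketch also assumes the index really does point at the start of a valid JSON number.

```julia
# Bytes that may appear inside a JSON number (hypothetical helper constant).
const NUMBER_BYTES = Set(UInt8.(collect("0123456789+-.eE")))

# Toy lazy value: holds the whole JSON text and the byte index of the number.
struct LazyNumber
    s::String   # complete JSON text (shared, never copied)
    i::Int      # byte index of the first character of the number
end

# Find the extent of the number text. No numeric conversion happens here.
function number_text(n::LazyNumber)
    j = n.i
    while j < ncodeunits(n.s) && codeunit(n.s, j + 1) in NUMBER_BYTES
        j += 1
    end
    return SubString(n.s, n.i, j)
end

# Only this step pays for the conversion to Float64.
number_value(n::LazyNumber) = parse(Float64, number_text(n))
```

Building a LazyNumber costs nothing beyond storing two fields; number_text is enough for textual uses such as string interpolation, and number_value is only needed for arithmetic.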

Mirco Zeiss’ 180MB citylots.json file provides a nice demonstration of the potential performance benefits:

julia> j = String(read("citylots.json"));
julia> const J = LazyJSON
julia> function load_coords(j, n)
           d = DataFrame(x = Float64[], y = Float64[], z = Float64[])
           for x in J.parse(j)["features"][n]["geometry"]["coordinates"]
               for v in x
                   push!(d, v)
               end
           end
           return d
       end

Accessing data near the start of the file is really fast:

julia> @time load_coords(j, 1)
  0.000080 seconds (128 allocations: 5.438 KiB)
5×3 DataFrame
│ Row │ x        │ y       │ z   │
├─────┼──────────┼─────────┼─────┤
│ 1   │ -122.422 │ 37.8085 │ 0.0 │
...
│ 5   │ -122.422 │ 37.8085 │ 0.0 │

Accessing the last record in the file is slower, but memory use stays low:

julia> @time load_coords(j, 206560)
  0.236713 seconds (217 allocations: 8.422 KiB)
11×3 DataFrame
│ Row │ x        │ y       │ z   │
├─────┼──────────┼─────────┼─────┤
│ 1   │ -122.424 │ 37.7829 │ 0.0 │
...
│ 11  │ -122.424 │ 37.7829 │ 0.0 │

The non-lazy standard JSON.jl parser uses more time and more memory irrespective of the location of the data:

julia> @time load_coords(j, 1)
  6.097472 seconds (52.32 M allocations: 1.738 GiB, 36.45% gc time)
5×3 DataFrame

However, the standard parser produces a tree of Array and Dict objects that provide fast access to all the values in the JSON text. So the overhead of parsing the whole file may be worth it if you need random access to multiple values. More information on the performance tradeoffs of lazy parsing is here: LazyJSON Performance Considerations.


#2

I’ve been playing around with @quinnj’s JSON2.jl package a little more in the last few days. For me the two standout features are:

  • '.' notation (getproperty) for field access (in JSON2 this is achieved by using NamedTuples)
  • automagically filling in a struct from a JSON text using JSON2.read(json_string, StructType)

I’ve been inspired to add similar capabilities to the latest version of LazyJSON by:

  • putting a getproperty wrapper around LazyJSON.Object so that 'o.field' is syntactic sugar for o["field"] (the field lookup is still done by the usual lazy getindex method).
  • adding a Base.convert(::Type{T}, ::LazyJSON.Object) method that constructs new structs from JSON text.

e.g. '.' notation for fields:

julia> arrow_json =
       """{
           "label": "Hello",
           "segments": [
                {"a": {"x": 1, "y": 1}, "b": {"x": 2, "y": 2}},
                {"a": {"x": 2, "y": 2}, "b": {"x": 3, "y": 3}}
            ],
            "dashed": false
       }"""

julia> lazy_arrow = LazyJSON.value(arrow_json)

julia> lazy_arrow.segments[1].a.x
1

e.g. convert to struct:

julia> struct Point
           x::Int
           y::Int
       end

julia> struct Line
           a::Point
           b::Point
       end

julia> struct Arrow
           label::String
           segments::Vector{Line}
           dashed::Bool
       end

julia> convert(Arrow, lazy_arrow)
Arrow("Hello", Line[Line(Point(1, 1), Point(2, 2)), Line(Point(2, 2), Point(3, 3))], false)

Implementation notes:

'.' notation is implemented by:
Base.getproperty(d::PropertyDict, n::Symbol) = getindex(d, String(n))

convert is implemented by a @generated function that generates something like:

function convert(::Type{Arrow}, o::JSON.Object)
    i = o.i
    i, label = get_field(o.s, "label", i)
    i, segments = get_field(o.s, "segments", i)
    i, dashed = get_field(o.s, "dashed", i)
    Arrow(label, segments, dashed)
end

The local variable i keeps track of the current string index to make finding the fields faster (if they are in the expected order). The call to get_field(o.s, "segments", i) returns a LazyJSON.Array object. When this is passed to the Arrow constructor, Julia automatically calls convert(Vector{Line}, ...) to convert it to the type of the struct field. In this way the conversion process works recursively for arbitrarily nested structs.
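
The recursion relies only on how Julia's default constructors behave, so it can be shown without LazyJSON at all. In the sketch below plain Dicts and Vectors stand in for LazyJSON.Object and LazyJSON.Array (an assumption made for illustration), and the three convert methods are written by hand instead of being generated:

```julia
struct Point
    x::Int
    y::Int
end

struct Line
    a::Point
    b::Point
end

struct Arrow
    label::String
    segments::Vector{Line}
    dashed::Bool
end

# Hand-written stand-ins for the generated methods. Each one passes raw
# field values to the constructor and lets Julia convert them.
Base.convert(::Type{Point}, d::AbstractDict) = Point(d["x"], d["y"])
Base.convert(::Type{Line},  d::AbstractDict) = Line(d["a"], d["b"])
Base.convert(::Type{Arrow}, d::AbstractDict) =
    Arrow(d["label"], d["segments"], d["dashed"])

# Nested Dicts playing the role of the lazy JSON values.
data = Dict("label" => "Hello",
            "segments" => [Dict("a" => Dict("x" => 1, "y" => 1),
                                "b" => Dict("x" => 2, "y" => 2)),
                           Dict("a" => Dict("x" => 2, "y" => 2),
                                "b" => Dict("x" => 3, "y" => 3))],
            "dashed" => false)

# Arrow(...) calls convert(Vector{Line}, ...), which converts each element
# with convert(Line, ...), which in turn calls convert(Point, ...).
arrow = convert(Arrow, data)
```

Only the top-level convert call is explicit; every nested conversion is triggered by a default constructor converting its arguments to the declared field types.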

But, if you care about minimising copying and memory allocation, and you don’t need to access all the fields in a JSON object, it might be better to do away with structs and use the '.' notation feature to access the JSON values in-place. In my experience with web services APIs it is quite common to have a whole mess of JSON returned by an API request, when I’m only interested in a few particular fields.


#3

I’ve noticed that Julia’s existing JSON parsers all support parsing from an IO stream as well as from a String or a Vector{UInt8}.

I’m not sure that I understand the use case for this. I would assume that if the JSON data is in a disk file it is best to use mmap to read it. And, in my experience of receiving JSON from a network stream, there is usually a lower-level framing protocol that gives you a complete JSON message in one piece (e.g. HTTP Content-Length).

Is the advantage of parsing JSON from a network stream simply that parsing can begin before the whole message is received? (it seems that with non-lazy parsers, the whole message must be received before parsing can complete).

I would be interested to know if anyone is using a JSON parser to read from a network stream in Julia.

The LazyJSON parser assumes that the JSON text is available as one big String. Its data structures are just references into this String. Therefore, it seems like the way to handle IO stream input in LazyJSON is to read the IO stream into a String, then let the lazy parser do its usual thing with the string.

I’ve added a new type IOString{T <: IO} <: AbstractString that reads data from an IO stream into an IOBuffer and makes the contents of the IOBuffer accessible through the AbstractString interface.

struct IOString{T <: IO} <: AbstractString
    io::T
    buf::IOBuffer
end

The lazy parser can parse JSON values from the IOString just like any other string. When the parser unexpectedly reaches the end of the string it throws a LazyJSON.ParseError. I’ve implemented some exception handling magic that reads the next chunk of data from the IO stream and resumes parsing. This is all transparent to the user. If you do j = LazyJSON.value(network_stream), then do println(j.user[7].permissions), the lazy parser will parse just enough to get the requested value (blocking to wait for more data as needed).
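
A much-simplified sketch of the refill-on-error pattern, in plain Julia. Everything here is hypothetical (IncompleteInput, first_quoted, parse_with_refill); it is not the IOString code, and unlike the real thing it restarts the toy parser from the beginning after each refill instead of resuming where it stopped.

```julia
# Hypothetical exception meaning "ran off the end of the text seen so far".
struct IncompleteInput <: Exception end

# Toy "parser": return the contents of the first double-quoted token,
# or throw IncompleteInput if its closing quote has not arrived yet.
function first_quoted(s::AbstractString)
    a = findfirst('"', s)
    a === nothing && throw(IncompleteInput())
    b = findnext('"', s, a + 1)
    b === nothing && throw(IncompleteInput())
    return SubString(s, nextind(s, a), prevind(s, b))
end

# Run parse_fn on the text received so far. If it runs out of input,
# read another chunk from io and try again.
function parse_with_refill(parse_fn, io::IO, chunk_size::Integer)
    text = ""
    while true
        try
            return parse_fn(text)
        catch e
            e isa IncompleteInput || rethrow()
            eof(io) && rethrow()          # nothing more to read: give up
            text *= String(read(io, chunk_size))
        end
    end
end

io = IOBuffer("{\"user\": \"sam\"}")
result = parse_with_refill(first_quoted, io, 3)
```

With a chunk size of 3 the loop reads only as far as it needs to find "user" and leaves the rest of the stream unread, which is the property that matters for stopping early.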

Details are in the documentation, examples are in the tests.

I hope that this interface provides a simple way to get JSON values from a stream with minimum overhead. I hope that the ability to stop when the required values have been extracted is an efficiency win in some use cases. I also imagine this interface could support an infinite stream of JSON data, e.g. a JSON Array that never ends.

I would like to hear from people who’ve been using JSON with streams in Julia. What interface would be most convenient for you?


#4

I think the main reason is to avoid copying. Unlike your lazy parser, which needs to keep the underlying data around until it is actually more completely parsed, JSON.jl and JSON2.jl create new string or numeric objects (String or Float64 unless the types have been overridden).

For your LazyJSON parser, that’s an interesting approach.
Note that JSON can be (by the standard) encoded in any one of 5 forms (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE). Because of the way the JSON standard works, the encoding can be detected (with 100% certainty) by only reading the first 4 bytes. (A JSON value must start with a limited set of whitespace, {, [, ", -, a digit, or n, f, t).
I don’t believe any of the current JSON parsers handle that, it’s something I’d like to deal with in a “StrJSON.jl” package.
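
For reference, the detection I have in mind is the null-byte pattern table from RFC 4627, section 3, which assumes the first two characters of the text are ASCII and that there is no byte order mark. A minimal sketch (detect_json_encoding is a made-up name, not part of any existing package):

```julia
# Guess the encoding of a JSON text from the zero bytes in its first 4 bytes,
# following the table in RFC 4627 section 3. Assumes no BOM and at least
# 4 bytes of input.
function detect_json_encoding(bytes::AbstractVector{UInt8})
    length(bytes) >= 4 || throw(ArgumentError("need at least 4 bytes"))
    z = (bytes[1] == 0x00, bytes[2] == 0x00, bytes[3] == 0x00, bytes[4] == 0x00)
    z == (true,  true,  true,  false) && return "UTF-32BE"   # 00 00 00 xx
    z == (true,  false, true,  false) && return "UTF-16BE"   # 00 xx 00 xx
    z == (false, true,  true,  true)  && return "UTF-32LE"   # xx 00 00 00
    z == (false, true,  false, true)  && return "UTF-16LE"   # xx 00 xx 00
    return "UTF-8"                                           # xx xx xx xx
end
```

For example, detect_json_encoding(codeunits("{\"a\": 1}")) falls through to "UTF-8" because none of the first four bytes is zero.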


#5

I think the main reason is to avoid copying

That makes sense. I wonder if it ends up being a performance win? There must be overhead in doing all the parsing through the IO interface.

Note that JSON can be (by the standard) encoded in any one of 5 forms

As of RFC 8259, UTF-8 is required:
"JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8"

https://tools.ietf.org/html/rfc8259#section-8.1


#6

Yes - going over the code in JSON.jl, I wondered about why all that complication was necessary.
Reading in chunks as big as are available, and having the “eof” checking code check if more has become available, seems to me like it would be faster. Do you have any benchmarks of doing that, when you are getting a lot of the values out of the JSON file, using your LazyJSON?

Yes, I was aware of RFC 8259, I didn’t know it had been accepted yet by the IETF, but that doesn’t really change things - you still may find JSON stored (legally per previous RFCs and ECMA 404) with the 4 other Unicode encodings they previously allowed, and there’s no reason not to accept them, if your string handling can deal with them.


#7

Not yet, part of my reason for asking about how people use JSON with IO is to help create a realistic benchmark. Knowing about the usual underlying packet size and typical fragmentation boundaries can help with optimising this sort of thing. e.g. I can’t believe anyone is actually reading JSON from a non-buffered RS232 port in Julia, so reading 1 character at a time is not a relevant use case. My guess is that most network streams have a large enough packet and/or buffer size that the overhead of exception handling at temporary-EOF is low.

one reason might be lazyness :wink:


#8

When I did a fast JSON parser for Caché ObjectScript, it wasn’t just network streams, but also dealing with huge JSON files (multi-gigabyte), so I had something that would return a string (usually of the next chunk size), so it was rare to have to call out to see whether it was really at the end of file (once every 64K characters, typically). Something like mmap wouldn’t have worked on 32-bit systems, because the files were larger than the address space.
So, yes, in my previous experience, you are correct.


#9

About LazyJSON.splice:

So, I was thinking about how LazyJSON should do updates.
Since it returns AbstractDict, you can already use Base.merge to create a modified Dict, but that has about the same overhead as JSON.jl.

However, there is an update use case where LazyJSON can do quite well: you have a largish JSON text and you need to update a few values. LazyJSON.splice works entirely in the text string domain and requires only a few lines of code (currently only replacing values is supported, but inserting new values is possible too):

splice(j::JSON.Value, v::JSON.Value, x) = value(splice(j.s, v.i, x, j.i))

splice(s::AbstractString, i::Int, x, start_i = 1) = 
    string(SubString(s, start_i, i - 1),
           jsonstring(x),
           SubString(s, lastindex_of_value(s, i) + 1))

e.g.

j = """{
    "id": 1296269,
    "owner": {
        "login": "octocat"
    },
    "parent": {
        "name": "test-parent"
    }
}"""

@test LazyJSON.splice(j, ["owner", "login"], "foo") ==
"""{
    "id": 1296269,
    "owner": {
        "login": "foo"
    },
    "parent": {
        "name": "test-parent"
    }
}"""

@test LazyJSON.splice(j, ["owner"], "foo") ==
"""{
    "id": 1296269,
    "owner": "foo",
    "parent": {
        "name": "test-parent"
    }
}"""

j = LazyJSON.value(j)

LazyJSON.splice(j, j.owner.login, "foo")
LazyJSON.Object with 3 entries:
  "id"     => 1296269
  "owner"  => LazyJSON.Object("login"=>"foo")
  "parent" => LazyJSON.Object("name"=>"test-parent")

#10

Your approach reminds me a lot of an XML parser (VTD) that was also non-extractive.
I wonder if some operators might do things like return the value locations (and maybe type info?),
or iterate over them, in such a way that you could start off, search for the pattern(s) you want to replace,
simply copy into an IOBuffer (or write out to an arbitrary stream) the unchanged parts, replace or not, and continue on, and finally, when nothing more found, copy the rest of the input to the IOBuffer (or stream), and return it (or close it).

I think there’s quite a lot of mileage that can be gotten from your lazy approaches!
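
Something like this is what I'm picturing for the copy-through part, sketched in plain Julia. The name splice_all and the "list of byte range => replacement text" input are my own invention here, not anything in LazyJSON; the ranges are assumed to be sorted, non-overlapping and aligned to character boundaries.

```julia
# Copy s to io, substituting replacement text for each given byte range.
# `edits` is a sorted, non-overlapping list of `range => replacement` pairs.
function splice_all(io::IO, s::String, edits)
    pos = 1
    for (r, x) in edits
        # Copy the unchanged text before this range, then the replacement.
        write(io, SubString(s, pos, prevind(s, first(r))))
        write(io, x)
        pos = nextind(s, last(r))
    end
    # Copy whatever is left after the last replaced range.
    write(io, SubString(s, pos))
    return io
end

s = "{\"a\": 1, \"b\": 2}"
edits = [findfirst("1", s) => "10", findfirst("2", s) => "20"]
out = String(take!(splice_all(IOBuffer(), s, edits)))
```

The input text is read once and each output byte is written once, and io could just as well be a file or socket as an IOBuffer.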


#11

splice benchmark:

julia> f = open("ec2-2016-11-15.normal.json", "r"); nothing
julia> j = String(Mmap.mmap(f)) ; nothing
julia> sizeof(j)
1035940

julia> function lazy()
           r = LazyJSON.value(j)
           r = LazyJSON.splice(r, r.shapes.scope.enum[1], "foo")
           s = string(r)
           r = LazyJSON.value(s)
           @assert r.shapes.scope.enum[1] == "foo"
       end

julia> function json()
           r = JSON.parse(j)
           r["shapes"]["scope"]["enum"][1] = "foo"
           s = JSON.json(r)
           r = JSON.parse(s)
           @assert r["shapes"]["scope"]["enum"][1] == "foo"
       end

julia> @time lazy()
  0.005153 seconds (56 allocations: 3.953 MiB)

julia> @time json()
  0.031933 seconds (236.48 k allocations: 13.528 MiB)

#12
for v in x
     push!(d, v)
end

Does DataFrame support an append!(d, x) operation? That might be even faster.


#13

Under certain conditions the parser can be specialized (e.g., ordered type-stable no missing values),

using Dates: Date
using DataFrames: DataFrame
using LazyJSON: parse

str = """
[ 
    { "a": 1, "b": 2, "c": "2010-01-01", "d": "A"},
    { "a": 0, "b": 3.5, "c": "2012-02-15", "d": "B"}
]
"""
obj = parse(str)
function magic(obj::AbstractVector,
               names::Vector{Symbol},
               types::Vector{DataType})
    storage = Tuple(Vector{T}() for T ∈ types)
    for each ∈ obj
        for (output, T, val) ∈ zip(storage, types, values(each))
            push!(output, T(val))
        end
    end
    output = DataFrame()
    for (name, col) ∈ zip(names, storage)
        output[name] = col
    end
    return output
end

#14

@Nosferican, if you use triple quotes for your string, you can avoid all of those escape backslashes.

Cheers!
Kevin


#15

It’s from Dec. 2017, but I-JSON (short for “Internet JSON”) also requiring UTF-8 is from 2015: https://tools.ietf.org/html/rfc7493

Thus, I believe you can worry only about UTF-8. [Also while XML supports more, it’s the default; and I understand Google disregarded the standard, and announced they would only support XML in UTF-8.]


#16

In the real world, data encoded in older versions of a standard doesn’t just “disappear”.
Same thing with languages - just because there is a new Python standard, 3.7, doesn’t mean that there isn’t still a large amount of code written to the old standard, i.e. 2.7.


#17

With “Thus” I really had “Internet-JSON” (the restricted profile of JSON “designed to maximize interoperability”, which it seems is now the JSON RFC 8259) in mind, i.e.:

and I believe all JSON exporting software (written in any language) emit for web-services use UTF-8 (even with JavaScript using UTF-16 internally). Yes, you may have old JSON files lying around (and he also mentioned mmap ), and in theory not in UTF-8. I do think his/other [JSON] software could maybe detect non-UTF-8, i.e. UTF-16 and throw an error. I’m not sure it’s a priority, at least not to support reading it in.

You can make an adaptor for his software. Or use iconv for files.

About detecting BOM/UTF-16 I think the standard library for strings should have the capability at least eventually. It’s been discussed, and an error/exception is not wanted by default, and I think I agree with that, but should be a non-default option.


#18

LazyJSON.jl now supports Julia 1.0: https://github.com/samoconnor/LazyJSON.jl/releases/tag/v0.0.5


#19

So JSON2.jl or LazyJSON.jl? I’m confused :sweat_smile: