[ANN] JSON3.jl - Yet another JSON package for Julia

I’m pleased to share the registration of a new package, JSON3.jl, in the General registry, available immediately.

Let’s cut right to the chase and answer the elephant questions in the proverbial discourse room: Why do we need another JSON package in Julia? What does it offer that JSON.jl, JSON2.jl, or LazyJSON.jl do not? Why spend time and effort developing something that’s “already solved”?

JSON3.jl was born from the spark of three separate ideas, and a vision that they could come together to make the best, most performant, simple, yet powerful JSON integration for Julia possible. It also exists as a way to “prove out” these ideas before trying to potentially upstream improvements into a more canonically named package like JSON.jl. I fully believe the package is ready for full-time use and reliance, but similar to JSON2.jl, it exists as a way to try out a different JSON integration API to potentially make things better, faster, easier.

Semi-Lazy Native Parsing

Taking a lazy approach to JSON parsing is not a new concept (see LazyJSON.jl), but the current approach in LazyJSON.jl has two slight disadvantages: 1) the initial object wrapping has high overhead for small JSON objects, and 2) the completely lazy approach to accessing key-value pairs introduces overhead when iterating an entire object (see below for performance comparisons). JSON3.jl takes a semi-lazy parsing approach, where objects, arrays, and strings are parsed lazily, while numbers, booleans, and null values are parsed immediately. In addition, each object, array, and string carries its total self-length, which makes accessing individual key-value pairs or iterating slightly faster because entire objects/arrays/strings can be skipped over.
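
To make the self-length idea concrete, here is a minimal sketch of a flat tape in which each array records how many slots it occupies, so a reader can hop over a nested value without visiting its contents. The layout, the `NUM`/`ARR` tags, and the helper names `encode!`, `slots`, and `element_position` are all assumptions made for illustration; this is not JSON3.jl's actual internal format.

```julia
const NUM = 1   # tag: next slot holds an integer value
const ARR = 2   # tag: next slot holds the total slot count of the array

# A number takes two slots: the tag and the value.
encode!(tape::Vector{Int}, x::Integer) = push!(tape, NUM, x)

# An array takes a two-slot header followed by its encoded elements.
function encode!(tape::Vector{Int}, xs::AbstractVector)
    start = length(tape) + 1
    push!(tape, ARR, 0)                          # length is patched in below
    for x in xs
        encode!(tape, x)
    end
    tape[start + 1] = length(tape) - start + 1   # total slots, header included
    return tape
end

# Number of slots occupied by the value that starts at `pos`.
slots(tape, pos) = tape[pos] == NUM ? 2 : tape[pos + 1]

# Position of the n-th element of the array starting at `pos`.
# Earlier elements are skipped in one hop each, however deeply nested.
function element_position(tape, pos, n)
    p = pos + 2
    for _ in 1:(n - 1)
        p += slots(tape, p)
    end
    return p
end

tape = encode!(Int[], [[1, 2, 3], [4, 5], 6])
p = element_position(tape, 1, 3)   # two hops, the inner arrays are never scanned
third = tape[p + 1]                # 6
```

Without the stored lengths, reaching the third element would mean walking through every value inside the first two arrays.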

Another powerful advantage of the semi-lazy approach in JSON3.jl is the ability to get a strongly-typed JSON3.Array{T} when parsing JSON arrays. (Note that JSON3.Array{T} is a lazy, JSON3-defined type different from Base.Array{T, N}.) The semi-lazy parsing approach allows JSON3.jl to identify homogeneous arrays and “flag” the type with the concrete element type that is parsed, which, when combined with usage of the array later, allows the Julia compiler to generate extremely efficient code (see array iteration benchmarks below).
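
The payoff of a concrete element type can be illustrated without JSON3 at all. The sketch below uses a hypothetical helper, `narrow`, that checks eagerly whether a `Vector{Any}` is homogeneous and, if so, converts it to a `Vector{T}`; JSON3.jl performs its detection during parsing, so this is only an analogy for why the flagged type helps.

```julia
# Hypothetical helper: return a Vector{T} when every element has the same
# concrete type T, otherwise return the input unchanged.
function narrow(xs::Vector{Any})
    isempty(xs) && return xs
    T = typeof(first(xs))
    return all(x -> typeof(x) === T, xs) ? convert(Vector{T}, xs) : xs
end

# Same loop as in the benchmarks below.
function total(arr)
    x = 0
    for v in arr
        x += v
    end
    return x
end

ints  = narrow(Any[1, 2, 3])     # Vector{Int}: the loop compiles to plain integer adds
mixed = narrow(Any[1, 2.5, 3])   # stays Vector{Any}: each `+` is dispatched at run time
```

Both vectors work with `total`, but only the narrowed one lets the compiler specialize the loop body on a known element type.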

This technique/feature is accessible in JSON3 via “native parsing” by calling JSON3.read(json_str); for strings, numbers, booleans, or null values, the direct value will be returned; for objects and arrays, custom JSON3.Object and JSON3.Array{T} objects will be returned, which employ the semi-lazy approach discussed. JSON3.Object implements the AbstractDict interface (acts like a Dict), and also allows for accessing key-value pairs via getproperty, as in JavaScript (i.e. you can do obj.keyname). JSON3.Array{T} implements the AbstractArray interface, so it supports the normal iteration, getindex, etc.

Compiler-friendly Custom Struct Code Generation

In JSON2.jl, I wanted to provide really fast ways to read/write custom Julia structs for JSON. I took a bit of a “shotgun” approach by trying out at least 3 ways, using combinations of introspection, macros, and heavy use of @generated functions to generate struct-specific code. While the goal was achieved in simpler cases, the code was extremely complex, hard to maintain/edit, and inscrutable to those wishing to contribute. What’s more, there were a few worst-case scenarios where the code generation would get out of hand leading to entire application pauses because JSON2.jl was over-compiling for some crazy struct.

In JSON3.jl, the custom struct support has been overhauled to be drastically simpler, achieve excellent performance, and avoid worst-case compiling scenarios; techniques utilized include:

  • relying on the compiler’s excellent capabilities to do struct introspection at compile-time
  • utilizing techniques similar to those in Base.Cartesian for simple, straightforward code generation using Base.Cartesian.@nexprs and Base.Cartesian.@ncall
  • introducing code generation limits, specializing structs with < 32 fields, with fallbacks to handle larger cases

The equivalent code is several hundred lines shorter, more performant, more understandable, and avoids any compiler danger zones.
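
As a rough illustration of the last two techniques, the sketch below unrolls per-field code with `Base.Cartesian.@nexprs` inside a `@generated` function and switches to an ordinary loop once a struct reaches 32 fields. The function name `fieldpairs`, the `Point` struct, and the choice of collecting `name => value` pairs are assumptions for the example, not code taken from JSON3.jl.

```julia
using Base.Cartesian: @nexprs

struct Point
    x::Int
    y::Float64
end

# Hypothetical helper: collect `name => value` for every field of `x`.
@generated function fieldpairs(x::T) where {T}
    N = fieldcount(T)
    if N < 32
        # Small struct: emit N straight-line assignments, one per field.
        return quote
            out = Vector{Pair{Symbol,Any}}(undef, $N)
            @nexprs $N i -> out[i] = (fieldname(T, i) => getfield(x, i))
            return out
        end
    else
        # Large struct: emit a plain loop so the generated code stays small.
        return quote
            out = Vector{Pair{Symbol,Any}}(undef, $N)
            for i in 1:$N
                out[i] = fieldname(T, i) => getfield(x, i)
            end
            return out
        end
    end
end

pairs_of_point = fieldpairs(Point(1, 2.0))   # [:x => 1, :y => 2.0]
```

The cutoff bounds how much code any single struct can cause the compiler to generate, which is the kind of worst case the limit is meant to prevent.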

A Novel, More Julian, Approach to Struct Mapping

The JSON3.jl approach to declaring how your struct should map to JSON begins with the assumption that every struct falls into one of two general categories: a “data” type or an “interface” type. “Data” types are basically a collection of properties that make up an object; the type exists to bundle related fields that are operated on together and that generally carry some semantic value as a bundle. Their natural JSON representation is a JSON object where each field name is treated as a JSON key, and each field value as the corresponding JSON value.

“Interface” types, on the other hand, have private, internal fields, and are mainly useful via the access patterns they define; many Base or library-provided structs are like this. For example, Base.Dict has several internal fields that are mostly cryptic if viewed on their own, but with powerful interface methods like getindex, setindex!, and iteration of key-value pairs, the Dict provides a meaningful implementation of the “hash table” data structure. To map these kinds of structs to JSON, we definitely don’t want to consider their internal fields; instead we want to map them to one of the existing JSON types: object, array, string, number, boolean, or null.

JSON3.jl defines a trait-based approach to conveniently declare the “JSON struct type” of a custom struct, using one of the following traits:

# data types
JSON3.StructType(::Type{T}) = JSON3.Struct()
JSON3.StructType(::Type{T}) = JSON3.Mutable()
# json types for interface types
JSON3.StructType(::Type{T}) = JSON3.ObjectType()
JSON3.StructType(::Type{T}) = JSON3.ArrayType()
JSON3.StructType(::Type{T}) = JSON3.StringType()
JSON3.StructType(::Type{T}) = JSON3.NumberType()
JSON3.StructType(::Type{T}) = JSON3.BoolType()
JSON3.StructType(::Type{T}) = JSON3.NullType()
# subtype dispatch for abstract types
JSON3.StructType(::Type{T}) = JSON3.AbstractType()

“Data” types will use one of the Struct or Mutable traits, while “interface” types will declare one of the JSON types and ensure they satisfy the required interface. The JSON3.AbstractType trait is for a specialized JSON reading scenario where the type of a JSON object is included as a key-value pair in the object itself, so a sort of “subtype dispatch” should be used to map JSON to the correct Julia struct.
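
For readers new to trait-based dispatch, here is a self-contained toy version of the pattern. Every name in it (`Kind`, `DataKind`, `kind`, `tojson`, `Point`, `Tag`) is a hypothetical stand-in chosen for the example; it does not reproduce JSON3.jl's definitions, and the toy writer does no string escaping.

```julia
# Hypothetical stand-ins for the traits; not JSON3's own definitions.
abstract type Kind end
struct DataKind   <: Kind end   # fields become JSON keys
struct ArrayKind  <: Kind end   # elements become a JSON array
struct StringKind <: Kind end   # printed form becomes a JSON string
struct NumberKind <: Kind end

# Declarations for a few Base types.
kind(::Type{<:Number})         = NumberKind()
kind(::Type{<:AbstractString}) = StringKind()
kind(::Type{<:AbstractVector}) = ArrayKind()

# The writer looks up the trait once, then dispatches on it.
tojson(x) = tojson(kind(typeof(x)), x)
tojson(::NumberKind, x) = string(x)
tojson(::StringKind, x) = string('"', x, '"')
tojson(::ArrayKind, x)  = "[" * join((tojson(v) for v in x), ",") * "]"
function tojson(::DataKind, x)
    parts = [string('"', n, "\":", tojson(getfield(x, n)))
             for n in fieldnames(typeof(x))]
    return "{" * join(parts, ",") * "}"
end

# A "data" type: one line opts it in.
struct Point
    x::Int
    y::Int
end
kind(::Type{Point}) = DataKind()

# An "interface" type: its field is an internal detail, so it maps to a string.
struct Tag
    name::Symbol
end
Base.print(io::IO, t::Tag) = print(io, t.name)
kind(::Type{Tag}) = StringKind()

point_json = tojson([Point(1, 2), Point(3, 4)])
tag_json   = tojson(Tag(:urgent))
```

The writer never inspects a type directly; it only asks which kind the type declared, so supporting a new type is a one-line method definition.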

Full documentation is provided for each JSON3.StructType trait, but please raise issues if something isn’t clear.

Sorry for the diatribe here, but hopefully it’s useful to hear a little bit of context/background on why another JSON package is being registered and publicized.

Benchmarks

LazyJSON.jl vs. JSON3.jl

Small object parse and iterate over each key-value pair:

julia> str = """{
       "a": 1,
       "b": 2,
       "c": 3
       }
       """
"{\n\"a\": 1,\n\"b\": 2,\n\"c\": 3\n}\n"

julia> @btime LazyJSON.value(str)
  199.637 ns (1 allocation: 32 bytes)
LazyJSON.Object{Nothing,String} with 3 entries:
  "a" => 1
  "b" => 2
  "c" => 3

julia> using JSON3

julia> @btime JSON3.read(str)
  108.741 ns (3 allocations: 384 bytes)
JSON3.Object{Base.CodeUnits{UInt8,String}} with 3 entries:
  :a => 1
  :b => 2
  :c => 3

julia> function access_each(obj)
           x = 0
           for (k, v) in obj
               x += v
           end
           return x
       end
access_each (generic function with 1 method)

julia> v = LazyJSON.value(str)
LazyJSON.Object{Nothing,String} with 3 entries:
  "a" => 1
  "b" => 2
  "c" => 3

julia> @btime access_each(v)
  1.723 μs (21 allocations: 672 bytes)
6

julia> v2 = JSON3.read(str)
JSON3.Object{Base.CodeUnits{UInt8,String}} with 3 entries:
  :a => 1
  :b => 2
  :c => 3

julia> @btime access_each(v2)
  1.402 μs (0 allocations: 0 bytes)
6

Sum elements of a number array:

julia> str = "[1,2,3,4,5,6,7,8,9,10]"
"[1,2,3,4,5,6,7,8,9,10]"

julia> a = LazyJSON.value(str)
10-element LazyJSON.Array{Nothing,String}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> a2 = JSON3.read(str)
10-element JSON3.Array{Int64,Base.CodeUnits{UInt8,String}}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> function access_each(arr)
           x = 0
           for v in arr
               x += v
           end
           return x
       end
access_each (generic function with 1 method)

julia> @btime access_each(a)
  5.398 μs (50 allocations: 1.56 KiB)
55

julia> @btime access_each(a2)
  26.463 ns (0 allocations: 0 bytes)
55

That’s really interesting. I really don’t like JSON, but I use it all the time because it’s the only non-tabular format that my colleagues may in principle be willing to ingest, even if they usually refuse to do so in practice. A system like this for having a clear, consistent, relatively safe way of serializing all of my objects to JSON without much effort is something I badly need.

I’d like to spend more time digging into this, but a few comments from the few minutes I spent taking a look:

  • One of my biggest worries when attempting to serialize objects into JSON is getting bad defaults. I have a lot of custom code for writing objects into JSON, and the reason for most of it is fear that I will inadvertently write something in a format I don’t expect. For example, try doing JSON3.write(skipmissing([1,missing,3])) or
julia> v = sparse([1 0; 0 1])
2×2 SparseMatrixCSC{Int64,Int64} with 2 stored entries:
  [1, 1]  =  1
  [2, 2]  =  1

julia> JSON3.write(v)
"[1,0,0,1]"

I can’t speak for everyone, but that’s a big risk for me. I know there are ways to override this behavior, but in my case I’d be much happier (and safer) if the JSON package were more willing to fail. That way at least I’d know I have to do something, and not worry so much.

  • You might want to rethink the naming conventions of some of the functions. As you know, in Julia the first argument of write is usually the thing being written to, like write(io, "hello"). I think the naming conventions of the original JSON were pretty good. I also find re-using universally recognized names from Base such as Array to be really confusing (especially for people working on the package itself), but I know opinions differ on this, and that some consider the ability to do this to be one of Julia’s great strengths.
  • I suggest using MacroTools.jl for macro codes, especially for the sake of your future self or anyone else who might want to work on the package. I don’t know if there’s as much here as in JSON2.jl.

Anyway, thanks for all your work on this! I could definitely see myself using this at some point, and I’d love to see JSON.jl get some input from this, whether that comes during a major overhaul or piecemeal.

Yeah, this is certainly worth thinking about; it would be pretty easy to change this line to something like JSON3.NoStructType(), and basically throw an error unless you specifically define a StructType. I feel inclined to do something like that, and if there are things from Base that happen to not be defined, we can add them as needed.

Just to clarify: what exactly do you see wrong with the SparseArray case? Would you expect it to be written out as an array of arrays? The main issue w/ writing as an array of arrays is that it gets really difficult to roundtrip: trying to detect the case where you’re reading an array of arrays, then rearranging things to be a Matrix, is definitely non-trivial. At least w/ writing it as a straight array, you can do a quick reshape after reading it back in to get back to what you had.
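
In plain Julia terms, the flat-then-reshape roundtrip looks like the sketch below. It uses `vec` as a stand-in for the flattening step and assumes the caller already knows the original dimensions, since a flat JSON array does not carry them.

```julia
m = [1 2; 3 4]

flat = vec(m)                    # column-major flattening: [1, 3, 2, 4]
restored = reshape(flat, 2, 2)   # dimensions must be supplied by the reader
```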

Yeah, I think using something like JSON3.Array{T} doesn’t feel right to some people, but personally I really like sticking to the JSON spec names in this case and it feels very natural, especially since it operates just like a Base.Array in most cases. I’m also a big fan of using these kinds of non-exported API names (CSV.read, JSON3.read, etc.) because it doesn’t clutter up the global namespace or potentially clash with something else, while still being easy to type and understand what we’re referring to.

I’ve never used MacroTools.jl, but we’re literally just using the two macros I mentioned; JSON3.jl doesn’t define any macros itself (that users would ever use), which I think is more of the use-case for something like MacroTools.jl.

Yeah, I think it’s just a matter of finding a happy medium. It just occurred to me that it might make sense to define different functions for permissive and strict serialization, but that would seem to have the major disadvantage that I think it would massively complicate your interface. Roughly speaking, I think the default should be:

  • AbstractArray → JSON lists (nested in case of higher-rank arrays)
  • Tuple → JSON lists
  • AbstractDict → JSON dicts
  • AbstractString → JSON strings
  • Integer → JSON ints
  • AbstractFloat → JSON float (with optional precision)
  • Perhaps some other special cases I’m not thinking of?
  • Everything else: throw an error.

Of course this doesn’t address the issue that in JSON only strings are valid keys (I find it pretty annoying that it doesn’t even support integer keys). The only way of dealing with this would seem to be calling string on whatever the key is, which I think is what you’re doing. It might be worth catching particularly crazy cases of that too, not sure.

Well, the main problem is that it’s rank-1 when it should be a matrix. It should serialize to [[1, 0], [0, 1]] or something like that. Of course, it would be nice if it were also actually sparse, but as far as I can tell the only thing you could possibly have done about that would be to specifically define methods for SparseMatrixCSC. I definitely agree that AbstractArray at the very least should have fall-back methods.
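
A sketch of the nested form in plain Julia, with one inner vector per row and a possible way to rebuild the matrix afterwards. The row-major choice is an assumption; the thread does not settle on an orientation.

```julia
m = [1 2; 3 4]

# One inner vector per row, i.e. [[1, 2], [3, 4]].
nested = [collect(row) for row in eachrow(m)]

# Rebuild: hcat turns each row vector into a column, so transpose afterwards.
restored = permutedims(reduce(hcat, nested))
```

Unlike the flat form, the nested form keeps the shape in the data itself, at the cost of the reader having to recognize that an array of equal-length arrays is meant to be a matrix.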

Haha, it’s funny you mention these defaults, because that’s exactly how things are implemented in the package right now; that is, we just treat Base types like custom types and define their JSON3.StructType to be one of the JSON types:

Nice, in that case the only thing I’d have done differently would be to have it fail more for other types.

Well, like I said, the problem is that we have a generic fallback defined here, which says, "hey, if a type doesn’t define a more specific StructType, then treat it as a basic JSON3.Struct()" which means we get the behavior like you mentioned w/ skipmissing where we just write out the fields by default. I mentioned I’d be willing to get rid of that definition and have it throw instead, which seems like a nice way to avoid those hard-to-debug scenarios where you forgot to define the StructType for your type.
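
The difference between the two fallbacks can be sketched with a toy trait function. The names `trait`, `DataTrait`, `NoTrait`, and `describe` are hypothetical and only mirror the idea of a `NoStructType`-style default; they are not the package's code.

```julia
abstract type Trait end
struct DataTrait <: Trait end   # type opted in: its fields may be written out
struct NoTrait   <: Trait end   # nothing declared for this type

# Strict fallback: any type without its own method gets NoTrait.
trait(::Type) = NoTrait()

describe(x) = describe(trait(typeof(x)), x)
describe(::DataTrait, x) = "fields: " * join(fieldnames(typeof(x)), ", ")
describe(::NoTrait, x) =
    throw(ArgumentError("$(typeof(x)) doesn't have a defined trait"))

struct Declared
    a::Int
end
trait(::Type{Declared}) = DataTrait()

struct Forgotten    # no trait method defined
    a::Int
end

ok = describe(Declared(1))
err = try
    describe(Forgotten(1))
catch e
    e
end
```

Changing the fallback line to return `DataTrait()` would reproduce the permissive behavior, where `Forgotten` is silently written out field by field.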

That would definitely make me much better off; it would be interesting to hear if there are any differing opinions on the matter.

I agree with your suggestion; I very much prefer throwing errors instead of providing defaults that can backfire in some scenarios.

When externalizing objects from a language as complex as Julia, I think one should just let go of the idea that it should “just work” even for complicated cases, and be prepared to make various choices explicit. Instead of offering comprehensive fallback/default choices, this should just ideally be made easy. I think the trait-based solution is very friendly.

On the other hand, when I try to log an application error, my logs are structured JSON, and some unexpected type has snuck in there (perhaps causing the error, perhaps not), I’d prefer that logging succeed with some output that is awkward to read rather than have no useful logs collected.

This is really cool. The only aspect I find hard to accept is the subtypekey. Adding an additional type::String field to all my concrete types – a field that will always contain the name of said concrete type – is irritating, for lack of a better word. I think I understand its purpose, but it would be amazing to avoid the need of adding this type field.

For me this becomes relevant when I need to save a vector that contains elements that can be of any of the sub (concrete) types of a single abstract type. It just occurred to me that I might just try to read the json string as a Vector{Any} instead of Vector{MyAbstractType}. If this works, it would be preferable to adding a type::String field to all the concrete types that this vector might contain…
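
For anyone following along, “subtype dispatch” on a type key can be pictured with the toy reader below. It works on already-parsed `Dict`s rather than JSON text, and the `Vehicle` hierarchy, the `SUBTYPES` registry, and the `construct` function are all hypothetical; it is not how JSON3.jl implements the feature. In this toy the tag lives only in the JSON-side data, not in the structs.

```julia
abstract type Vehicle end
struct Car <: Vehicle
    make::String
end
struct Truck <: Vehicle
    payload::Float64
end

# Hypothetical registry: value of the "type" key => concrete Julia type.
const SUBTYPES = Dict("car" => Car, "truck" => Truck)

# Pick the concrete type from the tag, then fill its fields by name.
function construct(::Type{Vehicle}, obj::AbstractDict)
    T = SUBTYPES[obj["type"]]
    args = (obj[string(n)] for n in fieldnames(T))
    return T(args...)
end

objs = [Dict("type" => "car", "make" => "Mazda"),
        Dict("type" => "truck", "payload" => 2.5)]
vehicles = [construct(Vehicle, o) for o in objs]
```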

Thanks again for this awesome package. I’m sure it’ll become the standard JSON package.

Having support for multidimensional arrays would be very nice. I see that the relevant test is broken.

Maybe I missed something here, but how does this work with parametric types? Do we need to define a JSON3.StructType for every parameterized combination:

using JSON3
struct MyParametricType{T}
    t::T
    MyParametricType{T}(t) where {T} = new(t)
end
MyParametricType(t::T) where {T} = MyParametricType{T}(t)

x = MyParametricType(1)

JSON3.StructType(::Type{MyParametricType}) = JSON3.Struct()
str = JSON3.write(x) # ERROR: ArgumentError: MyParametricType{String,Int64} doesn't have a defined `JSON3.StructType`

JSON3.StructType(::Type{MyParametricType{Int}}) = JSON3.Struct()
str = JSON3.write(x) # fine

I posted an issue on this, but I’ll gladly close it if someone can show me what I’ve missed.

I haven’t tried this at all but isn’t it enough to just write

JSON3.StructType(::Type{<:MyParametricType}) = JSON3.Struct()

?

Yes it is! Thanks! I’ll add this to the docs.
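
The reason the `<:` form is needed can be seen with plain dispatch, independent of JSON3. In this sketch `Box`, `exact`, and `covariant` are hypothetical names: `Type{Box}` matches only the unparameterized type `Box` itself, while `Type{<:Box}` also matches every `Box{T}`.

```julia
struct Box{T}
    value::T
end

# Matches only the type object `Box`, not `Box{Int}`.
exact(::Type) = false
exact(::Type{Box}) = true

# Matches `Box` and every parameterization of it.
covariant(::Type) = false
covariant(::Type{<:Box}) = true

exact(Box)            # true
exact(Box{Int})       # false: falls through to the generic method
covariant(Box{Int})   # true
```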

Very interesting! I recently did a lot of work to optimise BSON.jl and it sounds like there is quite a lot of overlap. However, you seem to have gone further in some areas. It occurs to me that a lot of this could probably be generalised to BSON, JSON, and probably a lot of other formats as well. It is mostly the same patterns being repeated, with quite a high cost in terms of development time.
