[ANN] TypedJSON.jl - A Julia serialization library prioritizing type fidelity, human-readability, and long-term archival

Dear all,
this is an announcement for TypedJSON.jl, (yet another!) JSON serialization library focusing on type fidelity and long-term archival.

It came out of my specific needs to:

  • Easily serialize/deserialize from/to Julia structures, with limited additional boilerplate and a clear way to support new structures;
  • Be absolutely sure that even 10 years from now (i.e. when the structure definition has changed…) I can read back my data;
  • Be able to read the same data from Python, JavaScript, etc.

There is likely room for improvement, so comments / suggestions are appreciated!

Thanks for writing this! I frequently run into the issue of having to reconstruct composite types after they have changed, so this will be a nice addition to my toolkit.

I don’t mind having to write a reconstruction method; I think of this as a feature. I also like the human-readable output.

Do you have an example of what TypedJSON.jl can do that JSON.jl v1 cannot? I was not able to gather that from the README.

And some examples of the JSON it produces would be helpful too.

This is pretty vibe-coded, right?

If so, then I think we currently ask for disclosure in the Readme.md.

Some general notes:

  1. As far as I understand, you don’t round-trip element types for arrays → you lose fidelity and performance.
  2. You decide to serialize “difficult” objects like Function as nothing. Given your intended use for long-term storage (i.e. allowing a human to use the data long after all code, docs, and first-hand experience handling it have been lost), I think some configurable warning would be appropriate whenever your serialization discards data. It would suck for future people to pick up a long-abandoned project or help with a replication attempt, only to discover that all the data is lost because it lived in a closure.
  3. I have not seen an obvious way to pop a shell on deserialization of malicious data. However, the general architecture is a recipe for insecure deserialization. Maybe explicitly document that people shouldn’t reconstruct / deserialize untrusted input?
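
To make point 2 concrete, here is a minimal sketch of what a configurable discard hook could look like. Everything in it (`to_plain`, `warn_discard`, the `on_discard` keyword, the `Experiment` struct) is hypothetical and made up for illustration; it is not TypedJSON.jl's API, just one way to avoid dropping a closure silently.

```julia
# Hypothetical sketch, not TypedJSON.jl's API: turn a value into plain
# JSON-like data and report every value that cannot be represented.

# Default policy: warn and name the path of the discarded value.
warn_discard(path, value) =
    @warn "Discarding value during serialization" path valuetype = typeof(value)

# Scalars and strings pass through unchanged.
to_plain(x::Union{Nothing,Number,AbstractString}, path, on_discard) = x

# Vectors are converted element by element, extending the path.
to_plain(x::AbstractVector, path, on_discard) =
    [to_plain(v, string(path, "[", i, "]"), on_discard) for (i, v) in enumerate(x)]

# Functions (closures included) have no faithful JSON form: report, then drop.
function to_plain(f::Function, path, on_discard)
    on_discard(path, f)
    return nothing
end

# Fallback for structs: convert field by field.
function to_plain(x::T, path, on_discard) where {T}
    return Dict{String,Any}(
        String(name) => to_plain(getfield(x, name), string(path, ".", name), on_discard)
        for name in fieldnames(T)
    )
end

# Entry point: the caller chooses what happens when data is discarded.
to_plain(x; on_discard = warn_discard) = to_plain(x, "root", on_discard)

# Usage: collect the discarded paths instead of warning.
struct Experiment
    name::String
    samples::Vector{Float64}
    transform::Function
end

discarded = String[]
plain = to_plain(Experiment("run1", [1.0, 2.0], x -> 2x);
                 on_discard = (path, value) -> push!(discarded, path))
```

With a hook like this, the caller can choose between warning, logging, and throwing, which matters for an archival format.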

The general architecture that invites shell-popping: types for reconstruction use a global registry, namely the dispatch table for specializations of TypedJSON.reconstruct(::Val{<:Symbol}, ::Any), where the symbol is attacker-controlled. Hence, each loaded package that defines reconstructions contributes new gadgets. This is similar to deserialization gadgets in Java, where any unforeseen interaction between things on the classpath can make deserialization unsafe. The general consensus is that “control your classpath such that no gadget chain exists” is not viable (it breaks all modularity!), and Java deserialization must be considered unsafe in any environment.
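
A rough sketch of that pattern, to show why the registry is global. The names `reconstruct` and `load_tagged` and the two structs are made up for this post and are not taken from the package's source; only the Val-dispatch idea is.

```julia
# Hypothetical sketch of a Val-dispatched reconstruction registry.

# Packages add methods to this function to register a reconstructor for a tag.
function reconstruct end

# The tag comes straight from the input file, so the data (not the caller)
# selects which method runs.
load_tagged(tag::AbstractString, payload) = reconstruct(Val(Symbol(tag)), payload)

# "Package A" registers a harmless type.
struct Point
    x::Float64
    y::Float64
end
reconstruct(::Val{:Point}, d) = Point(d["x"], d["y"])

# "Package B", loaded elsewhere in the same session, registers another tag.
# From then on, every call to `load_tagged` anywhere in the program can reach it.
struct Report
    lines::Vector{String}
end
reconstruct(::Val{:Report}, d) = Report(d["lines"])

# The caller only ever expected points, but the input decides:
p = load_tagged("Point", Dict("x" => 1.0, "y" => 2.0))
r = load_tagged("Report", Dict("lines" => ["anything"]))
```

Nothing here is dangerous on its own. The point is that the set of methods of `reconstruct` grows with every loaded package, and no call site can narrow it.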

With this general architecture, you will never achieve safe / secure deserialization. Which is totally fine if it is not your design goal, just be very upfront about it in your Readme.md.

Sorry, I overlooked this aspect…
I added a new section in the README: Compare with JSON.jl

Good point, thanks!
I added a new section: How do the “typed JSON” looks like?

The package code is not AI-generated, while the test suite is mainly provided by Gemini.

In some cases the types are not properly reconstructed; I added a note in the Caveats section.
Performance is not a goal here…

The only way to check for correct serialization and deserialization is to use the roundtrip function. I added a clear statement in the Caveats section.

This is a very interesting topic, thank you for raising it!
But it’s not clear to me how one can perform shell-popping here by just providing a malicious JSON file to deserialize. Can you provide an explicit example?

As far as I understand, you need both the JSON file and a malicious package to do the job, so the problem reduces to warning the user not to install untrusted packages… Or am I missing something?

No, not with the gadgets you provide – as I said, I did not see an obvious way.

I can give you two analogies:

  1. JuliaIO/JLD2.jl issue #117 (“Security issue: Type confusion, convert called during deserialization”) is an example of how to pop a shell when deserializing JLD2 files.
  2. In the Java world, https://medium.com/@dub-flow/deserialization-what-the-heck-actually-is-a-gadget-chain-1ea35e32df69 explains the notion of deserialization gadgets.

In JLD2, I used the following gadgets:

  1. Broadcasted objects can execute shell commands on index access, e.g. Base.Broadcast.Broadcasted(run, ([`cat /etc/passwd`],)).
  2. convert accesses indices.
  3. JLD2 sometimes calls constructors that call convert during deserialization.
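
Gadget 1 can be seen in a harmless form. The sketch below swaps `run` for a counting function (`noisy`, a made-up name) to show that a `Broadcasted` object is lazy: constructing it calls nothing, but a plain index access calls the stored function. It relies on `Base.Broadcast.Broadcasted`, which is not a stable public interface, so treat this as an illustration only.

```julia
# Benign stand-in for `run`: counts how often it gets called.
calls = Ref(0)
noisy(x) = (calls[] += 1; 10x)

# Constructing the lazy wrapper does not call `noisy` ...
bc = Base.Broadcast.Broadcasted(noisy, ([1, 2, 3],))
calls_after_construction = calls[]

# ... but indexing into it does.
value = bc[2]
calls_after_index = calls[]
```

Any code path that indexes into a value it believes to be an ordinary array will therefore call whatever function was stored in the wrapper.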

Your situation is better than jld2’s: You only deserialize things that have explicit reconstruct methods.

The bad thing is that the set of gadgets is still global: for you, it is all methods of reconstruct (which packages and users are expected to extend); in Java, it is all classes implementing Serializable on the classpath. Hence, it is not enough to ask whether “one package (including its upstream / dependencies)” has usable gadgets. In this approach, security does not compose: the whole program / environment needs to be analyzed. In other words, using Foo and using Bar may each lead to a safe environment, but using Foo, Bar may contain an exploitable gadget chain, due to interactions between Foo and Bar gadgets that are benign in isolation.

This makes security unmaintainable.

The annoying fix is that each deserialize call must itself restrict the types it can produce to a bounded world. Then, assuming no type piracy, each such call / world can be analyzed in isolation. In other words: you can safely deserialize Foo objects in one part of your program, and you can safely deserialize Bar objects in another part; but you cannot safely deserialize mixed Foo/Bar objects, due to bad interactions. Hence, the set of deserializable objects must be local to each part of your program, not global.
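
A minimal sketch of such a call-site-local allowlist. All names (`load_with`, `FOO_ONLY`, `BAR_ONLY`, and the `Foo` / `Bar` structs) are hypothetical; this is one possible shape of the idea, not something the package provides.

```julia
# Hypothetical sketch: each call site passes the exact set of tags it accepts,
# so loading another package cannot widen what this call can construct.
struct Foo
    a::Int
end

struct Bar
    b::String
end

# Two independent "worlds", each small enough to audit in isolation.
FOO_ONLY = Dict{String,Function}("Foo" => d -> Foo(d["a"]))
BAR_ONLY = Dict{String,Function}("Bar" => d -> Bar(d["b"]))

function load_with(allowed::AbstractDict, tag::AbstractString, payload)
    builder = get(allowed, tag, nothing)
    # Unknown tags are rejected instead of being looked up in a global table.
    builder === nothing &&
        throw(ArgumentError("tag $tag is not allowed at this call site"))
    return builder(payload)
end

foo = load_with(FOO_ONLY, "Foo", Dict("a" => 1))

# The same input tag that is fine elsewhere is refused here.
bar_rejected = try
    load_with(FOO_ONLY, "Bar", Dict("b" => "x"))
    false
catch err
    err isa ArgumentError
end
```

The cost is the annoying part: every call site has to spell out what it expects, and nested types need their allowlist passed down as well.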

PS.

No. You need a malicious JSON file and a vulnerable combination of packages. Due to the global construction, it is insufficient for each package to be not-vulnerable in isolation.

Thanks for the details!

Does JLD2 present the same problem?
More specifically: I already noticed the warning in its documentation; I am just wondering whether there is some architectural choice I can borrow to improve security.

In any case, I added a warning to the README.

JLD2 has it much worse. They don’t limit deserialization to a whitelist of supported types (your global whitelist is the explicit method table of reconstruct – which depends on the set of loaded packages).

Instead, JLD2, the stdlib serialize/deserialize, and JuliaIO/BSON.jl all try to reconstruct arbitrary types.

Hence, these 3 libraries for serialization / deserialization make it easy to pop a shell. I think I posted examples / proofs of concept somewhere showing how to pop a shell on stdlib deserialize and BSON deserialize.

Your library doesn’t pop a shell when glancing at it, so :+1: for you. And if security against malicious input (i.e. data sharing between strangers) is not your design goal, then I have no complaints.

OTOH, if data sharing between strangers is your goal, à la “let’s publish this scientific data for the world at large in typedJSON form”, then I predict that the underlying design with the global whitelist/reconstruct method will blow up in somebody’s face, 3-5 years down the line, in the same way that Java Serializable has blown up in the entire Java world’s face.

OK, thank you very much for all the clarifications!