I recently came across TOON (Token-Oriented Object Notation), which is described as a data format designed to be more LLM-friendly than JSON. The format aims to reduce token usage while maintaining readability and type safety.
Some interesting features:

- Compact syntax that reduces LLM token consumption
- Built-in type annotations
- Object-oriented structure
There’s an interactive playground for exploring how it compares to JSON in terms of tokenization.
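To give a concrete sense of the compact syntax: the canonical example collapses an array of uniform objects into a single header plus CSV-like rows. Roughly like this (reconstructed from the project's published examples, so treat the exact syntax as approximate and check the playground):

```
# JSON: field names repeated per record
{"users": [{"id": 1, "name": "Alice", "role": "admin"},
           {"id": 2, "name": "Bob", "role": "user"}]}

# TOON: array length and field names declared once, rows like CSV
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```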
There’s already a Python implementation available, which got me thinking about Julia.
Has anyone:

- Heard of or worked with the TOON format?
- Started (or would be interested in) a Julia implementation?
Given Julia’s strengths in parsing and the growing interest in LLM applications, it seems like TOON could be a useful addition to the ecosystem. I’m curious if there’s any existing work in this direction or if others think this would be a valuable package to develop.
Just a heads-up to avoid frustration when you do feel ready to register it: the package name TOON.jl will probably not make it into the General registry (too short/all-caps). I’d recommend writing out the acronym as TokenOrientedObjectNotation.jl. You should also be aware of the guidelines for LLM usage for registered packages.
There are lots of packages with short names in the General registry, but the naming guidelines have evolved over time, and old decisions don’t serve as precedent for new packages.
That said, if TOON were as ubiquitous as JSON it would likely be accepted today, but at this point there is no way to tell whether it will eventually overtake JSON or be all but forgotten in half a year.
Indeed, TOON is a meme on X/Twitter already. Incredibly, the vibe-coders have re-invented a (worse) version of CSV from first principles and made it viral on LinkedIn (the professional social network where no professional software developers exist to explain to them what they’ve done).
If you want to quickly build something in Julia using TOON, you could use the Python library with PyCall / PythonCall / etc. That gets you up and running fast if your goal is just to evaluate TOON.
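A minimal sketch of that route with PythonCall. Note that the module name `toon` and its `encode`/`decode` functions are assumptions for illustration only; substitute whatever the actual Python package exports, and make sure it’s installed in the Python environment PythonCall is linked to:

```julia
using PythonCall  # pkg> add PythonCall

# NOTE: module name and API below are assumptions; adjust to the real package.
toon = pyimport("toon")

# Build the payload as Python objects so the library sees plain dicts/lists.
users = pylist([pydict(Dict("id" => 1, "name" => "Alice")),
                pydict(Dict("id" => 2, "name" => "Bob"))])
data  = pydict(Dict("users" => users))

s    = toon.encode(data)   # Python dict -> TOON string
back = toon.decode(s)      # TOON string -> Python dict, to sanity-check round-tripping
println(pyconvert(String, s))
```

That’s enough to benchmark token counts against `JSON3.write` output before committing to a native Julia port.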
To the extent that TOON works right now, it’s presumably a combination of models being comfortable with YAML and CSV. But in the long run, the “correct” protocol for communicating with models is always a function of what they’ve been pretrained/finetuned on. So, unless the major model players adopt it as a first-class protocol, TOON (or any other protocol) is unlikely to yield maximum performance. This is one of those cases where the innovation has to come from the model providers; it’s not just a question of end-users adopting something based on vibes. (Just my two cents, YMMV)
If the main win of TOON is to avoid repeated field names by compressing into a CSV-like format, it’s presumably easy to cook up something like that with JSON, to get most of the benefits. Or give models literal CSV+JSON+YAML content; but whatever… there are many ways to paint the bikeshed.
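To make that concrete, here’s a tiny stdlib-only Julia sketch (all names are mine, not from any TOON library) that folds an array of homogeneous objects into one fields list plus rows. The result still serializes as ordinary JSON but states each key only once:

```julia
# Collapse [{"id":1,"name":"Alice"}, ...] into a fields/rows table so
# the key names appear once instead of once per record.
records = [Dict("id" => 1, "name" => "Alice"),
           Dict("id" => 2, "name" => "Bob")]

fields = sort!(collect(keys(first(records))))
table  = Dict("fields" => fields,
              "rows"   => [[r[f] for f in fields] for r in records])
# Serialized (e.g. with JSON3.write), this comes out to something like
# {"fields":["id","name"],"rows":[[1,"Alice"],[2,"Bob"]]}
```

This assumes every record has the same keys, which is the same uniformity TOON’s tabular syntax relies on.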
(Okay, I’m getting baited and can’t help myself.) If one really cares about efficient communication with minimal token count, then maybe the best way would be to look into the plethora of serialization formats for storing/transmitting data, find the most suitable one, and then train whatever AI model to become fluent in it.