Parse a large array of small JSON objects?

Hi,

I have a DataFrame that contains a column of several million JSON strings, e.g.

julia> df = DataFrame(j="""{"a": 1, "b": 2}""")
julia> pretty_table(df)
┌──────────────────┐
│                j │
│           String │
├──────────────────┤
│ "{a": 1, "b": 2} │
└──────────────────┘

The simple way to parse them is with something like df[!, "j"] = JSON3.read.(df.j). However, this appears to create a memory leak when I do it repeatedly on different datasets: even when df is no longer reachable, the memory stays in use and the program eventually hits an OOM error. In contrast, this doesn’t happen if I skip the JSON parsing.
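
Roughly, the pattern is the sketch below; the fill(...) line just stands in for loading one of my real datasets:

using DataFrames, JSON3

for batch in 1:100
    # stand-in for loading a new dataset: a million small JSON strings
    df = DataFrame(j=fill("""{"a": 1, "b": 2}""", 1_000_000))
    df[!, "j"] = JSON3.read.(df.j)   # parse each string into a JSON3.Object
    # ... work with df, then let it go out of scope ...
end
# resident memory keeps growing across iterations until the process OOMs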

Is there a better practice for parsing a lot of small JSON objects that I’m not aware of? I saw that JSON3 uses a semi-lazy evaluation method, and was wondering if that’s confusing the garbage collector somehow.

I think the default JSON3.Object etc. keep a reference to the underlying data (that’s the semi-lazy part), but you can tell JSON3 to parse into something else. E.g. JSON3.read(obj, Dict) parses obj into a regular Dict, which does not share memory with the original input. If they all have a common structure, you could also define a struct and use StructTypes to describe how to (de)serialize it, then call JSON3.read(obj, MyStruct) to parse into that struct.
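
For example, something along these lines (MyRecord and its fields are just made up for illustration):

using JSON3, StructTypes

str = """{"a": 1, "b": 2}"""

# Option 1: parse into a plain Dict that owns its own data
d = JSON3.read(str, Dict)

# Option 2: parse into a concrete struct via StructTypes
struct MyRecord
    a::Int
    b::Int
end
StructTypes.StructType(::Type{MyRecord}) = StructTypes.Struct()

r = JSON3.read(str, MyRecord)   # MyRecord(1, 2)

If the schema is fixed, the struct route also gives you concrete field types to work with downstream.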

JSON3.read(obj, Dict) solved the memory leak. Thanks!
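
For anyone finding this later, the change on my side was essentially just passing Dict as the target type in the broadcast:

df[!, "j"] = JSON3.read.(df.j, Dict)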
