How to efficiently parse null as NaN in JSON?

Hi
I want to parse lines of JSON that contain, among other things, a vector of signed ints in which some values are null, like this:

{"data":[98,null,-51,null]}

Currently I do it with a conversion function like this (I want the data as Float16 to save some memory):

function conversion(a)
    # JSON null is parsed as `nothing`; map it to NaN16
    a === nothing ? NaN16 : Float16(a * 0.1)
end
nice_vector = conversion.(JSON3.read(JSON_line)[:data])

This works, but the conversion step itself is twice as expensive as the JSON3.read call (roughly 200 μs for the read operation on a 5000-element vector).

I process each JSON Lines file like this:

function JSON_Lines_read(filename)
    l1 = readline(filename)
    nlines = countlines(filename)
    npoints = l1 |> JSON3.read |> x -> x[:data] |> length
    out = Matrix{Float16}(undef, npoints, nlines)
    for (i, line) in enumerate(eachline(filename))
        out[:, i] = conversion.(JSON3.read(line)[:data])
    end
    return out
end

In total it takes 25 s to process each of my 500 MB JSON Lines files, which doesn’t seem that fast.
I sometimes have hundreds of GB of these files to process, so it would be nice to be able to read them faster.
I naively tried doing the parsing manually with split and parse, but that is at least twice as slow as this method.

Do you think there is a better way to do this?


JSON3.jl has an API for defining custom type conversion rules. It’s quite nice!

I don’t have time tonight to draft up an example solution, but if you have trouble, I could circle back to this later.
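
For reference, here is a minimal sketch of one direction this could take. It does not use custom conversion rules. It only asks JSON3 for a concrete element type, then writes straight into the preallocated column so the broadcast does not allocate a temporary vector. The `Row` alias and the `fill_column!` helper are hypothetical names for illustration, and whether this is actually faster on the real files is untested here.

```julia
using JSON3

# Hypothetical alias: a typed target so the parsed vector has a concrete
# element type (Union{Nothing,Int}) instead of a generic JSON3 array.
const Row = Dict{String,Vector{Union{Nothing,Int}}}

# Hypothetical helper: convert one parsed row directly into column `i`
# of `out`, mapping `nothing` (JSON null) to NaN16.
function fill_column!(out::Matrix{Float16}, i::Integer, data)
    for (j, a) in enumerate(data)
        out[j, i] = a === nothing ? NaN16 : Float16(a * 0.1)
    end
    return out
end

line = """{"data":[98,null,-51,null]}"""
row = JSON3.read(line, Row)          # typed read
out = Matrix{Float16}(undef, 4, 1)
fill_column!(out, 1, row["data"])
```

In the file-reading loop, the same call would replace the broadcast assignment, i.e. `fill_column!(out, i, JSON3.read(line, Row)["data"])`.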