Initial version of my first package: A JSON Lines reader

Thank you! It seems that “JSONLines.jl” is clear to everyone, while “JSONL.jl” is too close to “JSON.jl”.

4 Likes

There’s also newline delimited JSON, which uses a completely different extension (.ndjson). In light of that, JSONLines sounds general enough to accommodate both.

Is each row of a JSONL file supposed to have the same structure?

Or can you have different object shapes on different rows?

If it is the same object in each row, it would be great to implement a Tables.jl interface :slight_smile:

On naming, ultimately your package = your decision :slight_smile:

Suggestion…

I see you return a DataFrame. I don’t use DataFrames so that seems like an unnecessary imposition.

You might consider introducing something like

struct JSONL <: AbstractVector{Object}
    objects::Vector{Object}
end

and implement a Tables.jl interface so if we want to drop it into a DataFrame, we can just do

DataFrame(jsonl::JSONL)

If we want a StructArray (which is what I use), we’d just do

StructArray(jsonl)

Another suggestion…

Consider using JSON3 and allow the user to specify the struct type of each row so you can immediately construct a vector of structs. I think this would also be faster.

Just some ideas thinking aloud. I see so much potential for this :slight_smile:

1 Like

I’m working on the Tables.jl interface right now but am struggling a bit with it. I will ask a separate question once I am fully stuck. JSON3 looks good; I will try that next. Basically, I would read each line and let users pass an optional struct for the types, correct?
Thank you very much for the suggestions!

2 Likes

See topic here. Unfortunately, I cannot get StructArray to work with it.

Each line doesn’t have to have the same structure (no schema à la NoSQL).
I usually do

using JSON3: JSON3, StructType, Mutable
using DataFrames
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    Status() = new()
end
StructType(::Type{Node}) = Mutable()
lns = readlines("file.jsonl")
data = DataFrame(JSON3.read(ln, Node) for ln in lns)

for writing

io = open(touch("file.jsonl"), write = true)
for node in nodes
    JSON3.write(io, node)
    write(io, '\n')
end
close(io)

4 Likes

Thank you for the suggestion. My plan would be that users can supply Node. I guess it is not possible in this case to implement Tables.jl?

Could you give a small example of what file.jsonl could look like so I can play around with it and know exactly what you need?

1 Like

IMHO:
With a small change, it could be extended to process Wikidata JSON dumps:

  • The Wikidata JSON dump is huge; the compressed gz is > 80GB:
    • latest-all.json.bz2 05-Aug-2020 11:19 58459544787
    • latest-all.json.gz 05-Aug-2020 05:40 87793345321
  • it is “a single JSON array”
  • the structure is similar to JSON Lines, with these extras:
    • extra first line [
    • extra last line ]
    • extra comma as a line separator: ",\n" or ",\r\n"
[
{"type":"item","id":"Q31","labels":  .... },
{"type":"item","id":"Q8","labels":   .... },
...
]
  • test command check: zcat latest-all.json.gz | head -n3
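
A minimal sketch of how the three differences listed above could be handled, using only Base Julia. The function name foreach_record and the sample data are hypothetical and not part of any package discussed here; it assumes one record per line, as in the dump excerpt. For the real dump, io would presumably be a decompression stream (for example from CodecZlib.jl) rather than an IOBuffer.

```julia
# Hypothetical helper (not from any package): calls `f` with the JSON text
# of each record in an array-style dump. The "[" and "]" wrapper lines are
# skipped and the trailing comma that separates records is dropped.
function foreach_record(f, io::IO)
    for line in eachline(io)      # eachline drops "\n" and "\r\n"
        s = strip(line)
        (isempty(s) || s == "[" || s == "]") && continue
        f(endswith(s, ",") ? chop(s) : s)
    end
end

# Made-up sample in the same layout as the dump excerpt above.
sample = IOBuffer("[\n{\"type\":\"item\",\"id\":\"Q31\"},\r\n{\"type\":\"item\",\"id\":\"Q8\"}\n]\n")

records = String[]
foreach_record(sample) do record
    push!(records, record)        # each `record` is now one valid JSON Lines row
end
```

Because records are handed to f one at a time, nothing beyond the current line has to be held in memory.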

Because it is a large compressed file (> 82G), the key requirements are:

  • compressed file reading
  • thread support ( channels ? ) for filtering … parallel processing
  • writing the filtered result to similar JSON Dump file
  • speed / multi core support …
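
One way the channel idea could look, as a rough sketch in Base Julia only. The names filter_lines and worker, the buffer size, and the sample lines are all assumptions for illustration; a real version would also need error handling and would probably parse each line before filtering.

```julia
using Base.Threads: @spawn

# Hypothetical worker: takes lines from `input` until it is closed and
# forwards the ones that satisfy `pred`.
function worker(pred, input::Channel{String}, output::Channel{String})
    for line in input
        pred(line) && put!(output, line)
    end
end

# Hypothetical driver: one producer task, `nworkers` filter tasks, and a
# task that closes `output` once every worker is done.
function filter_lines(pred, lines; nworkers = Threads.nthreads())
    input = Channel{String}(1024)
    output = Channel{String}(1024)
    @spawn begin
        foreach(line -> put!(input, line), lines)
        close(input)
    end
    workers = map(1:nworkers) do _
        @spawn worker(pred, input, output)
    end
    @spawn begin
        foreach(wait, workers)
        close(output)
    end
    return collect(output)        # arrival order, not input order
end

# Made-up sample lines.
sample = ["{\"id\":\"Q1\",\"geo\":true}", "{\"id\":\"Q2\"}", "{\"id\":\"Q3\",\"geo\":true}"]
kept = filter_lines(line -> occursin("\"geo\"", line), sample)
```

With several workers the output order is not guaranteed, so anything that depends on the original order would need to carry a line index along.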

My use case:

  • pre-processing wikidata JSON dump
  • filtering geodata related items
  • and write a smaller JSON dump OR load to PostGIS database.

IMHO: It is not critical - because I have a Golang script … but for the future it can be an interesting use case for Julia ( and can be a good benchmark ! )

1 Like

See for example this test dataset. I do have pretty big files but most are confidential.

2 Likes

Maybe off topic, but has anyone tried to port simdjson to Julia?

I think that it might be better to have this work exactly to the spec for JSON Lines. This would be more in the spirit of a typical Julia package – it does exactly what the name implies.

IMHO If people need to read JSON files that are similar to, but not the same as, JSON Lines, then they can hack it together using your package and other JSON readers. Perhaps if you have unexported functions that might help in these regards you can document them.

2 Likes

Makes sense. I guess I can internally expose the character expected before \n so that it can be changed by others.

1 Like

Thank you for taking the time to put that together. I’ll try it out.

I tried running this code with a slight modification but it does not seem to work as a DataFrames input. Using the file you provided as test input:

using JSON3: JSON3, StructType, Mutable
using DataFrames
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    Node() = new()
end
StructType(::Type{Node}) = Mutable()
lns = readlines("test.jsonl")
data = [JSON3.read(ln, Node) for ln in lns]
DataFrame(data)
julia> data = [JSON3.read(ln, Node) for ln in lns]
5001-element Array{Node,1}:
 Node(#undef, #undef, #undef, #undef, #undef)
 Node("phkmqmzpbv", "ufmgvgoure", "sgdibnesgi", "2019-03-05 13:10:51", "tyaekgxmsr")
 Node("ugtkjodxer", "fhffxqbsoa", #undef, "2019-03-04 14:09:58", "bolwagblhx")
 Node("alwwtfyunw", "qccqocykfm", "orjhhcvomh", "2019-03-05 01:37:58", "cagyezgppo")
 ⋮
 Node("zuwjjgexbl", "opmkvyipxm", "sxavzrxldl", "2019-03-04 19:09:19", "tswwdiktno")
 Node("ubaomyspwd", "jdoescksnv", #undef, "2019-03-05 15:18:43", "dbermznthm")
 Node("optbrwcfli", "trrheeevlx", "ooxfkaspca", "2019-03-04 21:51:06", "vhdzurjxro")
 Node("mlcxmfplei", "ojshhztncj", "ygotmuetnj", "2019-03-04 18:14:03", "rukafhqowm")

julia> DataFrame(data)
ERROR: UndefRefError: access to undefined reference

Any idea what I am doing wrong? (The modification is Status() = new() => Node() = new(); otherwise I get a Node is not callable error.)

EDIT: The problem is the #undef values. I can create e.g. a Tables.columntable out of rows 2 and 4.
EDIT2: The following redefinition does work as a Tables.jl input as it avoids undef in construction

using StructTypes
mutable struct Node
    nodeID::Union{String, Missing}
    nodeUserID::Union{String, Missing}
    parentID::Union{String, Missing}
    nodeTime::Union{String, Missing}
    informationID::Union{String, Missing}
    Node() = new(missing, missing, missing, missing, missing)
end
StructTypes.StructType(::Type{Node}) = StructTypes.Mutable()

One thing to remember is that JSON is à la NoSQL, in that not every record has the same schema. For example, in the example JSONL I shared, the first line is a header which has a different schema from lines 2:end. I don’t remember if that particular file consistently uses null for missing vs omitted.

You can also check this other example (will be available for 90 days, just need to download and inflate, 78 lines where each is a JSON Vector).

1 Like

I guess that if the lines do not share a schema, the Tables.jl interface cannot be implemented, but it should now be possible to load such a file. The problem with StructTypes is that #undef is returned for missing values.

1 Like

So a bunch of updates:

  • Using JSON3.jl for parsing each row
  • Returning the vector of JSON3.Objects since that already works with the Tables.jl interface if the schema allows it. This also removed the DataFrames dependency.
  • New keyword argument: structtype allows users to pass a StructTypes.jl struct to the JSON3.read function for each row (the result may still work with Tables.jl, but not necessarily, because undef is returned if a value is not available in a row => any suggestions?)

Edit:
Credit for the Tables.jl insight goes to @piever!

3 Likes

Julia natively supports missing, so why not fill in those spots with it?
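
A small sketch of what that could look like, independent of JSON3: plain Dicts stand in for parsed rows, and the helper name columns_with_missing is made up for illustration. Keys that a row omits are filled with missing when the columns are built.

```julia
# Hypothetical helper: builds one column per key seen in any row and
# fills the gaps with `missing`.
function columns_with_missing(rows)
    names = String[]
    for row in rows, key in keys(row)
        key in names || push!(names, key)
    end
    return Dict(name => [get(row, name, missing) for row in rows] for name in names)
end

# Stand-ins for parsed JSON Lines rows; the second row omits "parentID".
rows = Dict{String,Any}[
    Dict("nodeID" => "a1", "parentID" => "p1"),
    Dict("nodeID" => "b2"),
]

cols = columns_with_missing(rows)
```

The "parentID" column then has element type Union{Missing, String}, which is what the Union{String, Missing} fields in the struct redefinition above achieve on the struct side.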