Initial version of my first package: A JSON Lines reader

Hi All!

I have just “finished” implementing an initial version of a JSON Lines reader (JSONLines.jl) and would like to ask for some feedback. It implements a specific style of JSON Lines file used in other implementations (e.g. ClickHouse), which has one JSON object on each line of the file. The actual JSON parsing is done using the LazyJSON.jl library. The most interesting feature, in my opinion, is that an arbitrary number of lines can be skipped and only a subset of lines loaded, so memory is allocated only for the loaded lines. This is achieved by mmapping the file and splitting it into lines with very little allocation (skipping all lines results in 34 allocations totalling 3.28 KiB).
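To illustrate the idea, here is a minimal sketch of skipping lines over a memory-mapped file. It is not the package's actual code: the function name `read_line_window` and its keyword arguments are made up for this example, it assumes `\n` line endings, and it stops at raw line strings instead of handing them to a JSON parser.

```julia
using Mmap

# Hypothetical helper (not part of JSONLines.jl): return lines
# skip+1 : skip+nlines of `path` as Strings. Skipped lines are only scanned
# for '\n' in the mapped bytes; Strings are allocated for kept lines only.
function read_line_window(path::AbstractString; skip::Integer = 0,
                          nlines::Integer = typemax(Int))
    bytes = Mmap.mmap(path)      # Vector{UInt8} backed by the file
    n = length(bytes)
    lines = String[]
    start = 1                    # index of the first byte of the current line
    seen = 0                     # number of lines passed so far
    while start <= n && length(lines) < nlines
        nl = findnext(==(UInt8('\n')), bytes, start)
        stop = nl === nothing ? n : nl - 1   # last byte of the current line
        if seen >= skip
            push!(lines, String(bytes[start:stop]))
        end
        seen += 1
        start = (nl === nothing ? n : nl) + 1
    end
    return lines
end
```

Calling `read_line_window("file.jsonl"; skip = 1_000_000, nlines = 10)` would then only build ten strings, each of which could be passed on to the JSON parser.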
Any feedback is welcome! Please let me know if you have ideas for additional functionality (the obvious next step is a writer function). My next step will be to implement a Tables.jl-compatible output instead of returning a DataFrame. Eventually I would like to register the package. Any tips in that direction are also welcome. Thanks!
P.S. Please let me know if this is the wrong category for the post.

EDIT: For updates, see here.

EDIT2: New package for registration is JSONLines.jl

11 Likes

Very cool :+1::blush:

I had never heard of JSON Lines, but it makes perfect sense :+1:

Some interesting names using it: JSON Lines Examples

2 Likes

Looks very interesting!

Would you be open to slightly renaming the package? I was thinking that a name like JSONLines.jl might be more descriptive and recognizable. What do you think?

6 Likes

Thank you! I really like the format especially for large files.

It would be great to have write functionality!

You are completely right. I’ll rename it. :smiley:

4 Likes

JSONL is the file extension, though. I use that format pretty much always… You shouldn’t hold a JSON vector of 13M JSON objects in memory, so JSONL is probably what I work with the most. However, I usually just read the lines I want and then parse them with JSON3.
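A minimal sketch of that "read only the lines I want" approach, using nothing but Base. The helper name `select_lines` is made up for this example, and the parsing step is left out so the snippet has no package dependencies; each returned string could then be handed to a parser, e.g. `JSON3.read(line)`.

```julia
# Hypothetical helper: collect lines first_line:last_line (1-based, inclusive)
# of a file by streaming over it, so the whole file is never held in memory.
function select_lines(path::AbstractString, first_line::Integer, last_line::Integer)
    count = max(0, last_line - first_line + 1)   # empty result for an empty range
    open(path) do io
        # eachline is lazy: drop skips the leading lines, take stops early
        wanted = Iterators.take(Iterators.drop(eachline(io), first_line - 1), count)
        collect(wanted)
    end
end
```

Unlike an mmap-based scan, this still decodes every skipped line into a temporary string, so it trades some allocation for simplicity.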

Sure, but:

  1. JSON Lines is the name of the format in the documentation (http://jsonlines.org/)
  2. JSONLines.jl is very easy to distinguish visually. JSONL.jl looks very similar to JSON.jl.
  3. JSONLines.jl is a much more descriptive package name. As a general rule, we try to encourage package authors to choose more descriptive package names, instead of package names that are purely initialisms.
9 Likes

I am not disagreeing with the assessment, but since it is a pretty widely used file extension, I don’t think it needs the longer name, just as JSON.jl isn’t called JavaScriptObjectNotation.jl, which would be the more descriptive name rather than a pure initialism. We already have a few packages (JSON, JSON2, JSON3) which are visually distinguishable enough. Another consideration is consistency, which should be followed unless there is a strong reason not to.

I was not familiar with the JSON Lines file format, so for me it was easy to miss that extra “L” in JSONL.jl.

4 Likes

This seems more appropriate for “Package announcements” category than for “First steps”, so I moved it there. That’s where people discuss new packages.

I’d also go with JSONLines.jl instead of JSONL.jl. Since the official name is also JSON Lines, there is no reason not to add the extra clarity. :wink:

5 Likes

Thank you! It seems that “JSONLines.jl” is clear to everyone while “JSONL.jl” is too close to “JSON.jl”.

4 Likes

There’s also newline delimited JSON, which uses a completely different extension (.ndjson). In light of that, JSONLines sounds general enough to accommodate both.

Is each row of a JSONL file supposed to have the same structure?

Or can you have different object shapes on different rows?

If it is the same object in each row, it would be great to implement a Tables.jl interface :slight_smile:

On naming, ultimately your package = your decision :slight_smile:

Suggestion…

I see you return a DataFrame. I don’t use DataFrames so that seems like an unnecessary imposition.

You might consider introducing something like

struct JSONL <: AbstractVector{Object}
    objects::Vector{Object}
end

and implement a Tables.jl interface so if we want to drop it into a DataFrame, we can just do

DataFrame(jsonl::JSONL)

If we want a StructArray (which is what I use), we’d just do

StructArray(jsonl)

Another suggestion…

Consider using JSON3 and allow the user to specify the struct type of each row so you can immediately construct a vector of structs. I think this would also be faster.

Just some ideas thinking aloud. I see so much potential for this :slight_smile:

1 Like

I’m working on the Tables.jl interface right now but am struggling a bit with it. I will ask a separate question once I am fully stuck. JSON3 looks good; I will try that next. Basically, I would read each line and let users pass an optional struct for the types, correct?
Thank you very much for the suggestions!

2 Likes

See topic here. Unfortunately, I cannot get StructArray to work with it.

Each line doesn’t have to have the same structure (no schema, à la NoSQL).
I usually do:

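using DataFrames
using JSON3: JSON3, StructType, Mutable
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    # No-argument inner constructor (must carry the struct's own name);
    # the Mutable struct type needs it to create an empty Node and fill the fields
    Node() = new()
end
StructType(::Type{Node}) = Mutable()
# Read every line, then parse each one into a Node
lns = readlines("file.jsonl")
data = DataFrame(JSON3.read(ln, Node) for ln in lns)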

For writing:

# `nodes` is the collection of Node objects to write, one JSON object per line
open("file.jsonl", "w") do io
    for node in nodes
        JSON3.write(io, node)
        write(io, '\n')
    end
end
4 Likes

Thank you for the suggestion. My plan would be that users can supply Node. I guess it is not possible in this case to implement Tables.jl?

Could you give a small example of what file.jsonl could look like so I can play around with it and know exactly what you need?

1 Like