Hi All!
I have just “finished” implementing an initial version of a JSON Lines reader (JSONLines.jl) and would like to ask for some feedback. It implements a specific style of JSON Lines file used in other implementations (e.g. ClickHouse), which has one JSON object on each line of the file. The actual JSON parsing is done using the LazyJSON.jl library. The most interesting feature, in my opinion, is that an arbitrary number of lines can be skipped and only a subset of lines loaded, so memory is allocated only for the loaded lines. This is achieved using mmap and line parsing done with little allocation (skipping all lines results in 34 allocations totalling 3.28 KiB).
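To make the skip-and-subset idea concrete, here is a minimal sketch that uses only the Mmap standard library. It is not the package's actual implementation: the function name `select_lines` and its `skip`/`nrows` keywords are hypothetical, and the selected lines are returned as raw strings without any JSON parsing.

```julia
using Mmap

# Hypothetical helper (not the JSONLines.jl API): return up to `nrows` lines
# as strings after skipping the first `skip` lines. Blank lines count toward
# `skip` but are never returned.
function select_lines(path::AbstractString; skip::Integer = 0, nrows::Integer = typemax(Int))
    open(path, "r") do io
        buf = Mmap.mmap(io)            # Vector{UInt8} backed by the file
        n = length(buf)
        lines = String[]
        start = 1                      # index of the first byte of the current line
        seen = 0                       # number of lines passed so far
        while start <= n && length(lines) < nrows
            stop = findnext(==(UInt8('\n')), buf, start)
            stop = stop === nothing ? n + 1 : stop   # last line may lack a newline
            if seen >= skip && stop > start
                # Only the bytes of a selected line are copied out of the mapping.
                push!(lines, String(buf[start:stop-1]))
            end
            seen += 1
            start = stop + 1
        end
        return lines
    end
end
```

Skipped lines are only scanned for the newline byte, so nothing is copied for them. A real reader would then hand each selected string (or byte range) to the JSON parser.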
Any feedback is welcome! Please let me know if you have ideas for additional functionality (the obvious next step is a writer function). My next step will be to implement a Tables.jl-compatible output instead of returning a DataFrame. Eventually I would like to register the package. Any tips in that direction are also welcome. Thanks!
P.S. Please let me know if this is the wrong category for the post.
EDIT: Updates see here
EDIT2: New package for registration is JSONLines.jl
Very cool
I had never heard of JSON Lines, but it makes perfect sense
Some interesting names using it: JSON Lines Examples
Looks very interesting!
Would you be open to slightly renaming the package? I was thinking that a name like JSONLines.jl might be more descriptive and recognizable. What do you think?
Thank you! I really like the format especially for large files.
It would be great to have write functionality!
You are completely right. I’ll rename it.
JSONL is the file extension though. I use that format pretty much always… You shouldn’t work with a JSON vector of 13M JSON objects in memory, so JSONL is probably what I work with the most. However, I usually just read the lines I want and then parse them with JSON3.
I am not disagreeing with the assessment, but since it is a pretty widely used file extension I don’t think the longer name is needed, just as JSON.jl isn’t called JavaScriptObjectNotation.jl, which would be more descriptive than a pure initialism. We already have a few packages (JSON, JSON2, JSON3) which are visually distinguishable enough. Another consideration is consistency, which we should abide by unless there is a strong reason not to.
I was not familiar with the JSON Lines file format, so for me it was easy to miss that extra “L” in JSONL.jl.
This seems more appropriate for “Package announcements” category than for “First steps”, so I moved it there. That’s where people discuss new packages.
I’d also go with JSONLines.jl instead of JSONL.jl. Since the official name is also JSON Lines, there is no reason not to add the extra clarity.
Thank you! It seems that “JSONLines.jl” is clear to everyone while “JSONL.jl” is too close to “JSON.jl”.
There’s also newline delimited JSON, which uses a completely different extension (.ndjson). In light of that, JSONLines sounds general enough to accommodate both.
Is each row of a JSONL file supposed to have the same structure?
Or can you have different object shapes on different rows?
If it is the same object in each row, it would be great to implement a Tables.jl interface
On naming, ultimately your package = your decision
Suggestion…
I see you return a DataFrame. I don’t use DataFrames so that seems like an unnecessary imposition.
You might consider introducing something like
    struct JSONL <: AbstractVector{Object}
        objects::Vector{Object}
    end
and implement a Tables.jl interface so if we want to drop it into a DataFrame, we can just do
DataFrame(jsonl::JSONL)
If we want a StructArray (which is what I use), we’d just do
StructArray(jsonl)
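As a rough sketch of what that could look like, the following declares a row-oriented Tables.jl source. It assumes Tables.jl is installed and that every row has the same fields. The type name `JSONLTable` is hypothetical, the rows are hand-written NamedTuples standing in for parsed lines, and unlike the suggestion above the wrapper is not an `AbstractVector` subtype, to keep the example short.

```julia
using Tables

# Hypothetical wrapper: one NamedTuple per parsed line, all with the same fields.
struct JSONLTable{T<:NamedTuple}
    rows::Vector{T}
end

# Declare the type as a row-oriented Tables.jl source.
Tables.istable(::Type{<:JSONLTable}) = true
Tables.rowaccess(::Type{<:JSONLTable}) = true
Tables.rows(t::JSONLTable) = t.rows   # a vector of NamedTuples is a valid row iterator

# Stand-in data; a real reader would build these rows from the parsed lines.
t = JSONLTable([(a = 1, b = "x"), (a = 2, b = "y")])

# Any Tables.jl sink can now consume `t`; `Tables.columntable` is the simplest one.
cols = Tables.columntable(t)
```

With these three methods in place, sinks such as `DataFrame(t)` should accept the object without the reader depending on DataFrames itself.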
Another suggestion…
Consider using JSON3 and allow the user to specify the struct type of each row so you can immediately construct a vector of structs. I think this would also be faster.
Just some ideas thinking aloud. I see so much potential for this
I’m working on the Tables.jl interface right now but am struggling a bit with it. I will ask a separate question once I am fully stuck. JSON3 looks good; I will try that next. Basically, I would read each line and let users pass an optional struct for the types, correct?
Thank you very much for the suggestions!
See topic here. Unfortunately, I cannot get StructArray to work with it.
Each line doesn’t have to have the same structure (no schema à la NoSQL).
I usually do
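    using DataFrames
    using JSON3, StructTypes

    mutable struct Node
        nodeID::String
        nodeUserID::String
        parentID::String
        nodeTime::String
        informationID::String
        Node() = new()  # empty constructor, required by the Mutable struct type
    end
    # Recent JSON3 versions take struct-type declarations from StructTypes.jl
    StructTypes.StructType(::Type{Node}) = StructTypes.Mutable()

    lns = readlines("file.jsonl")
    data = DataFrame(JSON3.read(ln, Node) for ln in lns)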
For writing:
    open("file.jsonl", "w") do io
        for node in nodes
            JSON3.write(io, node)
            write(io, '\n')
        end
    end
Thank you for the suggestion. My plan would be that users can supply Node. I guess it is not possible in this case to implement Tables.jl?
Could you give a small example of what file.jsonl could look like so I can play around with it and know exactly what you need?
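For illustration only, a file matching the `Node` fields above might look like the two records below. The field names come from the struct in the thread; every value is made up and does not come from the poster's data. The snippet writes the sample to a temporary path and reads it back line by line.

```julia
# Hypothetical records: only the field names are taken from the `Node` struct.
sample = """
{"nodeID":"n1","nodeUserID":"u1","parentID":"root","nodeTime":"2020-01-01T00:00:00Z","informationID":"i1"}
{"nodeID":"n2","nodeUserID":"u2","parentID":"n1","nodeTime":"2020-01-01T00:05:00Z","informationID":"i1"}
"""

# Write the sample to a temporary file and read it back, one JSON object per line.
path = joinpath(mktempdir(), "file.jsonl")
write(path, sample)
lns = readlines(path)
```

Each element of `lns` is one complete JSON object, so it can be passed straight to a parser such as `JSON3.read(ln, Node)`.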