Initial version of my first package: A JSON Lines reader

danielw2904 · August 8, 2020, 7:33am

Hi All!

I have just “finished” implementing an initial version of a JSON Lines reader (JSONLines.jl) and would like to ask for some feedback. It implements a specific style of JSON Lines file used in other implementations (e.g. ClickHouse), which has one JSON object on each line of the file. The actual JSON parsing is done using the LazyJSON.jl library. The most interesting feature, in my opinion, is that an arbitrary number of lines can be skipped and only a subset of lines loaded, so memory is allocated only for the loaded lines. This is achieved using mmap and line parsing that allocates very little (skipping all lines results in 34 allocations totalling 3.28 KiB).
Any feedback is welcome! Please let me know if you have ideas for additional functionality (the obvious next step is a writer function). My next step will be to implement Tables.jl-compatible output instead of returning a DataFrame. Eventually I would like to register the package. Any tips in that direction are also welcome. Thanks!
P.S. Please let me know if this is the wrong category for the post.
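
For readers who want a concrete picture of the mmap-based skipping described above, here is a minimal sketch using only the Mmap standard library. It is not the package's actual implementation: the function name `select_lines` and its keywords `skip` and `nrows` are made up for illustration, and it returns the selected lines as raw strings instead of parsing them.

```julia
using Mmap

# Hypothetical helper (not part of JSONLines.jl): skip the first `skip` lines
# and return up to `nrows` of the following non-empty lines as strings.
function select_lines(path; skip = 0, nrows = typemax(Int))
    # Map the file into memory; nothing is copied until a line is selected.
    bytes = open(io -> Mmap.mmap(io, Vector{UInt8}), path)
    lines = String[]
    n = length(bytes)
    start = 1   # index of the first byte of the current line
    seen = 0    # number of lines scanned so far
    while start <= n && length(lines) < nrows
        stop = findnext(==(UInt8('\n')), bytes, start)
        stop = stop === nothing ? n + 1 : stop   # last line may lack a newline
        if seen >= skip && stop > start
            # Only selected lines are copied out of the mapped buffer.
            push!(lines, String(bytes[start:stop-1]))
        end
        seen += 1
        start = stop + 1
    end
    return lines
end
```

Skipped lines are only scanned for the newline byte, so the allocations come almost entirely from the lines that are actually returned.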

EDIT: Updates see here

EDIT2: New package for registration is JSONLines.jl

anon67531922 · August 8, 2020, 7:49am

Very cool

I had never heard of JSON Lines, but it makes perfect sense

Some interesting names using it: JSON Lines Examples

dilumaluthge · August 8, 2020, 8:33am

Looks very interesting!

Would you be open to slightly renaming the package? I was thinking that a name like JSONLines.jl might be more descriptive and recognizable. What do you think?

danielw2904 · August 8, 2020, 8:33am

Thank you! I really like the format especially for large files.

dilumaluthge · August 8, 2020, 8:34am

It would be great to have write functionality!

danielw2904 · August 8, 2020, 8:35am

You are completely right. I’ll rename it.

Nosferican · August 8, 2020, 8:59am

JSONL is the file extension though. I use that format pretty much always… You shouldn’t work with a JSON Vector of 13M JSON objects in memory, so JSONL is probably what I work with the most. However, I usually just read the lines I want and then parse them through JSON3.

dilumaluthge · August 8, 2020, 9:03am

Sure, but:

JSON Lines is the name of the format in the documentation (http://jsonlines.org/)
JSONLines.jl is very easy to distinguish visually. JSONL.jl looks very similar to JSON.jl.
JSONLines.jl is a much more descriptive package name. As a general rule, we try to encourage package authors to choose more descriptive package names, instead of package names that are purely initialisms.

Nosferican · August 8, 2020, 10:12am

I am not disagreeing with the assessment, but since it is a pretty widely used file extension I don’t think it needs it, just like JSON.jl isn’t JavaScriptObjectNotation.jl, which would be a more descriptive name rather than purely an initialism. We already have a few packages (JSON, JSON2, JSON3) which are visually distinguishable enough. Another consideration is consistency, which we should abide by unless there is a strong reason not to.

CameronBieganek · August 8, 2020, 1:58pm

I was not familiar with the JSON Lines file format, so for me it was easy to miss that extra “L” in JSONL.jl.

anon37204545 · August 8, 2020, 2:25pm

This seems more appropriate for “Package announcements” category than for “First steps”, so I moved it there. That’s where people discuss new packages.

I’d also go with JSONLines.jl instead of JSONL.jl. Since the official name is also JSON Lines, there is no reason not to add the extra clarity.

danielw2904 · August 8, 2020, 2:27pm

Thank you! It seems that “JSONLines.jl” is clear to everyone while “JSONL.jl” is too close to “JSON.jl”.

ToucheSir · August 8, 2020, 4:19pm

There’s also newline delimited JSON, which uses a completely different extension (.ndjson). In light of that, JSONLines sounds general enough to accommodate both.

anon67531922 · August 8, 2020, 7:25pm

Is each row of a JSONL file supposed to have the same structure?

Or can you have different object shapes on different rows?

If it is the same object in each row, it would be great to implement a Tables.jl interface
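
One rough way to answer that question for a given file, assuming each line has already been parsed into a `Dict` (an assumption made here for illustration, as is the helper name `same_shape`):

```julia
# Hypothetical helper: true when every parsed row has exactly the same keys.
function same_shape(rows)
    isempty(rows) && return true
    reference = Set(keys(first(rows)))
    return all(row -> Set(keys(row)) == reference, rows)
end

uniform = [Dict("id" => 1, "name" => "a"), Dict("id" => 2, "name" => "b")]
mixed = [Dict("id" => 1, "name" => "a"), Dict("id" => 2)]
```

Here `same_shape(uniform)` is `true` and `same_shape(mixed)` is `false`; only the first case maps directly onto a table with fixed columns.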

anon67531922 · August 8, 2020, 7:48pm

On naming, ultimately your package = your decision

Suggestion…

I see you return a DataFrame. I don’t use DataFrames so that seems like an unnecessary imposition.

You might consider introducing something like

struct JSONL <: AbstractVector{Object}
    objects::Vector{Object}
end

and implement a Tables.jl interface so if we want to drop it into a DataFrame, we can just do

DataFrame(jsonl::JSONL)

If we want a StructArray (which is what I use), we’d just do

StructArray(jsonl)
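
A sketch of the wrapper idea using only Base, with `Dict{String,Any}` standing in for the `Object` type above and the name `JSONLRows` chosen purely for illustration. The Tables.jl methods would be layered on top of this and are not shown here.

```julia
# Hypothetical wrapper type; `Dict{String,Any}` is an assumed stand-in for `Object`.
struct JSONLRows <: AbstractVector{Dict{String,Any}}
    objects::Vector{Dict{String,Any}}
end

# Minimal read-only AbstractVector interface: size, getindex, and index style.
Base.size(rows::JSONLRows) = size(rows.objects)
Base.getindex(rows::JSONLRows, i::Int) = rows.objects[i]
Base.IndexStyle(::Type{JSONLRows}) = IndexLinear()

rows = JSONLRows([Dict{String,Any}("a" => 1), Dict{String,Any}("a" => 2)])

# Iteration, indexing, and length now come for free from AbstractVector.
total = sum(row["a"] for row in rows)
```

With just these three methods the wrapper behaves like any other vector, which is what lets generic consumers iterate over it row by row.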

Another suggestion…

Consider using JSON3 and allow the user to specify the struct type of each row so you can immediately construct a vector of structs. I think this would also be faster.

Just some ideas thinking aloud. I see so much potential for this

danielw2904 · August 8, 2020, 7:55pm

I’m working on the Tables.jl interface right now but am struggling a bit with it. I will ask a separate question once I am fully stuck. JSON3 looks good; I will try that next. Basically I would read each line and let users pass an optional struct for the types, correct?
Thank you very much for the suggestions!

danielw2904 · August 8, 2020, 10:56pm

See topic here. Unfortunately, I cannot get StructArray to work with it.

Nosferican · August 9, 2020, 1:07am

Each line doesn’t have to have the same structure (no schema à la NoSQL).
I usually do

using DataFrames
using JSON3: JSON3, StructType, Mutable

# Row type; each field is filled from the JSON object on one line.
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    Node() = new()  # empty constructor, needed for the Mutable struct type
end
StructType(::Type{Node}) = Mutable()

lns = readlines("file.jsonl")
data = DataFrame(JSON3.read(ln, Node) for ln in lns)

for writing

# The do-block closes the file even if writing a node throws.
open("file.jsonl", "w") do io
    for node in nodes
        JSON3.write(io, node)
        write(io, '\n')
    end
end

danielw2904 · August 9, 2020, 9:38am

Thank you for the suggestion. My plan would be that users can supply Node. I guess it is not possible in this case to implement Tables.jl?

danielw2904 · August 9, 2020, 10:15am

Could you give a small example of what file.jsonl could look like so I can play around with it and know exactly what you need?
