Initial version of my first package: A JSON Lines reader

ImreSamu · August 9, 2020, 11:18am

IMHO:
with a small change it could be extended to process Wikidata JSON dumps

The Wikidata JSON dump is huge; the gzip-compressed file is > 80GB:
- latest-all.json.bz2 05-Aug-2020 11:19 58459544787
- latest-all.json.gz 05-Aug-2020 05:40 87793345321
It is “a single JSON array”.
The structure is similar to JSON Lines, with these extras:
- extra first line [
- extra last line ]
- extra comma as a line separator: ",\n" or ",\r\n"

[
{"type":"item","id":"Q31","labels":  .... },
{"type":"item","id":"Q8","labels":   .... },
...
]

test command check: zcat latest-all.json.gz | head -n3
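
A minimal sketch of how the three differences listed above could be handled line by line, using only Base Julia. The names `dump_record` and `records` are hypothetical and not part of any package; the result is the raw JSON text of each record, which could then be handed to any JSON parser.

```julia
# Hypothetical helper: turn one line of an array-style dump into the raw JSON
# text of a record, or `nothing` for the "[" / "]" wrapper lines and blanks.
function dump_record(line::AbstractString)
    s = strip(line)                          # also drops a stray '\r'
    (isempty(s) || s == "[" || s == "]") && return nothing
    return String(rstrip(s, ','))            # remove the separator comma
end

# Lazily iterate over the records of any IO (file, pipe, decompression stream).
records(io::IO) =
    (r for r in (dump_record(l) for l in eachline(io)) if r !== nothing)

# Possible usage, assuming `zcat` is on the PATH (same check as above):
# open(`zcat latest-all.json.gz`) do io
#     for r in Iterators.take(records(io), 3)
#         println(r)
#     end
# end
```

Because the records are produced lazily, only one line needs to be held in memory at a time.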

Because it is a large compressed file ( > 82G ), the key requirements are:

- compressed file reading
- thread support (channels?) for filtering and parallel processing
- writing the filtered result to a similar JSON dump file
- speed / multi-core support …

My use case:

- pre-processing the Wikidata JSON dump
- filtering geodata-related items
- writing a smaller JSON dump OR loading into a PostGIS database
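
A rough, single-threaded sketch of the filter-and-rewrite step in Base Julia. `filter_dump` is a hypothetical name, and the predicate in the usage comment (a substring test for the coordinate property P625) is only an assumption about what "geodata related" might mean; a real filter would parse each record instead. Parallel processing is not addressed here.

```julia
# Hypothetical helper: copy the records accepted by `keep` from `input` to
# `output`, preserving the "[" ... "]" layout with ",\n" separators.
function filter_dump(keep, input::IO, output::IO)
    println(output, "[")
    first_record = true
    for line in eachline(input)
        s = rstrip(strip(line), ',')         # raw JSON text of one record
        (isempty(s) || s == "[" || s == "]") && continue
        keep(s) || continue
        first_record || print(output, ",\n") # separator before later records
        print(output, s)
        first_record = false
    end
    first_record || println(output)          # newline after the last record
    println(output, "]")
    return output
end

# Possible usage (file names are placeholders):
# open("latest-all.json") do inp
#     open("geo-only.json", "w") do out
#         filter_dump(s -> occursin("\"P625\"", s), inp, out)
#     end
# end
```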

IMHO it is not critical, because I have a Golang script, but in the future it could be an interesting use case for Julia (and a good benchmark!)

Nosferican · August 9, 2020, 2:32pm

See for example this test dataset. I do have pretty big files but most are confidential.

tlienart · August 9, 2020, 3:08pm

Maybe off topic, but has anyone tried to port simdjson to Julia?

tbeason · August 9, 2020, 3:11pm

I think that it might be better to have this work exactly to the spec for JSON Lines. This would be more in the spirit of a typical Julia package – it does exactly what the name implies.

IMHO, if people need to read JSON files that are similar to, but not the same as, JSON Lines, then they can hack it together using your package and other JSON readers. If you have unexported functions that might help in this regard, perhaps you can document them.

danielw2904 · August 9, 2020, 4:28pm

Makes sense. I guess I can internally expose the character expected before \n so that it can be changed by others.

danielw2904 · August 9, 2020, 6:18pm

Thank you for taking the time to put that together. I’ll try it out.

danielw2904 · August 10, 2020, 2:23pm

I tried running this code with a slight modification but it does not seem to work as a DataFrames input. Using the file you provided as test input:

using JSON3: JSON3, StructType, Mutable
using DataFrames
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    Node() = new()
end
StructType(::Type{Node}) = Mutable()
lns = readlines("test.jsonl")
data = [JSON3.read(ln, Node) for ln in lns]
DataFrame(data)

julia> data = [JSON3.read(ln, Node) for ln in lns]
5001-element Array{Node,1}:
 Node(#undef, #undef, #undef, #undef, #undef)
 Node("phkmqmzpbv", "ufmgvgoure", "sgdibnesgi", "2019-03-05 13:10:51", "tyaekgxmsr")
 Node("ugtkjodxer", "fhffxqbsoa", #undef, "2019-03-04 14:09:58", "bolwagblhx")
 Node("alwwtfyunw", "qccqocykfm", "orjhhcvomh", "2019-03-05 01:37:58", "cagyezgppo")
 ⋮
 Node("zuwjjgexbl", "opmkvyipxm", "sxavzrxldl", "2019-03-04 19:09:19", "tswwdiktno")
 Node("ubaomyspwd", "jdoescksnv", #undef, "2019-03-05 15:18:43", "dbermznthm")
 Node("optbrwcfli", "trrheeevlx", "ooxfkaspca", "2019-03-04 21:51:06", "vhdzurjxro")
 Node("mlcxmfplei", "ojshhztncj", "ygotmuetnj", "2019-03-04 18:14:03", "rukafhqowm")

julia> DataFrame(data)
ERROR: UndefRefError: access to undefined reference

Any idea what I am doing wrong? (the modification is Status = new() => Node = new() otherwise I get a Node is not callable error)

EDIT: The problem is the #undef values. I can create e.g. a Tables.columntable out of rows 2 and 4.
EDIT2: The following redefinition does work as a Tables.jl input, as it avoids undef on construction:

using StructTypes
mutable struct Node
    nodeID::Union{String, Missing}
    nodeUserID::Union{String, Missing}
    parentID::Union{String, Missing}
    nodeTime::Union{String, Missing}
    informationID::Union{String, Missing}
    Node() = new(missing, missing, missing, missing, missing)
end
StructTypes.StructType(::Type{Node}) = StructTypes.Mutable()

Nosferican · August 10, 2020, 3:11pm

One thing to remember is that JSON is à la NoSQL in that not every record has the same schema. For example, in the example JSONL I shared, the first line is a header with a different schema from lines 2:end. I don’t remember whether that particular file consistently uses null for missing values or simply omits them.

You can also check this other example (will be available for 90 days, just need to download and inflate, 78 lines where each is a JSON Vector).

danielw2904 · August 10, 2020, 4:34pm

I guess that if the lines do not share the same schema, the Tables.jl interface cannot be implemented, but it should now be possible to load such a file. The problem with StructTypes is that #undef is returned for missing values.

danielw2904 · August 10, 2020, 4:40pm

So a bunch of updates:

- Using JSON3.jl for parsing each row.
- Returning the vector of JSON3.Objects, since that already works with the Tables.jl interface if the schema allows it. This also removed the DataFrames dependency.
- New keyword argument: structtype allows users to pass a StructTypes.jl struct to the JSON3.read function for each row (the result may still work with Tables.jl, but not necessarily, because undef is returned if a value is not available in a row => any suggestions?)

Edit:
Credit for the Tables.jl insight goes to @piever!

tbeason · August 10, 2020, 4:41pm

Julia natively supports missing, so why not fill in those spots with it?

danielw2904 · August 10, 2020, 5:00pm

How can I replace all undefs? I could not find any documentation on this. It is already possible for users to pass a struct that is initialized to missing as a workaround.

dilumaluthge · August 10, 2020, 5:33pm

I guess you can use the isassigned function to check whether a field is undef or not.

EDIT: I’m not sure if that works for structs. It works for arrays.

danielw2904 · August 10, 2020, 7:46pm

I’ll try it out

danielw2904 · August 11, 2020, 6:25am

Unfortunately it seems like isassigned only works for arrays. I get errors both for the assigned values and the unassigned ones with

[isassigned(getproperty(data[2], name)) for name in propertynames(data[2])]

thofma · August 11, 2020, 8:30am

Try isdefined. More precisely, try isdefined(data[2], name).
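
A small illustration of why this works where the isassigned attempt did not: isdefined takes the object and the field name, so the possibly undefined value is never read, whereas getproperty already throws on an unassigned field. The struct `DemoNode` is a made-up two-field stand-in for the Node type in this thread.

```julia
# Made-up stand-in for the thread's Node type, reduced to two fields.
mutable struct DemoNode
    nodeID::String
    parentID::String
    DemoNode() = new()      # leaves both fields undefined (#undef)
end

n = DemoNode()
n.nodeID = "phkmqmzpbv"     # assign only one of the two fields

# One Bool per field: true where a value has been assigned.
defined = [isdefined(n, name) for name in fieldnames(DemoNode)]
```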

danielw2904 · August 11, 2020, 9:46am

Thank you! This works but I cannot replace with missing if the type is not Union{TYPE, Missing} even though the struct is mutable.
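
The assignment fails because a field declared as String cannot hold `missing`, whether or not the struct is mutable. One possible workaround, sketched here under the assumption that copying each record is acceptable, is to convert every struct into a NamedTuple and substitute `missing` there. `SparseNode` and `to_row` are hypothetical names; whether a vector of such rows is accepted by a given Tables.jl consumer has not been verified here.

```julia
# Made-up stand-in for a struct filled by a Mutable() StructType.
mutable struct SparseNode
    nodeID::String
    parentID::String
    SparseNode() = new()    # fields start out undefined (#undef)
end

# Hypothetical helper: build a NamedTuple from any struct, substituting
# `missing` for fields that were never assigned.
function to_row(x::T) where {T}
    names = fieldnames(T)
    values = map(name -> isdefined(x, name) ? getfield(x, name) : missing, names)
    return NamedTuple{names}(values)
end

n = SparseNode()
n.nodeID = "ugtkjodxer"
row = to_row(n)             # nodeID is kept, parentID becomes missing
```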

danielw2904 · August 12, 2020, 11:18pm

Quick update:
A basic writer function is implemented. As always please let me know if you need more features or something is not working right. I have not yet implemented tests for the writer but it seems to work based on quick tests I did on my machine.

Next I want to implement a chunk reader that lets the user iterate over the file while keeping only one chunk in memory at a time. I think this would be useful for e.g. filtering the data.
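
One way such a chunk reader might look, as a sketch using only Base Julia and not the package's actual implementation: `Iterators.partition` over `eachline` yields vectors of at most `n` raw lines, and only the current chunk is materialized. The name `chunks` is hypothetical.

```julia
# Hypothetical helper: lazily yield vectors of at most `n` lines from `io`.
chunks(io::IO, n::Integer) = Iterators.partition(eachline(io), n)

# Example: count the lines per chunk in a small in-memory "file".
io = IOBuffer("{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n")
sizes = [length(chunk) for chunk in chunks(io, 2)]
```

Each chunk is a plain vector of strings, so it can be filtered or parsed row by row before the next chunk is read.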

danielw2904 · August 13, 2020, 12:21pm

I have renamed the package to JSONLines.jl and registered it. I guess it will take 3 days for it to be available.

widged · November 15, 2020, 12:09pm

I have been using the jsonl (json lines) format for a few years, thanks to the excellent https://github.com/louischatriot/nedb (a javascript json lines datastore). I would love to have something equivalent (to nedb) in Julia… happy to help make it happen (but be warned that my familiarity with Julia is still limited)
