Initial version of my first package: A JSON Lines reader

IMHO, with a small change it could be extended to process Wikidata JSON dumps:

  • The Wikidata JSON dump is huge; the gz-compressed file is > 80 GB:
    • latest-all.json.bz2 05-Aug-2020 11:19 58459544787
    • latest-all.json.gz 05-Aug-2020 05:40 87793345321
  • it is “a single JSON array”
  • the structure is similar to JSON Lines, with these differences:
    • extra first line [
    • extra last line ]
    • extra comma as a line separator: ",\n" or ",\r\n"
[
{"type":"item","id":"Q31","labels":  .... },
{"type":"item","id":"Q8","labels":   .... },
...
]
  • test command check: zcat latest-all.json.gz | head -n3
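
The differences listed above can be handled line by line. Here is a minimal sketch using only Base Julia; `dump_row` and `dump_rows` are hypothetical names for illustration, not functions from the package. It assumes one record per physical line, as in the example above.

```julia
# Hypothetical helper: turn one physical line of the array-style dump into a
# bare JSON object string, or `nothing` for the "[" / "]" lines.
function dump_row(line::AbstractString)
    s = strip(line)                                   # also drops a stray '\r'
    (isempty(s) || s == "[" || s == "]") && return nothing
    return String(endswith(s, ",") ? chop(s) : s)     # remove the trailing comma
end

# Lazily yield only the record lines of a dump-like stream.
dump_rows(io::IO) =
    (r for r in (dump_row(l) for l in eachline(io)) if r !== nothing)

io = IOBuffer("[\n{\"id\":\"Q31\"},\r\n{\"id\":\"Q8\"}\n]\n")
rows = collect(dump_rows(io))
```

Each element of `rows` is then an ordinary JSON object string that any JSON parser can read.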

Because it is a large compressed file (> 82 GB), the key requirements are:

  • compressed file reading
  • thread support ( channels ? ) for filtering … parallel processing
  • writing the filtered result to similar JSON Dump file
  • speed / multi core support …
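
As a rough illustration of the channel idea, here is a Base-only sketch in which one task reads lines while the calling task filters and writes them. `filter_lines` is a hypothetical name, and the predicate is a plain substring check rather than real JSON parsing. Decompression is left out: `input` is assumed to be any `IO`, for example a stream provided by a decompression package.

```julia
# Hypothetical sketch: stream lines from `input`, keep those for which
# `keep(line)` is true, and write them to `output`.
function filter_lines(keep, input::IO, output::IO; buffer::Int = 1024)
    # Producer task; the channel is closed automatically when the block returns.
    lines = Channel{String}(buffer; spawn = true) do ch
        for line in eachline(input)
            put!(ch, line)
        end
    end
    kept = 0
    for line in lines                 # consumer: runs on the calling task
        if keep(line)
            println(output, line)
            kept += 1
        end
    end
    return kept
end

input = IOBuffer("{\"id\":\"Q31\",\"geo\":true}\n{\"id\":\"Q8\"}\n")
output = IOBuffer()
nkept = filter_lines(line -> occursin("\"geo\"", line), input, output)
```

This only overlaps reading with filtering. Real multi-core filtering would need several consumer tasks and some care about output order.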

My use case:

  • pre-processing wikidata JSON dump
  • filtering geodata related items
  • and write a smaller JSON dump OR load to PostGIS database.

IMHO it is not critical, because I have a Golang script, but in the future it could be an interesting use case for Julia (and a good benchmark!).

See for example this test dataset. I do have pretty big files but most are confidential.

Maybe off topic, but has anyone tried to port simdjson to Julia?

I think that it might be better to have this work exactly to the spec for JSON Lines. This would be more in the spirit of a typical Julia package – it does exactly what the name implies.

IMHO, if people need to read JSON files that are similar to, but not the same as, JSON Lines, then they can hack it together using your package and other JSON readers. If you have unexported functions that might help with this, perhaps you can document them.

Makes sense. I guess I can internally expose the character expected before \n so that it can be changed by others.

Thank you for taking the time to put that together. I’ll try it out.

I tried running this code with a slight modification but it does not seem to work as a DataFrames input. Using the file you provided as test input:

using JSON3: JSON3, StructType, Mutable
using DataFrames
mutable struct Node
    nodeID::String
    nodeUserID::String
    parentID::String
    nodeTime::String
    informationID::String
    Node() = new()
end
StructType(::Type{Node}) = Mutable()
lns = readlines("test.jsonl")
data = [JSON3.read(ln, Node) for ln in lns]
DataFrame(data)
julia> data = [JSON3.read(ln, Node) for ln in lns]
5001-element Array{Node,1}:
 Node(#undef, #undef, #undef, #undef, #undef)
 Node("phkmqmzpbv", "ufmgvgoure", "sgdibnesgi", "2019-03-05 13:10:51", "tyaekgxmsr")
 Node("ugtkjodxer", "fhffxqbsoa", #undef, "2019-03-04 14:09:58", "bolwagblhx")
 Node("alwwtfyunw", "qccqocykfm", "orjhhcvomh", "2019-03-05 01:37:58", "cagyezgppo")
 ⋮
 Node("zuwjjgexbl", "opmkvyipxm", "sxavzrxldl", "2019-03-04 19:09:19", "tswwdiktno")
 Node("ubaomyspwd", "jdoescksnv", #undef, "2019-03-05 15:18:43", "dbermznthm")
 Node("optbrwcfli", "trrheeevlx", "ooxfkaspca", "2019-03-04 21:51:06", "vhdzurjxro")
 Node("mlcxmfplei", "ojshhztncj", "ygotmuetnj", "2019-03-04 18:14:03", "rukafhqowm")

julia> DataFrame(data)
ERROR: UndefRefError: access to undefined reference

Any idea what I am doing wrong? (the modification is Status = new() => Node = new() otherwise I get a Node is not callable error)

EDIT: The problem is the #undef values. I can create e.g. a Tables.columntable out of rows 2 and 4.
EDIT2: The following redefinition does work as a Tables.jl input as it avoids undef in construction

using StructTypes
mutable struct Node
    nodeID::Union{String, Missing}
    nodeUserID::Union{String, Missing}
    parentID::Union{String, Missing}
    nodeTime::Union{String, Missing}
    informationID::Union{String, Missing}
    Node() = new(missing, missing, missing, missing, missing)
end
StructTypes.StructType(::Type{Node}) = StructTypes.Mutable()

One thing to remember is that JSON is NoSQL-like in that not every record needs to have the same schema. For example, in the example JSONL I shared, the first line is a header with a different schema from lines 2:end. I don’t remember whether that particular file consistently uses null for missing values or simply omits the key.

You can also check this other example (will be available for 90 days, just need to download and inflate, 78 lines where each is a JSON Vector).

I guess that if the lines do not share a schema, the Tables.jl interface cannot be implemented, but it should now be possible to load such a file. The problem with StructTypes is that #undef is returned for missing values.

So a bunch of updates:

  • Using JSON3.jl for parsing each row
  • Returning the vector of JSON3.Objects since that already works with the Tables.jl interface if the schema allows it. This also removed the DataFrames dependency.
  • New keyword argument: structtype allows users to pass a StructTypes.jl struct to the JSON3.read function for each row. The result may still work with Tables.jl, but not necessarily, because undef is returned if a value is not available in a row. Any suggestions?

Edit:
Credit for the Tables.jl insight goes to @piever!

Julia natively supports missing, so why not fill in those spots with it?

How can I replace all undefs? I could not find any documentation on this. It is already possible for users to pass a struct that is initialized to missing as a workaround.

I guess you can use the isassigned function to check whether a field is undef or not.

EDIT: I’m not sure if that works for structs. It works for arrays.

I’ll try it out

Unfortunately it seems like isassigned only works for arrays. I get errors both for the assigned values and the unassigned ones with

[isassigned(getproperty(data[2], name)) for name in propertynames(data[2])]

Try isdefined. More precisely, try isdefined(data[2], name).
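
A small Base-only illustration of that suggestion follows. `DemoNode` is a hypothetical two-field stand-in for the `Node` type above. The point is that `isdefined` inspects the field slot without reading it, whereas `getproperty` on an undefined field throws before any check can run.

```julia
# Hypothetical stand-in for `Node`, trimmed to two fields.
mutable struct DemoNode
    nodeID::String
    parentID::String
    DemoNode() = new()          # leaves both fields undefined (#undef)
end

node = DemoNode()
node.nodeID = "ugtkjodxer"      # parentID stays undefined

# One `name => Bool` pair per field; no field value is ever read here.
status = [name => isdefined(node, name) for name in fieldnames(DemoNode)]
```

Reading `node.parentID` directly would raise the same `UndefRefError` seen in the `DataFrame(data)` call.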

Thank you! This works but I cannot replace with missing if the type is not Union{TYPE, Missing} even though the struct is mutable.
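
One possible workaround, sketched with Base only, is to leave the struct untouched and copy it into a NamedTuple, substituting `missing` for any undefined field. `RawNode` and `to_row` are hypothetical names. A vector of such NamedTuples is a shape that Tables.jl-style consumers generally accept, though that is not tested here.

```julia
# Hypothetical stand-in with plain String fields (no Union{String, Missing}).
mutable struct RawNode
    nodeID::String
    parentID::String
    RawNode() = new()
end

# Hypothetical helper: copy defined fields, use `missing` for undefined ones.
function to_row(x::T) where {T}
    names = fieldnames(T)
    vals = map(n -> isdefined(x, n) ? getfield(x, n) : missing, names)
    return NamedTuple{names}(vals)
end

node = RawNode()
node.nodeID = "ubaomyspwd"      # parentID stays undefined
row = to_row(node)              # parentID is `missing` in the copy
```

Assigning `missing` to the field itself fails because `missing` cannot be converted to `String`, which is why the copy is needed when the field types are not `Union{String, Missing}`.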

Quick update:
A basic writer function is implemented. As always please let me know if you need more features or something is not working right. I have not yet implemented tests for the writer but it seems to work based on quick tests I did on my machine.

Next I want to implement a chunk reader that lets the user iterate over the file while keeping only one chunk in memory at a time. I think this would be useful for e.g. filtering the data.
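
For readers wondering what chunked iteration could look like, here is a minimal Base-only sketch. It is not the package's planned API; `each_chunk` is a hypothetical name. It groups the lazily read lines into vectors of at most `n` lines, so only one group is materialized at a time.

```julia
# Hypothetical sketch: iterate over `io` in chunks of at most `n` lines.
each_chunk(io::IO, n::Integer) = Iterators.partition(eachline(io), n)

io = IOBuffer("{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n")
chunk_sizes = [length(chunk) for chunk in each_chunk(io, 2)]
```

Each chunk could then be parsed and filtered before the next one is read.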

I have renamed the package to JSONLines.jl and registered it. I guess it will take 3 days for it to be available.
🥳 🥳 🥳

I have been using the JSONL (JSON Lines) format for a few years, thanks to the excellent https://github.com/louischatriot/nedb (a JavaScript JSON Lines datastore). I would love to have something equivalent to nedb in Julia, and I am happy to help make it happen (but be warned that my familiarity with Julia is still limited).