IMHO:
with a small change it could be extended to process Wikidata JSON dumps
- The Wikidata JSON dump is huge; the compressed gz is > 80 GB:
latest-all.json.bz2 05-Aug-2020 11:19 58459544787
latest-all.json.gz 05-Aug-2020 05:40 87793345321
- it is “a single JSON array”
- the structure is similar to JSONLines, with these extra details (a small line-handling sketch follows this list):
- extra first line
[
- extra last line
]
- extra comma as a line separator:
",\n"
or",\r\n"
- so the whole file looks like:
[
{"type":"item","id":"Q31","labels": .... },
{"type":"item","id":"Q8","labels": .... },
...
]
- a quick test command:
zcat latest-all.json.gz | head -n3
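A minimal sketch of the line handling, in Go since that is what my script uses (names are illustrative): skip the bracket lines, trim the trailing comma, and every remaining line is a standalone JSON entity.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeLine turns one raw dump line into a standalone JSON document:
// the "[" / "]" lines are skipped and the trailing "," separator is trimmed.
func normalizeLine(line string) (entity string, ok bool) {
	line = strings.TrimRight(line, " \t\r\n")
	if line == "" || line == "[" || line == "]" {
		return "", false
	}
	return strings.TrimSuffix(line, ","), true
}

func main() {
	for _, raw := range []string{"[", `{"type":"item","id":"Q31"},`, "]"} {
		if entity, ok := normalizeLine(raw); ok {
			fmt.Println(entity) // prints: {"type":"item","id":"Q31"}
		}
	}
}
```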
Because it is a large compressed file ( > 82 GB ), the key requirements are (a rough pipeline sketch follows this list):
- compressed file reading
- thread support ( channels ? ) for filtering … parallel processing
- writing the filtered result to a similar JSON dump file
- speed / multi-core support …
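Roughly the pipeline shape I mean, as a Go sketch rather than my actual script (the file name, the worker count and the keep() predicate are placeholders): stream the gzip file line by line, fan the lines out over a channel to worker goroutines for filtering, and let a single writer goroutine emit the survivors.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"os"
	"strings"
	"sync"
)

// keep is a placeholder predicate; a geodata version is sketched further below.
func keep(entity string) bool {
	return strings.Contains(entity, `"P625"`)
}

func main() {
	f, err := os.Open("latest-all.json.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer gz.Close()

	lines := make(chan string, 1024)
	results := make(chan string, 1024)

	// worker pool: normalize and filter entities in parallel
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range lines {
				entity := strings.TrimSuffix(line, ",")
				if entity == "" || entity == "[" || entity == "]" {
					continue // skip the array brackets
				}
				if keep(entity) {
					results <- entity
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	// single writer goroutine owns the output (here: JSON Lines on stdout)
	done := make(chan struct{})
	go func() {
		w := bufio.NewWriter(os.Stdout)
		for entity := range results {
			w.WriteString(entity)
			w.WriteByte('\n')
		}
		w.Flush()
		close(done)
	}()

	// reader: entity lines can be several MB, so enlarge the scanner buffer
	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024)
	for sc.Scan() {
		lines <- sc.Text()
	}
	close(lines)
	<-done
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```

Writing the output in the original dump format instead of JSON Lines would only change the writer goroutine (emit "[", the ",\n" separators and a closing "]").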
My use case:
- pre-processing the Wikidata JSON dump
- filtering geodata-related items
- and writing a smaller JSON dump OR loading it into a PostGIS database ( a possible geodata filter is sketched below ).
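For the geodata filter itself, one possible predicate, purely illustrative and assuming the standard entity layout where statements sit under "claims" keyed by property ID (P625 = coordinate location):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasCoordinate reports whether an entity document carries a P625
// ("coordinate location") statement under its "claims" map.
func hasCoordinate(entity string) bool {
	var doc struct {
		Claims map[string]json.RawMessage `json:"claims"`
	}
	if err := json.Unmarshal([]byte(entity), &doc); err != nil {
		return false
	}
	_, ok := doc.Claims["P625"]
	return ok
}

func main() {
	fmt.Println(hasCoordinate(`{"id":"Q64","claims":{"P625":[{}]}}`)) // true
	fmt.Println(hasCoordinate(`{"id":"Q42","claims":{"P31":[{}]}}`))  // false
}
```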
IMHO: it is not critical, because I already have a Golang script … but in the future it can be an interesting use case for Julia ( and a good benchmark ! )