wrangling large json files

Thank you both!

So far the newline delimitation has not been a big problem. Something like the following behaves as expected, parsing one JSON object per line:

using JSON
using DataFrames

rows = open("my_file.ndjson", "r") do io
    # one JSON object per line; the do-block closes the file for us
    [JSON.parse(line) for line in eachline(io)]
end

DataFrame(rows)

Quite nicely, the DataFrame does not get upset by the presence of nested JSON in some of the columns.
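To illustrate (a toy sketch — the field names are made up, not from my actual data):

```julia
using JSON
using DataFrames

lines = [
    """{"id": 1, "meta": {"lang": "en"}}""",
    """{"id": 2, "meta": {"lang": "fr"}}""",
]

df = DataFrame([JSON.parse(l) for l in lines])

# The nested object simply lands in a Dict-valued column:
df.meta  # Vector of Dict{String,Any}
```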

So, one solution that would replicate the jq workflow would be to save each line to disk as I go through the ndjson — e.g., using CSV.write(...; append = true) — and then feed that to JuliaDB with loadtable() (it should work, right?).
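A minimal sketch of that append-as-you-go idea. The file names are placeholders, and I write a tiny ndjson first just to make the example self-contained; nested fields would presumably need flattening before the CSV step:

```julia
using JSON
using CSV
using DataFrames

# Write a tiny example ndjson (stand-in for the real file)
open("my_file.ndjson", "w") do io
    println(io, JSON.json(Dict("a" => 1, "b" => "x")))
    println(io, JSON.json(Dict("a" => 2, "b" => "y")))
end

# Stream line by line, appending each row to the CSV as we go,
# so the whole file never has to sit in memory at once
open("my_file.ndjson", "r") do io
    for (i, line) in enumerate(eachline(io))
        row = DataFrame([JSON.parse(line)])
        # append = false on the first row so the header gets written once
        CSV.write("my_file.csv", row; append = i > 1)
    end
end
```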

P.S. @lwabeke I’m not sure I understood correctly what you mean by “many entries on each level of the hierarchy”. The data is almost rectangular (fixed schema, 29 fields for each row, a few million rows).