Easiest way to load a DataFrame from a compressed, newline delimited json file on the cloud?

I’m trying to convert a Python notebook into Julia.

In Python we have:
pd.read_json("gs://bucket/file.json.gz", lines=True, compression="gzip")

I don’t think there’s a way to do this directly in Julia.
The approach I’ve taken is to first download the file and then:

import Pkg; Pkg.add("DataFrames"); Pkg.add("JSON3"); Pkg.add("CodecZlib")
using DataFrames, CodecZlib, JSON3

path = "..../000000000000.json.gz"
df = open(path) do file
    # decompress on the fly and parse each line as a JSON object
    DataFrame(JSON3.read.(eachline(GzipDecompressorStream(file))))
end

But I end up with a KeyError, due to this issue that prevents you from populating a table when the first row has a column that a following row lacks.

Do you have any ideas for other approaches?

You could read the JSON and preprocess it before passing it to DataFrame. If the file is too big to preprocess in memory, you could implement a simple Tables.jl wrapper that adds the missing column to the rows iterator; a sketch of the in-memory approach is below.
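For example, a minimal preprocessing sketch, assuming the decompressed file fits in memory (path as in your snippet): parse every line, take the union of all keys seen, then fill the gaps with missing before constructing the DataFrame.

using DataFrames, CodecZlib, JSON3

df = open(path) do file
    # copy materializes each JSON3.Object into a plain Dict{Symbol,Any}
    rows = [copy(JSON3.read(line)) for line in eachline(GzipDecompressorStream(file))]
    # union of every key seen across all rows
    allkeys = reduce(union, keys.(rows))
    # one column per key, with `missing` wherever a row lacks that key
    DataFrame(Dict(k => [get(row, k, missing) for row in rows] for k in allkeys))
end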

We’re actively thinking about and working on a better overall Tables.jl solution here, but for now, DataFrames.jl has this functionality in push!, so in your case, something like:

df = open(path) do file
    df = DataFrame()
    for line in eachline(GzipDecompressorStream(file))
        # cols=:union adds new columns as needed instead of raising a KeyError
        push!(df, JSON3.read(line); cols=:union)
    end
    df  # return the accumulated DataFrame (otherwise the do block returns nothing)
end
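
With cols=:union, push! widens the table as it goes: a row that introduces a new column back-fills earlier rows with missing, and a row that lacks an existing column gets missing in it, so the rows don’t need a consistent schema.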