Easiest way to load a DataFrame from a compressed, newline delimited json file on the cloud?

I’m trying to convert a Python notebook into Julia.

In Python we have:
pd.read_json("gs://bucket/file.json.gz", lines=True, compression="gzip")

I don’t think there’s a way to do this directly in Julia.
The approach I’ve taken is to first download the file and then:

import Pkg; Pkg.add("DataFrames"); Pkg.add("JSON3"); Pkg.add("CodecZlib")
using DataFrames, CodecZlib, JSON3

path = "..../000000000000.json.gz"
df = open(path) do file
    # decompress on the fly and parse each line as a JSON object
    DataFrame(JSON3.read.(eachline(GzipDecompressorStream(file))))
end

But I end up with a KeyError, due to this issue that prevents you from populating a table when the first row has a column that a following row lacks.

Do you have any ideas for other approaches?

You could read the JSON and preprocess it before passing it to DataFrame. If the file is too big to preprocess in memory, you could implement a simple Tables.jl wrapper that adds the missing column to the rows iterator; a sketch of the in-memory approach is below.
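For example, a minimal preprocessing sketch, assuming the decompressed file fits in memory (path as in your snippet): parse every line, take the union of all keys seen, then fill the gaps with missing before constructing the DataFrame.

using DataFrames, CodecZlib, JSON3

df = open(path) do file
    # copy materializes each JSON3.Object into a plain Dict{Symbol,Any}
    rows = [copy(JSON3.read(line)) for line in eachline(GzipDecompressorStream(file))]
    # union of every key seen across all rows
    allkeys = reduce(union, keys.(rows))
    # one column per key, with `missing` wherever a row lacks that key
    DataFrame(Dict(k => [get(row, k, missing) for row in rows] for k in allkeys))
end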

We’re actively thinking about and working on a better overall Tables.jl solution here, but for now, DataFrames.jl has this functionality in push!, so in your case, something like:

df = open(path) do file
    df = DataFrame()
    for line in eachline(GzipDecompressorStream(file))
        # cols=:union adds new columns as needed instead of raising a KeyError
        push!(df, JSON3.read(line); cols=:union)
    end
    df  # return the accumulated DataFrame (otherwise the do block returns nothing)
end
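
With cols=:union, push! widens the table as it goes: a row that introduces a new column back-fills earlier rows with missing, and a row that lacks an existing column gets missing in it, so the rows don’t need a consistent schema.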