I would appreciate help with my problem. I have data that come in series. After processing the raw input I have about 100 columns and about 10^8 rows per series (and about 100 series in total). So far I have been using the HDF5 format to store them, but I would like to test whether Parquet performs better, especially when I later select rows by filtering on specific column values.
I have in mind a Parquet file with row groups, one per series (the columns and the data format are always the same, but each series might have been taken under different conditions, so the data interpretation may differ). I have tried the Parquet2.jl package, but I'm stuck on how to get separate row groups into a file (or directory). Here is a piece of code that generates some pseudo-data:
using StatsBase   # for sample

function generate_data(N)
    Mmax = 100
    # row 1: multiplicity M, then an (E_j, t_j) pair for each of the Mmax detectors
    hits = zeros(UInt16, 2 * Mmax + 1, N)
    for i in 1:N
        M = rand(1:10)
        hits[1, i] = UInt16(M)
        dets = sample(1:Mmax, M, replace=false)   # which detectors fired in this event
        for d in dets
            hits[2*(d-1)+2, i] = round(UInt16, rand() * 30000.0, RoundDown)
            t = randn() * 1000.0 + 5000.0
            if t < 0
                t = 0.0
            end
            hits[2*(d-1)+3, i] = round(UInt16, t)
        end
    end
    # assemble a NamedTuple of columns: M, E_1, t_1, E_2, t_2, ...
    data = (M = hits[1, :],)
    names = ["E", "t"]
    for j in 1:Mmax
        for k in 1:2
            data = merge(data, (Symbol("$(names[k])_$j") => hits[2*(j-1)+k+1, :],))
        end
    end
    data
end
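For reference, the result is a plain NamedTuple of vectors, which as far as I understand is already a valid Tables.jl column table:

using Tables

tbl = generate_data(1_000)
Tables.istable(tbl)   # true: a NamedTuple of equal-length vectors is a column table
length(keys(tbl))     # 201 columns: M plus (E_j, t_j) for j = 1:100
length(tbl.M)         # 1_000 rows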
And with this function I've tried to append one series after another to a file:
using Parquet2

function write_parquet(filename; k=10, N=1_000_000)
    open(filename, write=true) do io
        fw = Parquet2.FileWriter(io)
        for i in 1:k
            data = generate_data(N)
            # the intent is for each call to become its own row group
            Parquet2.writetable!(fw, data)
        end
        Parquet2.finalize!(fw)
    end
end
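If appending row groups to a single file turns out not to be supported, a fallback I'm considering is one file per series in a directory, written with the high-level Parquet2.writefile; whether such a directory can then be read back as a single dataset is my assumption from the docs, and the file naming is just an example:

using Parquet2

function write_series_dir(dirname; k=10, N=1_000_000)
    isdir(dirname) || mkpath(dirname)
    for i in 1:k
        data = generate_data(N)
        # hypothetical layout: series_1.parquet, series_2.parquet, ...
        Parquet2.writefile(joinpath(dirname, "series_$i.parquet"), data)
    end
end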
The resulting file is about 82 MB for k=1 and about 820 MB for k=10, so it seems that the data are being written. But upon opening
ds = Parquet2.Dataset(filename)
there is only one row group, with 10^6 rows.
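This is how I count them; my understanding from the docs is that the Dataset behaves as a collection of RowGroup objects, so length should give their number (that interpretation may be my mistake):

using DataFrames

length(ds)            # 1, but I expected k row groups
nrow(DataFrame(ds))   # 1_000_000, i.e. only one series' worth of rows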
In principle it should be possible to append data to an unfinished file (e.g. fastparquet's write function has an append keyword that adds a new row group), but can this be achieved with Parquet2.jl (or Parquet.jl)?
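For context, this is roughly the read pattern I'm aiming for: one row group per series, materialized and filtered independently. The per-row-group iteration and the DataFrame(rg) call reflect my reading of the Parquet2 docs (RowGroup implementing the Tables.jl interface), not something I have working yet:

using Parquet2, DataFrames

ds = Parquet2.Dataset(filename)
selected = DataFrame[]
for rg in ds                      # assumption: iterating a Dataset yields its row groups
    df = DataFrame(rg)            # materialize only this row group
    keep = df[df.M .>= 5, :]      # filter on a specific column, e.g. multiplicity
    isempty(keep) || push!(selected, keep)
end
result = isempty(selected) ? DataFrame() : vcat(selected...)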