I would appreciate help with my problem. I have data that come in series. After processing the raw input I have about 100 columns and about 10^8 rows per series (and about 100 series in total). So far I have been using the HDF5 format to store them, but I would like to test whether Parquet performs better, especially when I later select rows by filtering on specific column values.
I have in mind a Parquet file with row groups, one per series (the columns and the data format are always the same, but each series might have been taken under different conditions, so the data interpretation may differ). I have tried the Parquet2.jl package, but I'm stuck on how to get separate row groups into a file (or directory). Here is a piece of code that generates some pseudo-data:
using StatsBase   # for sample

function generate_data(N)
    Mmax = 100
    # row 1: multiplicity M, then an (E_j, t_j) pair for each of the Mmax detectors
    hits = zeros(UInt16, 2 * Mmax + 1, N)
    for i in 1:N
        M = rand(1:10)
        hits[1, i] = UInt16(M)
        dets = sample(1:Mmax, M, replace=false)   # which detectors fired in this event
        for d in dets
            hits[2*(d-1)+2, i] = round(UInt16, rand() * 30000.0, RoundDown)
            t = randn() * 1000.0 + 5000.0
            if t < 0
                t = 0.0
            end
            hits[2*(d-1)+3, i] = round(UInt16, t)
        end
    end
    # assemble a NamedTuple of columns: M, E_1, t_1, E_2, t_2, ...
    data = (M = hits[1, :],)
    names = ["E", "t"]
    for j in 1:Mmax
        for k in 1:2
            data = merge(data, (Symbol("$(names[k])_$j") => hits[2*(j-1)+k+1, :],))
        end
    end
    data
end
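For reference, the result is a plain NamedTuple of vectors, which as far as I understand is already a valid Tables.jl column table:

using Tables

tbl = generate_data(1_000)
Tables.istable(tbl)   # true: a NamedTuple of equal-length vectors is a column table
length(keys(tbl))     # 201 columns: M plus (E_j, t_j) for j = 1:100
length(tbl.M)         # 1_000 rows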
And with this function I've tried to append one series after another to a file:
using Parquet2

function write_parquet(filename; k=10, N=1_000_000)
    open(filename, write=true) do io
        fw = Parquet2.FileWriter(io)
        for i in 1:k
            data = generate_data(N)
            # the intent is for each call to become its own row group
            Parquet2.writetable!(fw, data)
        end
        Parquet2.finalize!(fw)
    end
end
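If appending row groups to a single file turns out not to be supported, a fallback I'm considering is one file per series in a directory, written with the high-level Parquet2.writefile; whether such a directory can then be read back as a single dataset is my assumption from the docs, and the file naming is just an example:

using Parquet2

function write_series_dir(dirname; k=10, N=1_000_000)
    isdir(dirname) || mkpath(dirname)
    for i in 1:k
        data = generate_data(N)
        # hypothetical layout: series_1.parquet, series_2.parquet, ...
        Parquet2.writefile(joinpath(dirname, "series_$i.parquet"), data)
    end
end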
The resulting file is about 82 MB for k=1 and about 820 MB for k=10, so it seems that the data are being written. But upon opening
ds = Parquet2.Dataset(filename)
there is only one row group, with 10^6 rows.
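This is how I count them; my understanding from the docs is that the Dataset behaves as a collection of RowGroup objects, so length should give their number (that interpretation may be my mistake):

using DataFrames

length(ds)            # 1, but I expected k row groups
nrow(DataFrame(ds))   # 1_000_000, i.e. only one series' worth of rows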
In principle it should be possible to append data to an unfinished file (e.g. fastparquet's write function has an append keyword that adds a new row group), but can this be achieved with Parquet2.jl (or Parquet.jl)?
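For context, this is roughly the read pattern I'm aiming for: one row group per series, materialized and filtered independently. The per-row-group iteration and the DataFrame(rg) call reflect my reading of the Parquet2 docs (RowGroup implementing the Tables.jl interface), not something I have working yet:

using Parquet2, DataFrames

ds = Parquet2.Dataset(filename)
selected = DataFrame[]
for rg in ds                      # assumption: iterating a Dataset yields its row groups
    df = DataFrame(rg)            # materialize only this row group
    keep = df[df.M .>= 5, :]      # filter on a specific column, e.g. multiplicity
    isempty(keep) || push!(selected, keep)
end
result = isempty(selected) ? DataFrame() : vcat(selected...)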