I have about 400G of data in 15 `.csv.gz` files, about 2e9 records with 20 columns in total, with the same schema in each file. I would like to read all of them and save them in a single `.feather` file for further analysis.
I could not figure out how to do this using `DataStreams`. I tried `append = true`, but the file is overwritten rather than appended (MWE below).
I am also wondering whether I can do this directly when the total data does not fit into memory (which is the case here), since `Feather.jl` apparently needs to read in the old data before appending. Or would `mmap` bypass this difficulty? If not, should I use an interim step, e.g. an `SQLite` database? In any case, the MWE is:
```julia
using DataStreams
using CSV
using Feather
using DataTables

# data for the MWE
dt = DataTable(A = 1:10, B = 99:-1:90)
CSV.write("/tmp/test1.csv", dt)
CSV.write("/tmp/test2.csv", dt)

# read and copy first file
src1 = CSV.Source("/tmp/test1.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src1))
Data.stream!(src1, snk)

# read and copy second file, requesting an append
src2 = CSV.Source("/tmp/test2.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src2); append = true)
Data.stream!(src2, snk)

# close
Data.close!(snk)

# data is not appended but overwritten
src_all = Feather.Source("/tmp/test.feather")
```
This is on Julia 0.6-rc1, with released versions of all packages.
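One interim step I could imagine at the shell level: since concatenated gzip streams form a valid gzip stream, the files could be merged into a single CSV first, so only one streaming pass into Feather is needed. A minimal sketch (filenames are hypothetical placeholders; assumes each file starts with a header row that should be kept only once):

```shell
# Merge gzipped CSVs into one plain CSV, keeping only the first header.
zcat part1.csv.gz > all.csv               # first file: keep its header
for f in part2.csv.gz part3.csv.gz; do
  zcat "$f" | tail -n +2 >> all.csv       # remaining files: skip the header
done
```

This doubles the disk footprint temporarily, but avoids the append problem entirely.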