I have about 400 GB of data in 15 `.csv.gz` files — roughly 2e9 records with 20 columns, same schema in each file. I would like to read all of them and save the result in a single `.feather` file for further analysis.
I could not figure out how to do this using DataStreams. I tried `append = true`, but the file is overwritten instead of appended (MWE below).
Also, I am wondering whether I can do this directly when the total data does not fit into memory (which is the case here), since Feather.jl apparently needs to read in the old data before appending. Would `mmap` bypass this difficulty? If not, should I use an interim step, e.g. an SQLite database? In any case, the MWE is:
```julia
using DataStreams
using CSV
using Feather
using DataTables

# data for the MWE
dt = DataTable(A = 1:10, B = 99:-1:90)
CSV.write("/tmp/test1.csv", dt)
CSV.write("/tmp/test2.csv", dt)

# read and copy first file
src1 = CSV.Source("/tmp/test1.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src1))
Data.stream!(src1, snk)

# read and copy second file
src2 = CSV.Source("/tmp/test2.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src2); append = true)
Data.stream!(src2, snk)

# close
Data.close!(snk)

# data is not appended but overwritten
src_all = Feather.Source("/tmp/test.feather")
```
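For context, the loop I would eventually like to run over the real files is something like the following (an untested sketch using the same Source/Sink pattern as the MWE; `files` stands in for the actual paths of my 15 `.csv.gz` files):

```julia
# hypothetical list of input files
files = ["data$(i).csv.gz" for i in 1:15]

# stream the first file into a fresh sink
src = CSV.Source(files[1])
snk = Feather.Sink("all.feather", Data.schema(src))
Data.stream!(src, snk)

# append the remaining files one by one
for f in files[2:end]
    src = CSV.Source(f)
    snk = Feather.Sink("all.feather", Data.schema(src); append = true)
    Data.stream!(src, snk)
end

Data.close!(snk)
```

This is what I expected `append = true` to make possible, but as the MWE shows, each pass overwrites rather than appends.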
Julia 0.6-rc1, released versions of all packages.