Concatenate CSV data into Feather using DataStreams

I have about 400 GB of data in 15 .csv.gz files, roughly 2e9 records with 20 columns, with the same schema in each file. I would like to read all of them and save the result in a single .feather file for further analysis.

I could not figure out how to do this using DataStreams: I tried append = true, but the file is overwritten instead of appended (MWE below).

Also, I am wondering whether this is possible at all when the total data does not fit into memory (which is the case here), since Feather.jl apparently needs to read the old data before appending. Or would mmap bypass this difficulty? If not, should I use an interim step, e.g. an SQLite database? In any case, the MWE is

using DataStreams
using CSV
using Feather
using DataTables

# data for the MWE
dt = DataTable(A = 1:10, B = 99:-1:90)
CSV.write("/tmp/test1.csv", dt)
CSV.write("/tmp/test2.csv", dt)

# read and copy first file
src1 = CSV.Source("/tmp/test1.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src1))
Data.stream!(src1, snk)

# read and copy second file
src2 = CSV.Source("/tmp/test2.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src2); append = true)
Data.stream!(src2, snk)

# close
Data.close!(snk)

# data is not appended but overwritten
src_all = Feather.Source("/tmp/test.feather")

Using 0.6-rc1, released versions of all packages.
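
For completeness, when everything fits in memory the concatenation is trivial: read both files and vcat them before writing. A sketch, assuming the CSV.read / Feather.write convenience methods work as I expect (this obviously does not scale to the 400 GB case):

using CSV
using DataTables
using Feather

# read each CSV into a DataTable (the sink type is the second argument)
dt1 = CSV.read("/tmp/test1.csv", DataTable)
dt2 = CSV.read("/tmp/test2.csv", DataTable)

# concatenate in memory, then write a single Feather file
Feather.write("/tmp/test_all.feather", vcat(dt1, dt2))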


And this is how I would do it using SQLite; is this the right way?

using SQLite
using CSV
using DataStreams
using Feather

db = SQLite.DB("/tmp/test.db")

# load the first file into a new table
src1 = CSV.Source("/tmp/test1.csv")
SQLite.load(db, "mytable", src1)

# append the second file to the same table
src2 = CSV.Source("/tmp/test2.csv")
SQLite.load(db, "mytable", src2; append = true)

# stream the combined table into a single Feather file
src_all = SQLite.Source(db, "SELECT * FROM mytable")
Feather.write("/tmp/test.feather", src_all)

Did you figure it out?

Nope, wrote my own library:
https://github.com/tpapp/LargeColumns.jl
This is amazingly fast for “ingest once, use repeatedly” workflows (the kernel/mmap does the work transparently). Open an issue if I can help with any extension.
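
The underlying pattern is roughly the following; a minimal sketch of the idea in plain Julia (not the LargeColumns.jl API), using the Mmap standard library:

using Mmap

# ingest once: append raw Int64 values to a flat binary file
open("/tmp/col_a.bin", "w") do io
    for x in 1:10^6
        write(io, Int64(x))
    end
end

# use repeatedly: mmap the file back as a Vector{Int64};
# the kernel pages it in lazily, so nothing is read eagerly
col = open("/tmp/col_a.bin") do io
    Mmap.mmap(io, Vector{Int64})
end

sum(col)  # behaves like an ordinary array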

It says bits types only, so it doesn't work with strings. Anyway, I finally got it to run using CSV, but it's about 3x slower than the (awesome) fread in R's data.table.

One could collect strings and mmap into them, but that requires a different approach. What I do is accumulate strings in a hash table mapping each unique string to an integer, store those integers using the library above, and reconstruct the string column via IndirectArrays.jl. This is very efficient, since I have far more observations than unique strings.
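
Roughly like this; a sketch of the interning scheme, where intern! is a hypothetical helper (only IndirectArray comes from IndirectArrays.jl):

using IndirectArrays

pool = String[]             # unique strings, in order of first appearance
lookup = Dict{String,Int}() # string => its position in pool

# return the integer code for s, adding it to the pool if new
function intern!(s::String)
    get!(lookup, s) do
        push!(pool, s)
        length(pool)
    end
end

# ingest: store only the integer codes (a bits type, so they can go
# through an mmap-backed column as above)
codes = [intern!(s) for s in ["a", "b", "a", "c", "a"]]

# reconstruct the string column without copying the strings
col = IndirectArray(codes, pool)
col[3] == "a"  # true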