Concatenate CSV data into Feather using DataStreams

I have about 400 GB of data in 15 .csv.gz files, roughly 2e9 records with 20 columns, with the same schema in each file. I would like to read all of them and save the result in a single .feather file for further analysis.

I could not figure out how to do this using DataStreams: I tried append = true, but the file is overwritten instead of appended (MWE below).

Also, I am wondering whether this is possible at all when the total data does not fit into memory (which is the case here), since Feather.jl apparently needs to read the old data before appending. Or would mmap bypass this difficulty? If not, should I use an interim step, e.g. an SQLite database? In any case, the MWE is

using DataStreams
using CSV
using Feather
using DataTables

# data for the MWE
dt = DataTable(A = 1:10, B = 99:-1:90)
CSV.write("/tmp/test1.csv", dt)
CSV.write("/tmp/test2.csv", dt)

# read and copy first file
src1 = CSV.Source("/tmp/test1.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src1))
Data.stream!(src1, snk)

# read and copy second file
src2 = CSV.Source("/tmp/test2.csv")
snk = Feather.Sink("/tmp/test.feather", Data.schema(src2); append = true)
Data.stream!(src2, snk)

# close
Data.close!(snk)

# data is not appended but overwritten
src_all = Feather.Source("/tmp/test.feather")

Using 0.6-rc1, released versions of all packages.
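
For completeness, when everything fits in memory the concatenation is trivial: read both files and vcat them before writing. A sketch, assuming the CSV.read / Feather.write convenience methods work as I expect (this obviously does not scale to the 400 GB case):

using CSV
using DataTables
using Feather

# read each CSV into a DataTable (the sink type is the second argument)
dt1 = CSV.read("/tmp/test1.csv", DataTable)
dt2 = CSV.read("/tmp/test2.csv", DataTable)

# concatenate in memory, then write a single Feather file
Feather.write("/tmp/test_all.feather", vcat(dt1, dt2))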


And this is how I would do it using SQLite; is this the right way?

using SQLite
using CSV
using DataStreams
using Feather

db = SQLite.DB("/tmp/test.db")

# load the first file into a new table
src1 = CSV.Source("/tmp/test1.csv")
SQLite.load(db, "mytable", src1)

# append the second file to the same table
src2 = CSV.Source("/tmp/test2.csv")
SQLite.load(db, "mytable", src2; append = true)

# stream the combined table into a single Feather file
src_all = SQLite.Source(db, "SELECT * FROM mytable")
Feather.write("/tmp/test.feather", src_all)

Did you figure it out?

Nope, wrote my own library:
https://github.com/tpapp/LargeColumns.jl
This is amazingly fast for “ingest once, use repeatedly” workflows (the kernel/mmap does the work transparently). Open an issue if I can help with any extension.
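
The underlying pattern is roughly the following; a minimal sketch of the idea in plain Julia (not the LargeColumns.jl API), using the Mmap standard library:

using Mmap

# ingest once: append raw Int64 values to a flat binary file
open("/tmp/col_a.bin", "w") do io
    for x in 1:10^6
        write(io, Int64(x))
    end
end

# use repeatedly: mmap the file back as a Vector{Int64};
# the kernel pages it in lazily, so nothing is read eagerly
col = open("/tmp/col_a.bin") do io
    Mmap.mmap(io, Vector{Int64})
end

sum(col)  # behaves like an ordinary array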

It says bits types only, so it doesn't work with strings. Anyway, I finally got it to run using CSV, but it's about 3x slower than the (awesome) fread in R's data.table.

One could collect strings and mmap into them, but that requires a different approach. What I do is accumulate strings in a hash table mapping each unique string to an integer, store those integers using the library above, and reconstruct the string column via IndirectArrays.jl. This is very efficient, since I have far more observations than unique strings.
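
Roughly like this; a sketch of the interning scheme, where intern! is a hypothetical helper (only IndirectArray comes from IndirectArrays.jl):

using IndirectArrays

pool = String[]             # unique strings, in order of first appearance
lookup = Dict{String,Int}() # string => its position in pool

# return the integer code for s, adding it to the pool if new
function intern!(s::String)
    get!(lookup, s) do
        push!(pool, s)
        length(pool)
    end
end

# ingest: store only the integer codes (a bits type, so they can go
# through an mmap-backed column as above)
codes = [intern!(s) for s in ["a", "b", "a", "c", "a"]]

# reconstruct the string column without copying the strings
col = IndirectArray(codes, pool)
col[3] == "a"  # true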