Reading Data Is Still Too Slow

davidanthoff · November 24, 2018, 11:47pm

I just tried loading a Feather file that seems roughly similar to the CRSP file described here: 90 million rows, a mix of Int and Float64 columns. The file is about 8.5 GB on disk.

With the current released version of DataFrames.jl, it takes about 80 seconds to load on my system (I do have a very fast system). With the master branch of DataFrames.jl, it takes somewhere between 7 and 15 seconds. All of these numbers are for FeatherFiles.jl.
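For anyone who wants to try this kind of measurement, here is a minimal sketch. It assumes the usual FileIO-style `load`/`save` interface of FeatherFiles.jl, and it uses a tiny made-up table written to a temporary file as a stand-in, not the 8.5 GB file discussed above. The column names `id` and `value` are purely illustrative.

```julia
using DataFrames, FeatherFiles

# Small stand-in table with one Int and one Float64 column (hypothetical data).
df = DataFrame(id = collect(1:1_000), value = rand(1_000))

# The format is detected from the file extension, so the path must end in .feather.
path = tempname() * ".feather"
df |> save(path)

# The first call includes compilation time, so time a second call as well.
t_first = @elapsed(loaded = DataFrame(load(path)))
t_second = @elapsed(loaded = DataFrame(load(path)))

println("first load: ", t_first, " s, second load: ", t_second, " s")
```

On a file this small the numbers mostly reflect compilation and fixed overhead, so they say nothing about the 90-million-row case; the point is only the shape of the measurement.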

So I suspect (or hope) that @iwelch’s numbers from above are with the released DataFrames.jl, in which case we might actually have something very competitive once we get a new DataFrames.jl release out.

One caveat: I haven’t tried a column with missing values yet.
