Reading Data Is Still Too Slow

davidanthoff · November 23, 2018, 6:08pm

Could you try CSVFiles.jl on your dataset?

using DataFrames, CSVFiles, FileIO

crsp = load(File(format"CSV", "crspdaily-clean.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame

# or, for an already uncompressed file
crsp = load("crspdaily-clean.csv", nastrings=[ "NA", "NaN", "" ]) |> DataFrame

This indicates that TextParse.jl is often faster than (or on par with) CSV.jl. CSVFiles.jl uses TextParse.jl under the hood. The benchmarks I posted there showed additional overhead when using CSVFiles.jl rather than TextParse.jl directly, but you can eliminate that overhead entirely by using the latest master version of DataFrames.jl.

Having said that, R’s fread is a beast, and there simply isn’t anything in Julia currently that can deliver that level of performance. And I’m only talking about single-threaded reads; once fread starts to use threads, it simply leaves everything else in the dust…
