Reading Data Is Still Too Slow

Could you try CSVFiles.jl on your dataset?

using DataFrames, CSVFiles, FileIO

crsp = load(File(format"CSV", "crspdaily-clean.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame

# or, for an already uncompressed file
crsp = load("crspdaily-clean.csv", nastrings=[ "NA", "NaN", "" ]) |> DataFrame

This indicates that TextParse.jl is often faster than CSV.jl, or at least on par with it. CSVFiles.jl uses TextParse.jl under the hood. The benchmarks I posted there showed some additional overhead when using CSVFiles.jl rather than TextParse.jl directly, but you can eliminate that overhead entirely by using the latest master version of DataFrames.jl.

Having said that, R’s fread is a beast, and there simply isn’t anything in Julia currently that can deliver that level of performance. And I’m only talking about single-threaded reads; once fread starts to use threads, it simply leaves everything else in the dust…
