Could you try CSVFiles.jl on your dataset?
using DataFrames, CSVFiles, FileIO
crsp = load(File(format"CSV", "crspdaily-clean.csv.gz"), nastrings=[ "NA", "NaN", "" ]) |> DataFrame
# or, for an already uncompressed file
crsp = load("crspdaily-clean.csv", nastrings=[ "NA", "NaN", "" ]) |> DataFrame
This indicates that TextParse.jl is often faster than (or on par with) CSV.jl. CSVFiles.jl uses TextParse.jl under the hood. The benchmarks I posted there showed additional overhead when using CSVFiles.jl instead of pure TextParse.jl, but you can get rid of that entirely by using the latest master version of DataFrames.jl.
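If you want to check that overhead on your own machine, here is a rough sketch of how one might time the two routes. It assumes TextParse.jl is installed alongside CSVFiles.jl and that `csvread` accepts the same `nastrings` keyword. The tiny temp file and all variable names are made up for illustration, and on a file this small the timings mean nothing, so point `path` at your real file.

```julia
using CSVFiles, DataFrames, FileIO, TextParse

# Tiny made-up file so the sketch runs on its own; replace with your real path.
path = joinpath(mktempdir(), "sample.csv")
write(path, "id,ret\n1,0.01\n2,NA\n3,-0.02\n")

na = ["NA", "NaN", ""]

# Warm-up calls, so that compilation time is not counted below.
load(path, nastrings=na) |> DataFrame
csvread(path, nastrings=na)

# Route 1: CSVFiles.jl (TextParse.jl under the hood), piped into a DataFrame.
t_csvfiles = @elapsed begin
    df = load(path, nastrings=na) |> DataFrame
end

# Route 2: TextParse.jl directly; csvread returns (columns, column names).
t_textparse = @elapsed begin
    cols, colnames = csvread(path, nastrings=na)
end

println("CSVFiles.jl:  ", t_csvfiles, " s")
println("TextParse.jl: ", t_textparse, " s")
```

A single `@elapsed` run is noisy, so for anything you intend to report, repeat the measurement several times or use a proper benchmarking package.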
Having said that, R’s fread is a beast, and there simply isn’t anything in Julia currently that can deliver that level of performance. And I’m only talking about single-threaded reads; once fread starts to use threads, it simply leaves everything else in the dust…