I’m playing around with CSV parsers right now (CSV.jl and TextParse.jl), trying to figure out which one is faster. @quinnj showed some impressive numbers in his JuliaCon talk that indicated CSV.jl performs really well, but in my (limited) tests so far, @shashi's TextParse.jl seems to come out ahead. I should caveat that I missed Jacob’s talk, so I’m going off the slides, i.e. I might be missing some context that he gave verbally during the talk.
Have others done perf comparisons? What have you found? Any experience so far with which of these packages does better on which kind of dataset?
Here is the code I used to benchmark things:
using DataFrames, CSV, TextParse, Pandas

# Write a 10-million-row test file: three Float64 columns and one column of 100-character random strings
function writedata()
    df = DataFrame(a=rand(10_000_000), b=rand(10_000_000),
                   c=rand(10_000_000), d=[randstring(100) for i=1:10_000_000])
    writetable("data.csv", df)
end
writedata()

# Read the file with TextParse.jl and wrap the resulting columns in a DataFrame
function foo(filename)
    data, col_names = csvread(filename, pooledstrings=false, type_detect_rows=100)
    return DataFrame([i for i in data], Symbol.(col_names))
end
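# First pass: triggers compilation, so these timings are discarded
@time CSV.read("data.csv");
@time foo("data.csv");
@time read_csv("data.csv");
# Second pass: these are the timings reported
@time CSV.read("data.csv");
@time foo("data.csv");
@time read_csv("data.csv");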
I get the following timings:
# CSV.jl
julia> @time CSV.read("data.csv");
75.822403 seconds (111.18 M allocations: 3.672 GiB, 15.90% gc time)
# TextParse.jl
julia> @time foo("data.csv");
9.249950 seconds (10.01 M allocations: 1.614 GiB, 21.04% gc time)
# Pandas.jl
julia> @time read_csv("data.csv");
28.367430 seconds (32 allocations: 960 bytes)
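A single `@time` per reader is fairly noisy, especially with 15-20% of the time going to GC. One way the comparison could be made a bit more robust is to run each reader several times after a warm-up and keep the fastest run. Below is a minimal sketch of that idea using only Base; `bestof` is a hypothetical helper name, not something provided by CSV.jl, TextParse.jl, or Pandas.jl, and the workload shown is just a cheap stand-in.

```julia
# Hypothetical helper (not part of any of the packages above):
# call `f` once to trigger compilation, then time `n` further calls
# and return the fastest elapsed time in seconds.
function bestof(f, n=3)
    f()                      # warm-up run, timing discarded
    best = Inf
    for _ in 1:n
        t = @elapsed f()     # wall-clock seconds for one call
        best = min(best, t)
    end
    return best
end

# Stand-in workload so the sketch runs without any CSV file on disk.
t = bestof(() -> sum(rand(1000)), 5)
println("best of 5: ", t, " seconds")
```

For the benchmark in this post, the closure would be something like `() -> CSV.read("data.csv")` or `() -> foo("data.csv")`. Taking the minimum rather than the mean is a design choice: it reports the least-disturbed run, at the cost of hiding how much GC pauses vary between runs.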