CSV parsing performance


#1

I’m playing around with CSV parsers right now (CSV.jl and TextParse.jl), trying to figure out which one is faster. @quinnj showed some impressive numbers in his JuliaCon talk that indicated CSV.jl performs really well, but in my (limited) tests so far, @shashi’s TextParse.jl seems to come out ahead. I should caveat that I missed Jacob’s talk, so I’m going off the slides, i.e. I might be missing some context that he gave verbally during his talk.

Have others done some perf comparisons? What have you found? Any experience so far with which of these packages does better on which kind of dataset?

Here is the code I used to benchmark things:

using DataFrames, CSV, TextParse, Pandas

function writedata()
    df = DataFrame(a=rand(10_000_000), b=rand(10_000_000),
        c=rand(10_000_000), d=[randstring(100) for i=1:10_000_000])

    writetable("data.csv", df)
end

writedata()

function foo(filename)
    # Read with TextParse.jl, then wrap the columns in a DataFrame
    data, col_names = csvread(filename, pooledstrings=false, type_detect_rows=100)
    return DataFrame([i for i in data], Symbol.(col_names))
end

# First run: includes compilation time, so these timings are discarded
@time CSV.read("data.csv");
@time foo("data.csv");
@time read_csv("data.csv");

# Second run: these are the timings reported below
@time CSV.read("data.csv");
@time foo("data.csv");
@time read_csv("data.csv");

I get the following timings:

# CSV.jl
julia> @time CSV.read("data.csv");
 75.822403 seconds (111.18 M allocations: 3.672 GiB, 15.90% gc time)

# TextParse.jl
julia> @time foo("data.csv");
  9.249950 seconds (10.01 M allocations: 1.614 GiB, 21.04% gc time)

# Pandas.jl
julia> @time read_csv("data.csv");
 28.367430 seconds (32 allocations: 960 bytes)
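
Each of the timings above comes from a single `@time` call after one warm-up run, so it can be skewed by GC pauses or other activity on the machine. A minimal sketch of a more repeatable measurement follows. The helper `time_reader` and the stand-in reader `count_lines` are hypothetical names used for illustration only; they are not part of CSV.jl, TextParse.jl or Pandas.jl, and the sketch uses only Base Julia so that it runs without any packages installed.

```julia
# Hypothetical helper: call the reader once to trigger compilation,
# then time `runs` further calls and report the fastest one.
function time_reader(reader, filename; runs=3)
    reader(filename)  # warm-up call, includes JIT compilation
    times = [@elapsed(reader(filename)) for _ in 1:runs]
    return minimum(times)  # the minimum is least affected by GC pauses and noise
end

# Stand-in reader (an assumption for this sketch): counts the lines in a file.
count_lines(filename) = countlines(filename)

# Write a small temporary CSV file so the sketch is self-contained.
path, io = mktemp()
for i in 1:1000
    println(io, i, ",", rand())
end
close(io)

t = time_reader(count_lines, path)
println("best of 3 runs: ", t, " seconds")
```

With the packages from the benchmark loaded, a call such as `time_reader(foo, "data.csv")` would take the place of the stand-in reader. On a file this large, each extra run adds a noticeable wait, so a small `runs` value is probably the practical choice.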

#2

Here are some more timings for reading the same file in R, first with data.table’s fread and then with readr’s read_csv:

> system.time(fread("data.csv"))
Read 10000000 rows and 4 (of 4) columns from 1.498 GB file in 00:00:57
   user  system elapsed 
  54.76    0.68   56.57 


> system.time(read_csv("data.csv"))
Parsed with column specification:
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_character()
)
|===========================================================| 100% 1533 MB
   user  system elapsed 
  35.91    0.63   37.68 

#3

Is it possible that @quinnj hasn’t published the work he did in the CSV-parsing arms race yet? The TextParse timing is so impressive!


#4

Yes, I tried to reiterate a few times during my talk that the figures I shared were based on the combination of 3 unmerged branches against Base Julia and 2 unmerged branches against DataStreams.jl & CSV.jl. I’m just coming off summer vacation, so I’ll try to run your example and post the timings across all my branches (if they still work :S). Top of my TODO list is getting all those branches merged in the next few weeks.


#5

There it is: https://youtu.be/z1azbEDDIy8?t=7m27s