CSV read performance vs Pandas

Any idea why Pandas’ CSV parser is so awesome? Below is my test for a 20MB file with 100,000 rows x 12 columns, mostly Float64’s.

julia> @benchmark CSV.read("/tmp/test.csv")
  memory estimate:  163.82 MiB
  allocs estimate:  9108142
  minimum time:     1.226 s (22.79% GC)
  median time:      1.350 s (29.04% GC)
  mean time:        1.347 s (28.63% GC)
  maximum time:     1.464 s (32.77% GC)
  samples:          4
  evals/sample:     1

julia> @benchmark Pandas.read_csv("/tmp/test.csv")
  memory estimate:  10.89 KiB
  allocs estimate:  83
  minimum time:     429.263 ms (0.00% GC)
  median time:      439.544 ms (0.00% GC)
  mean time:        440.220 ms (0.00% GC)
  maximum time:     451.383 ms (0.00% GC)
  samples:          12
  evals/sample:     1

shell> ls -l /tmp/test.csv
-rw-r--r--  1 tomkwong  wheel  21168794 Mar  8 22:06 /tmp/test.csv

shell> wc /tmp/test.csv
  100000  100000 21168794 /tmp/test.csv

R’s data.table is also fast

1 Like

I can’t reproduce your timings… for a 100,000 x 12 table of random floats, I get 70ms with Julia and 129ms with Pandas.

Performance does suffer a bit with mixed types. In t2.csv, I changed the first column to ints (notice the huge increase in allocations), but still nothing like your timings. What types are you using?

shell> wc -l /tmp/t.csv
100001 /tmp/t.csv

shell> head /tmp/t.csv

julia> @benchmark CSV.read("/tmp/t.csv")
  memory estimate:  12.31 MiB
  allocs estimate:  196
  minimum time:     68.950 ms (0.00% GC)
  median time:      69.678 ms (0.70% GC)
  mean time:        70.810 ms (2.27% GC)
  maximum time:     118.171 ms (40.81% GC)
  samples:          71
  evals/sample:     1

shell> wc -l /tmp/t2.csv
100001 /tmp/t2.csv

shell> head /tmp/t2.csv

julia> @benchmark CSV.read("/tmp/t2.csv")
  memory estimate:  47.32 MiB
  allocs estimate:  2294086
  minimum time:     95.924 ms (1.56% GC)
  median time:      96.489 ms (1.58% GC)
  mean time:        98.969 ms (3.59% GC)
  maximum time:     154.921 ms (33.72% GC)
  samples:          51
  evals/sample:     1

julia> @benchmark Pandas.read_csv("/tmp/t.csv")
  memory estimate:  11.36 KiB
  allocs estimate:  97
  minimum time:     126.454 ms (0.00% GC)
  median time:      129.265 ms (0.00% GC)
  mean time:        129.198 ms (0.00% GC)
  maximum time:     131.443 ms (0.00% GC)
  samples:          39
  evals/sample:     1

julia> @benchmark Pandas.read_csv("/tmp/t2.csv")
  memory estimate:  11.36 KiB
  allocs estimate:  97
  minimum time:     120.393 ms (0.00% GC)
  median time:      123.140 ms (0.00% GC)
  mean time:        122.649 ms (0.00% GC)
  maximum time:     124.729 ms (0.00% GC)
  samples:          41
  evals/sample:     1

julia> versioninfo(verbose=true)
Julia Version 1.2.0-DEV.unknown
Commit 60457de* (2019-03-08 17:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  uname: Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 unknown
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz:
              speed         user         nice          sys         idle          irq
       #1  3500 MHz    3792225 s       9640 s      88120 s  373573962 s          0 s
       #2  3474 MHz    1353493 s      12106 s      65979 s  377578307 s          0 s
       #3  3500 MHz    1468507 s      13422 s      65814 s  377466150 s          0 s
       #4  3480 MHz    1371725 s      13179 s      68439 s  377578218 s          0 s

  Memory: 62.74964904785156 GB (27196.83203125 MB free)
  Uptime: 3.790843e6 sec
  Load Avg:  0.0  0.0322265625  0.0029296875
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
  HOME = /home/me
  TERM = screen-256color
  PATH = /home/me/.cargo/bin:/home/me/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl

I’m using a somewhat outdated MacBook Pro. Haven’t tried Linux yet. But if you compare with Pandas then you should see the difference at your end.

Here is how I generated the test data. Basically 10 floats, a 5-char string and a 10-char string surrounded by double quotes.

julia> n = 100000; open("/tmp/test.csv", "w") do f
         for i in 1:n
           println(f, join(vcat(rand(10), randstring(5), '"'*randstring(10)*'"'), ","))

Julia’s CSV seems to be nearly 3x slower at parsing floating point numbers with 17 vs 16 digits after the decimal point. I’d file a bug.

julia> n = 100000;
       open("/tmp/test.csv", "w") do f
           for i in 1:n
               println(f, join(vcat(round.(rand(10), digits=16)), ","))

julia> @benchmark CSV.read("/tmp/test.csv")
  memory estimate:  32.50 MiB
  allocs estimate:  1305362
  minimum time:     126.785 ms (0.00% GC)
  median time:      152.950 ms (13.80% GC)
  mean time:        162.647 ms (15.15% GC)
  maximum time:     230.151 ms (36.65% GC)
  samples:          31
  evals/sample:     1

julia> n = 100000;
       open("/tmp/test.csv", "w") do f
           for i in 1:n
               println(f, join(vcat(round.(rand(10), digits=17)), ","))

julia> @benchmark CSV.read("/tmp/test.csv")
  memory estimate:  109.93 MiB
  allocs estimate:  5998616
  minimum time:     273.352 ms (0.00% GC)
  median time:      417.435 ms (22.99% GC)
  mean time:        441.242 ms (27.28% GC)
  maximum time:     767.849 ms (48.71% GC)
  samples:          12
  evals/sample:     1

16 digits is Float64 accuracy. That sounds suspicious…

I have a pretty comprehensive CSV benchmark code here. I ran it last in December and got these numbers.

I would expect that the Float64 with many digits issue in CSV.jl (which you can also see very clearly in my benchmark results: there is one file uniform floats and one with short floats, that refers to the number of digits) should be better with this PR that was just merged but hasn’t made it through METADATA yet.

I’m rerunning my benchmark right now with that PR in Parsers.jl to see how things stack up at this point. But running them takes a while, will probably be another day or so until it is done.


Yep, huge improvement with Parsers#master!
Over a 3x speed improvement and gets rid of most of the allocations with 17 digits:

julia> n = 100000;
       open("/tmp/test.csv", "w") do f
           for i in 1:n
               println(f, join(vcat(round.(rand(10), digits=17)), ","))

julia> @benchmark CSV.read("/tmp/test.csv")
  memory estimate:  10.59 MiB
  allocs estimate:  175
  minimum time:     135.110 ms (0.00% GC)
  median time:      135.647 ms (0.00% GC)
  mean time:        138.449 ms (1.98% GC)
  maximum time:     183.970 ms (25.83% GC)
  samples:          37
  evals/sample:     1

Alright, I reran my benchmark suite, and results are here. There are still some things about that benchmark results page that I want to improve before I announce it more widely, so please do take a peak and give feedback, but maybe don’t yet promote it around because I wouldn’t want folks to cite these numbers yet, before I’ve added a few more things.

In particular, these numbers are for a case where the CSV reader has read a file with exactly the same schema (column #, names and types) before this benchmark is run. That makes the julia packages look artificially good, because all JIT compiling is excluded from these runs (and in particular, JIT compiling that is schema specific).

I have the results for runs where that isn’t the case, just need to figure out how to present them in this figure without making it entirely unwieldy. My current thinking is to add a drop down box so that one can interactively select that, if anyone wants to help out, the way to do that would be to modify the vega-lite spec that powers this here. The easiest workflow for that is to click on the three dots in the top right of the figure on the benchmark page, select “Open in Vega editor”, and then edit the spec there until it does what it should do.


Am I right in thinking that Julia will never match the performance of other optimised CSV readers, because of the need for compilation? TextParse looks pretty fast without compilation but compilation will always be needed, because every CSV file is different and different columns have different types and therefore requires different code to be generated.

Basically, I am starting to form the view that reading from CSV is one area where Julia can’t solve the two language problem, if we care about performance, because of the need for compilation.

More happy to be wrong on this!

TableReader.jl is comparable or often faster than pandas in many real-workd data sets I’ve benchmarked, including the start-up time required for JIT compiling. Here is a quick benchmarking result. The first file (parking-citations.csv) is a 1.2 GB CSV with 19 columns and mixed types including integers, floats, strings, datetimes, missing values. The second file (GSM2230757_human1_umifm_counts.csv) is almost an integer matrix except the first few string columns. This file has more columns than rows (1,938 rows x 20,128 columns) and thus other Julia packages are not good at handling it (see https://github.com/JuliaData/CSV.jl/issues/388). In both files, TableReader.jl is faster than pandas by obvious margin. Of couse, it is written in pure Julia and thus the two-language problem is clearly avoided :wink:


~/w/TableReader (master|…) $ ipython
^PPython 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas

In [2]: %time df = pandas.read_csv("data/parking-citations.csv", low_memory=False)
CPU times: user 33.2 s, sys: 4.31 s, total: 37.5 s
Wall time: 37.5 s

In [3]: %time df = pandas.read_csv("data/GSM2230757_human1_umifm_counts.csv")
CPU times: user 19.2 s, sys: 863 ms, total: 20 s
Wall time: 20 s


~/w/TableReader (master|…) $ julia
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.1.0 (2019-01-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using TableReader

julia> @time df = readcsv("data/parking-citations.csv", chunksize = 0);
 23.972916 seconds (59.34 M allocations: 5.699 GiB, 12.75% gc time)

julia> @time df = readcsv("data/GSM2230757_human1_umifm_counts.csv");
  4.353241 seconds (3.50 M allocations: 1.054 GiB, 55.04% gc time)

I also have a plan to get rid of the Schema specific compile stuff for TextParse.jl.

Given you have access to the data can you try R’s data.table?


Here is a result:

~/w/TableReader (master|…) $ R

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)


> library(data.table)
data.table 1.12.0  Latest news: r-datatable.com
> system.time(data.table::fread("data/parking-citations.csv"))
   user  system elapsed
163.558   4.204  26.779
> system.time(data.table::fread("data/GSM2230757_human1_umifm_counts.csv"))
   user  system elapsed
  1.048   0.151   1.403
1 Like

Oh wow looks like TableReader is very competitve. Hopefully you can fix the ‘|’ delimiter issue so i can test it on the Fannie Mae data.

Submitted a bug report. It doesn’t work on the Fannie Mae data yet.

@bicycle1885 how did you make TableReader so much faster than all other Julia readers? Can give some high level strategies you have followed?

1 Like

I’m going to officially announce TableReader.jl here with some design concepts once it is registered in METADATA (https://github.com/JuliaLang/METADATA.jl/pull/22346). But briefly speaking, I think horizontal tokenization using a finite state automaton and vertical parsing using a string cache etc. make TableReader.jl faster than other packages. These strategies are chosen in my experience of making fast text parsers in the field of bioinformatics (for example, https://github.com/BioJulia/Automa.jl is my product). Also, modest specialization of methods reduces the start-up delay especially when there are a lot of columns in the data file. https://github.com/bicycle1885/TranscodingStreams.jl makes this practical.


Are there any strategies that can further improve the the performance when multithreading comes around? I learned that expertise really matters.

This feels weird. I am not the approach. Clearly the approach is fantastic, but the fact that we need to make this compromise. Of course full specialization will result in faster code, but the compilation will slow it down! This problem seems unique to Julia.

Theoretically, if I build a process to read CSV files in the same format every month (with new data of course), then I would go for the fully compiled approach. In fact I would try to save the compiled code, especially the code is dispatched on all column types. But it feels like for interactive use, more modest specialization is sometimes better. To balance this trade-off in practice increases mental load when developing packages. No solution from me, just some observations that I think are interesting.