CSV Reading (rewrite in C?)

I am toying with the idea of writing a very fast CSV parser in C that can handle most of the relevant use cases.

I can write the C parser, but I would need an example of a C function returning a DataFrame to Julia: ideally 3 columns of 4 values each, with each column of type Union{Missing, T}, where T can be String, Float64, or Int64. Is this hard or easy?

1 Like

First question is why don’t you just write it in Julia, and the second is hasn’t that already been done, at least twice?

4 Likes

We just released a new version of CSV.jl that should be faster than ever (faster than other parsers I know of). I’d be interested to hear of any cases that are currently slow.

8 Likes

I thought of C because the one thing C is good at is fast and memory-efficient character-by-character processing. I am happy to abandon the effort.

my last CSV.read() benchmark was from about a month ago. I just picked up the most recent CSV (0.4.1), which I am running on Julia 1.0.0.

CSV: v0.4.1
input test file: 6.2GB file. 89 million lines.
computer: imac pro, 64GB RAM.

Uncompressed Files

Julia

julia> using CSV, GZip, DataFrames

julia> @time x=CSV.read( "cd.csv" );
164.901316 seconds (2.17 G allocations: 47.604 GiB, 19.41% gc time)

julia> @time x=CSV.read( "cd.csv" );
284.668111 seconds (2.13 G allocations: 46.804 GiB, 53.43% gc time)

julia> @time x=CSV.read( "cd.csv" );
301.857143 seconds (2.13 G allocations: 46.804 GiB, 55.82% gc time)

julia> GC.gc()

julia> @time x=CSV.read( "cd.csv" );
299.862575 seconds (2.13 G allocations: 46.804 GiB, 55.61% gc time)

(I know @btime is better for short function benchmarking, but this is one long run.)

(I also tried specifying NA as the missing string, but it made little difference: 231 seconds.)

R

The “best of breed” is the data.table reader. It reads 99% of all csv files I have ever had to deal with, but makes a few heuristic choices from the first 5, middle 5, and final 5 lines.
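To make the idea of guessing column types from a handful of sampled lines concrete, here is a minimal C sketch. It is not fread's actual implementation. The names `col_kind`, `classify_field`, and `widen` are hypothetical, and the sketch assumes unquoted fields, the default "C" locale, and "NA" or an empty field as the missing marker.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical column kinds, ordered so that a larger value is more general. */
typedef enum { KIND_UNKNOWN = 0, KIND_INT, KIND_FLOAT, KIND_STRING } col_kind;

/* Classify one NUL-terminated, unquoted field.
 * "NA" and the empty field carry no type information, so they map to
 * KIND_UNKNOWN. Note that strtod also accepts spellings such as "inf" and
 * "nan", which this sketch therefore treats as floats. */
col_kind classify_field(const char *s)
{
    assert(s != NULL);
    if (*s == '\0' || strcmp(s, "NA") == 0)
        return KIND_UNKNOWN;

    char *end;
    errno = 0;
    (void)strtoll(s, &end, 10);
    if (errno == 0 && end != s && *end == '\0')
        return KIND_INT;            /* the whole field is a base-10 integer */

    errno = 0;
    (void)strtod(s, &end);
    if (errno == 0 && end != s && *end == '\0')
        return KIND_FLOAT;          /* the whole field is a floating-point number */

    return KIND_STRING;
}

/* Widen the running guess for a column using one more sampled field. */
col_kind widen(col_kind so_far, const char *field)
{
    col_kind k = classify_field(field);
    return k > so_far ? k : so_far;
}
```

Calling `widen` on each sampled field of a column gives the narrowest kind that fits every sample. A real reader would still need a fallback for when a later, unsampled row contradicts the guess.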

> library(data.table)
> t= Sys.time(); x= fread("cd.csv"); Sys.time()-t
Read 88915607 rows and 12 (of 12) columns from 6.202 GB file in 00:00:48
Time difference of 48.15 secs
> t= Sys.time(); x= fread("cd.csv"); Sys.time()-t
Read 88915607 rows and 12 (of 12) columns from 6.202 GB file in 00:00:48
Time difference of 48 secs

The standard R read.csv function was far worse.

Conclusion

The Julia CSV.read version is a factor of roughly 3.5 to 6 slower than R’s data.table fread (165 to 302 seconds versus 48 seconds). Both return data frames.

Compressed Files

I care about CSV.read speed especially for large files, which are therefore also often compressed. I took the same file, compressed it with gzip, and reran the test. The R version took about 1.2 minutes, including running the input through the Unix gzcat popen pipeline.

I did not have much luck with the Julia version…

First, my attempt at the most similar Unix-piped version.

julia> open( pipeline( `gzcat crspdaily-head.csv.gz`), "r") do f; CSV.read(f); end
ERROR: MethodError: no method matching position(::Base.Process)
Closest candidates are:
  position(::GZipStream) at /Users/ivo/.julia/packages/GZip/LD2ly/src/GZip.jl:350
  position(::GZipStream, ::Bool) at /Users/ivo/.julia/packages/GZip/LD2ly/src/GZip.jl:350
  position(::Base.SecretBuffer) at secretbuffer.jl:154

I think this fails because CSV.read() is not one-pass: the error shows it calling position() on the stream, which a process pipe does not support. Or I have made another one of my silly mistakes, where I don’t know if it’s me or if it’s Julia.

So, I tried another version, native decompression. I think GZip here fails, but I need both GZip and CSV.read to work together:

julia> @time x= CSV.read( GZip.open("crspdaily-clean.csv.gz", "r") )

signal (11): Segmentation fault: 11
in expression starting at no file:0
MurmurHash3_x64_128 at /Users/osx/buildbot/slave/package_osx64/build/src/support/./MurmurHash3.c:310
memhash_seed at /Users/osx/buildbot/slave/package_osx64/build/src/support/hashing.c:74
hash at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:21 [inlined]
hash at ./hashing.jl:18 [inlined]
hashindex at ./dict.jl:169 [inlined]
ht_keyindex2! at ./dict.jl:309
intern! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:4 [inlined]
intern at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:15 [inlined]
#defaultparser#32 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:157 [inlined]
defaultparser at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:48 [inlined]
#parse!#31 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:40 [inlined]
parse! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:40 [inlined]
#parse!#30 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:38 [inlined]
parse! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:38 [inlined]
#parse!#29 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:36 [inlined]
parse! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:36 [inlined]
#parse!#28 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:34 [inlined]
parse! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:34 [inlined]
#parse!#27 at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:32 [inlined]
parse! at /Users/ivo/.julia/packages/Parsers/WBoUR/src/strings.jl:32 [inlined]
readsplitline at /Users/ivo/.julia/packages/CSV/uLyo0/src/filedetection.jl:195
datalayout at /Users/ivo/.julia/packages/CSV/uLyo0/src/filedetection.jl:123
unknown function (ip: 0x10d82dfef)
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
#File#1 at /Users/ivo/.julia/packages/CSV/uLyo0/src/CSV.jl:165
unknown function (ip: 0x10d82d527)
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
Type at /Users/ivo/.julia/packages/CSV/uLyo0/src/CSV.jl:138 [inlined]
#read#101 at /Users/ivo/.julia/packages/CSV/uLyo0/src/CSV.jl:304
unknown function (ip: 0x10d82bf67)
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
read at /Users/ivo/.julia/packages/CSV/uLyo0/src/CSV.jl:294 [inlined]
read at /Users/ivo/.julia/packages/CSV/uLyo0/src/CSV.jl:294
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
do_call at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:324
eval_stmt_value at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:363 [inlined]
eval_body at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:686
jl_interpret_toplevel_thunk_callback at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:799
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x1093118bf)
unknown function (ip: 0x4)
jl_interpret_toplevel_thunk at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:808
jl_toplevel_eval_flex at /Users/osx/buildbot/slave/package_osx64/build/src/toplevel.c:787
jl_toplevel_eval_in at /Users/osx/buildbot/slave/package_osx64/build/src/builtins.c:622
eval at ./boot.jl:319
eval_user_input at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85
macro expansion at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:117 [inlined]
#28 at ./task.jl:259
jl_apply at /Users/osx/buildbot/slave/package_osx64/build/src/./julia.h:1536 [inlined]
start_task at /Users/osx/buildbot/slave/package_osx64/build/src/task.c:268
Allocations: 39401485 (Pool: 39388832; Big: 12653); GC: 84
Segmentation fault: 11

2 Likes

Can you give more details about the CSV file? What’s the number of columns and their types? Could you provide a sample somewhere?

1 Like

Do you know if fread() is using multiple cores? That could easily explain the timing difference if CSV.read() is not using many cores.

Instead of using the GZip package, which is known to be quite slow, I recommend using CodecZlib (and TranscodingStreams). I have found it to work very well, with almost no overhead when handling compressed files.

4 Likes

I believe the latest versions of R data.table do use multiple threads. @iwelch, could you run getDTthreads() and rerun your timing with setDTthreads(1) for a fair comparison?

I do second the need for a text-column data file reader in Julia with speed comparable to fread, but it might be easier to build on their implementation rather than redo it. Lots of work went into fread.
I think that they are already porting it to python, so maybe we could try to get them interested in julia.

1 Like

I had to switch computers to my plain iMac 2017. The timing here is 41.27 seconds. Looking at my macOS system monitor, the R process consumes 100% of the CPU, but only one CPU. The data.table version is 1.10.4, and it does not have a getDTthreads() function, so I am going to guess this is single-threaded.

the input file itself is

"permno","yyyymmdd","prc","vol","ret","shrout","openprc","numtrd","retx","vwretd","ewretd","eom"
10000,19860108,-2.5,12800,-0.02439,3680,NA,NA,-0.02439,-0.020744,-0.005117,0
10000,19860109,-2.5,1400,0,3680,NA,NA,0,-0.011219,-0.011588,0
10000,19860110,-2.5,8500,0,3680,NA,NA,0,0.000083,0.003651,0
10000,19860113,-2.625,5450,0.05,3680,NA,NA,0.05,0.002749,0.002433,0
10000,19860114,-2.75,2075,0.047619,3680,NA,NA,0.047619,0.000366,0.004474,0
10000,19860115,-2.875,22490,0.045455,3680,NA,NA,0.045455,0.008206,0.007693,0
10000,19860116,-3,10900,0.043478,3680,NA,NA,0.043478,0.004702,0.00567,0
10000,19860117,-3,8470,0,3680,NA,NA,0,-0.001741,0.003297,0
10000,19860120,-3,1000,0,3680,NA,NA,0,-0.003735,-0.001355,0
10000,19860121,-3,1000,0,3680,NA,NA,0,-0.006992,-0.003472,0
10000,19860122,-3,2700,0,3680,NA,NA,0,-0.009593,-0.004588,0
10000,19860123,-3.75,24000,0.25,3680,NA,NA,0.25,0.002664,0.001397,0
10000,19860124,-4.1875,11372,0.116667,3680,NA,NA,0.116667,0.009684,0.006771,0
10000,19860127,-4.4375,16570,0.059701,3680,NA,NA,0.059701,0.004343,0.00214,0
10000,19860128,-4.4375,9600,0,3680,NA,NA,0,0.009632,0.003179,0
10000,19860129,-4.3125,24505,-0.028169,3680,NA,NA,-0.028169,0.002445,-0.000248,0
10000,19860130,-4.4375,8600,0.028986,3680,NA,NA,0.028986,-0.003073,0.000895,0
10000,19860131,-4.375,4650,-0.014085,3680,NA,NA,-0.014085,0.009399,0.00539,1
10000,19860203,-4.375,3700,0,3680,NA,NA,0,0.008703,0.002844,0
...

but of course with another 90 million or so lines. Overall statistics are

                  mean           sd    tstat      NOK      NNA
permno      53440.3931   28732.7801  17538.0 88915607        0
yyyymmdd 19888762.6735  197981.5946 947266.0 88915607        0
prc            22.6829     913.5692    234.1 88915607        0
vol        337175.5352 3317740.5384    923.5 82564778  6350829
ret             0.0008       0.0430    175.4 88915607        0
shrout      38923.9254  212222.0158   1729.5 88915607        0
openprc        35.1874    1213.0264    205.7 50296418 38619189
numtrd       1012.6867    5928.4233    906.5 28164270 60751337
retx            0.0007       0.0431    153.2 88915607        0
vwretd          0.0004       0.0104    362.7 88915607        0
ewretd          0.0008       0.0090    838.2 88915607        0
eom             0.0475       0.2127   2105.8 88915607        0

A cat of the file to /dev/null from cache takes under a second. Converting 89 million strings to floats takes about 2-4 seconds, so converting all 12 fields should take about 25-50 seconds. (A ‘wc’ takes about 15 seconds.)
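For reference, the per-field conversion being timed can be sketched in C roughly as follows. This is a minimal illustration, not a benchmarked parser. The function name `parse_numeric_line` is hypothetical, and it assumes unquoted, purely numeric fields, "NA" or empty fields for missing values, and the default "C" locale (so that a comma is never read as a decimal separator).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parse one unquoted, comma-separated line of numeric fields.
 * values[i] receives the parsed number; missing[i] is set to 1 for "NA",
 * empty, or otherwise unparseable fields.
 * Returns the number of fields seen; only the first max_fields are stored. */
size_t parse_numeric_line(const char *line, double *values,
                          unsigned char *missing, size_t max_fields)
{
    assert(line != NULL && values != NULL && missing != NULL);
    size_t n = 0;
    const char *p = line;

    for (;;) {
        /* The field runs from p up to the next comma or end of line. */
        size_t len = strcspn(p, ",\r\n");
        if (n < max_fields) {
            char *end;
            double v = strtod(p, &end);
            /* Accept the value only if strtod consumed exactly the field. */
            if (len > 0 && end == p + len) {
                values[n] = v;
                missing[n] = 0;
            } else {
                values[n] = 0.0;
                missing[n] = 1;
            }
        }
        n++;
        p += len;
        if (*p != ',')
            break;      /* end of line or end of string */
        p++;            /* skip the comma */
    }
    return n;
}
```

Run on the first data row shown above, this should yield 12 fields, with the two NA fields (openprc and numtrd) flagged as missing.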

fread() looks pretty good. alas, I have no idea how difficult it would be to adopt their code into julia. this is beyond me, as the main skillset here is presumably being able to interface their C to julia.

/iaw

julia> using CodecZlib, GZip

julia> @time x=read(GzipDecompressorStream(open("crspdaily-clean.csv.gz", "r")));
 29.880366 seconds (778.59 k allocations: 8.039 GiB, 0.66% gc time)

julia> @time x=read(GZip.gzopen("crspdaily-clean.csv.gz"))
116.542313 seconds (180.13 k allocations: 8.025 GiB, 0.07% gc time)

This was good advice.

4 Likes

OK. One way to make CSV.jl faster would be to pass allowmissing=:auto, since some of your columns have no missing values.

I remember at one point we tried using the RCall package, but efficiency was lost in the transfer from R to Julia.

1 Like

I don’t know much about CSV.jl but it looks like you have to tell it explicitly that NA is missing as well, so something like:
CSV.read("filename",allowmissing=:auto,missingstring="NA")

Not actually slow, but at the bottom of the notebook here there is a recent benchmark showing that, at least for this case, CSVFiles.jl is a tad faster.

R data.table is the fastest CSV reading tool I have ever used. It uses multiple threads and is written in C. I do not think you can match the performance of fread by writing a new CSV parser in C from scratch. After all, data.table gets support from H2O, with some full-time big-data programmers, and has been optimized for a long time. There may be more efficient ways of doing this in a pure Julia style, but I doubt there is much unexplored room for performance improvement in such a task. You can use RCall to call data.table::fread from Julia.

If there is no specific reason to stick with CSV files, I suggest you try data formats like hdf5, feather, and fst. They consume much less disk space and have faster read/write speeds than delimited text files.

Here is a comparison of read/write and compression efficiency among multiple packages in R. You can see that fst is the winner here. It uses multiple cores and depends on data.table. I normally retrieve data from CRSP or OptionMetrics in R via the PostgreSQL API provided by WRDS, and then write the data to disk in fst format for future use. I have tried all kinds of tools for such tasks, but fst is the fastest one.

2 Likes

I have noticed people make arguments like this in the Julia community. When package A is slower than package B, they claim the comparison is not fair because package B uses multiple threads… Why not make package A use multiple threads too? I personally think the goal is to get the best performance, not to win a so-called fair comparison game.

When users post a question asking why a function in Matlab is faster than its counterpart in Julia, the first answer to expect here is that Matlab uses multiple threads and the comparison is unfair. I don’t think most users really care about the fairness; they just need the performance.

7 Likes

I don’t think anyone is saying “Let’s just do everything single-threaded and always be slower than everything that’s multi-threaded”. Rather, I think the attitude is “Eventually we definitely have to make this multi-threaded, but right now it isn’t, so is there any reason I’m being slower than I need to be before I get to that point?” Usually when people make these statements, it’s not about competing in some kind of contest; it’s about looking for broken code.

6 Likes

Thanks, I’ll take a look; CSV.write has had much more consideration lately for correctness than performance, so I’m not surprised it’s a bit slower here.

1 Like

Personally, I think it is just interesting to have both data points to figure out if there is a clear opportunity for Julia to do better right now.

3 Likes

  • my fread was not multi-threaded, and it was still about 5 times faster than Julia’s.

  • .csv.gz fits a specific need: it is the common interop format for many software packages and the delivery format for a lot of data. Yes, CSV sucks. We still need it.

  • I would presume the native Julia Serialization, compressed, would be the way to go for fast read/write. for one, it understands Julia data types.

  • my problem writing experimental C code was how to pass back a vector of Vector{Union{Missing, T}} columns. The rest does not seem too difficult for C. CSV parsing seems to be one of the few tasks for which C is well suited.
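One way to frame the C side of that last point is to have the parser return plain typed buffers, each with a parallel missing mask, and leave the construction of `Union{Missing, T}` vectors to Julia. The sketch below builds the 3-column, 4-row example asked for at the top of the thread. All names (`column`, `table`, `make_example_table`, `free_table`) are hypothetical, and the Julia wrapper, not shown, is an assumption: it would presumably read these buffers through `ccall` and copy them into Julia vectors.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { COL_INT64, COL_FLOAT64, COL_STRING } col_type;

/* One column: a typed value buffer plus a parallel "missing" mask. */
typedef struct {
    col_type type;
    size_t   nrows;
    uint8_t *missing;          /* nrows flags; 1 means the value is missing */
    union {
        int64_t     *i64;
        double      *f64;
        const char **str;      /* borrowed NUL-terminated strings */
    } data;
} column;

typedef struct {
    size_t  ncols;
    column *cols;
} table;

/* Allocate zero-filled buffers for one column; returns 1 on success. */
int init_column(column *c, col_type type, size_t nrows)
{
    assert(c != NULL);
    void *buf = NULL;
    c->type = type;
    c->nrows = nrows;
    c->missing = calloc(nrows, sizeof(uint8_t));
    switch (type) {
    case COL_INT64:   buf = c->data.i64 = calloc(nrows, sizeof(int64_t));      break;
    case COL_FLOAT64: buf = c->data.f64 = calloc(nrows, sizeof(double));       break;
    case COL_STRING:  buf = c->data.str = calloc(nrows, sizeof(const char *)); break;
    }
    return c->missing != NULL && buf != NULL;
}

/* Release everything owned by the table (the strings are not owned). */
void free_table(table *t)
{
    if (t == NULL)
        return;
    for (size_t j = 0; t->cols != NULL && j < t->ncols; j++) {
        column *c = &t->cols[j];
        free(c->missing);
        switch (c->type) {
        case COL_INT64:   free(c->data.i64); break;
        case COL_FLOAT64: free(c->data.f64); break;
        case COL_STRING:  free(c->data.str); break;
        }
    }
    free(t->cols);
    free(t);
}

/* Build a 3-column, 4-row table with one missing entry per column. */
table *make_example_table(void)
{
    static const col_type types[3] = { COL_INT64, COL_FLOAT64, COL_STRING };
    static const char *const labels[4] = { "a", "b", "c", "d" };

    table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->ncols = 3;
    t->cols = calloc(t->ncols, sizeof *t->cols);
    if (t->cols == NULL) { free(t); return NULL; }
    for (size_t j = 0; j < t->ncols; j++)
        if (!init_column(&t->cols[j], types[j], 4)) { free_table(t); return NULL; }

    for (size_t i = 0; i < 4; i++) {
        t->cols[0].data.i64[i] = 10000 + (int64_t)i;
        t->cols[1].data.f64[i] = -2.5 - 0.125 * (double)i;
        t->cols[2].data.str[i] = labels[i];
    }
    t->cols[0].missing[3] = 1;         /* Int64 column: last row missing */
    t->cols[1].missing[1] = 1;         /* Float64 column: second row missing */
    t->cols[2].missing[2] = 1;         /* String column: third row missing */
    t->cols[2].data.str[2] = NULL;
    return t;
}
```

Keeping the mask separate from the values means the C code never needs to know how Julia represents `missing`. The cost is one extra copy on the Julia side, and the caller must eventually call `free_table`.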