CSV Reading (rewrite in C?)

This has been discussed on Slack before. I think it would be useful, but it’s not always practical, since a lot of files contain proprietary data and it’s not always obvious which feature of the table is causing issues.

But this file in particular seems straightforward, and others may be as well. Couldn’t hurt I’d say.

BioJulia has a repo like this for all the weird file formats used in bioinformatics: https://github.com/BioJulia/BioFmtSpecimens

Rather than storing big CSV files for reference data, it might work better to have Julia code that generates CSV files with the desired characteristics. That would also make it easier to ramp up the file size as desired.
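A rough sketch of what such a generator could look like, using only the standard library. The helper name write_synthetic_csv, the column names, and the missing_frac knob are all made up for illustration; a real test-data generator would want many more options (quoting, dates, odd delimiters, and so on).

```julia
using Random

# Hypothetical helper: writes `nrows` rows of mixed-type columns to `path`.
# `missing_frac` is the approximate share of empty cells in the price column.
function write_synthetic_csv(path::AbstractString, nrows::Integer;
                             seed::Integer = 1, missing_frac::Float64 = 0.1)
    rng = MersenneTwister(seed)          # seeded so the file is reproducible
    open(path, "w") do io
        println(io, "id,price,ticker,volume")
        for i in 1:nrows
            # Float column with occasional empty (missing) fields
            price  = rand(rng) < missing_frac ? "" :
                     string(round(100 * rand(rng); digits = 4))
            ticker = randstring(rng, 'A':'Z', 4)   # short string column
            volume = rand(rng, 0:10^6)             # integer column
            println(io, i, ',', price, ',', ticker, ',', volume)
        end
    end
    return path
end

# Ramp the size by changing the row count.
path = write_synthetic_csv(joinpath(mktempdir(), "synthetic.csv"), 1_000)
println(countlines(path), " lines written to ", path)
```

Because the generator is seeded, anyone can regenerate the same file locally instead of downloading it.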

I have posted a test file, “shuffled.csv.gz”, at http://www.ivo-welch.info/unprofessional/shuffled.csv.gz . It is a very big 2GB file! It is to be used only for testing purposes, even though each column has been independently shuffled a few times beyond recognition. I will remove this file in a few days, so anyone who wants to use it for CSV.read testing is welcome to download it ASAP. Gunzip it before reading.

Julia

julia> using CSV

julia> @time d=CSV.read("shuffled.csv");
180.171313 seconds (2.17 G allocations: 47.589 GiB, 20.86% gc time)

julia> GC.gc()

julia> @time d=CSV.read("shuffled.csv");
336.656964 seconds (2.13 G allocations: 46.798 GiB, 57.26% gc time)

julia> varinfo(r"d")
name       size summary
–––– –––––––––– ––––––––––––––––––––––––––––––––
d    10.886 GiB 88915607×12 DataFrames.DataFrame

Warning: before you play with this file, make sure you have a lot of memory. The file is 7GB uncompressed (which turns into 11GB in memory as d)! Before running it a second time, it seems important to first release the first-time d with d=nothing, or else your time will double. That is weird, because I have 25GB free. (Is it really using 46GB of memory at the same time, or just sequentially?) The second run should take just a teeny bit longer, but instead it takes about twice as long. (Is it releasing and grabbing memory inefficiently?)
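On the 46GB question: as far as I know, the GiB figure that @time prints is the total allocated over the whole call, not the peak held at any one moment. A minimal sketch using only Base, where the helper name churn is made up for illustration, shows a function that never holds more than about 1 MB live yet reports roughly 100 MB allocated:

```julia
# Allocate `reps` temporary vectors one after another; only one is live at a time.
function churn(reps::Int, n::Int)
    total = 0.0
    for _ in 1:reps
        v = rand(n)        # fresh allocation on every iteration
        total += sum(v)    # use v so the work is not optimized away
    end
    return total
end

churn(1, 10)                              # warm up so compilation is not measured
bytes = @allocated churn(100, 125_000)    # 100 vectors of about 1 MB each
println("cumulative bytes allocated: ", bytes)
```

So the 47.589 GiB in the transcript does not by itself mean that much memory was resident at once. It does not explain the slowdown on the second read, though.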

R

> library(data.table)

> t=Sys.time(); d=fread("shuffled.csv"); Sys.time() - t
Read 88915607 rows and 12 (of 12) columns from 6.202 GB file in 00:00:49
Time difference of 48.64 secs
> object.size(d)
6401926672 bytes

This is stable in terms of time on a second loading.

I hope this helps with CSV development, and thanks to the folks who have been working on it.

Incidentally, there are more CSV readers sprinkled in other places:

  • The DelimitedFiles standard library has readdlm and writedlm, but they work only with arrays, not with data frames.

  • TextParse.csvread can read columns, with options for types.

  • There is also JuliaDB.loadtable.

and probably a few more that I have forgotten.
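For the DelimitedFiles case, a small round trip shows what "only for arrays" means in practice. This is just a sketch with a throwaway temp file; the matrix contents and file name are arbitrary.

```julia
using DelimitedFiles

A = [1.0 2.5; 3.0 4.5; 5.0 6.5]
path = joinpath(mktempdir(), "small.csv")

writedlm(path, A, ',')            # write comma-delimited rows, no header
B = readdlm(path, ',', Float64)   # read back as a plain matrix

println(typeof(B))                # a Float64 matrix, not a DataFrame
println(B == A)
```

There are no column names or per-column types here, which is why it is not a drop-in replacement for a data frame reader.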

CSV reading and writing seem ripe for refactoring, with one super-efficient column reader and writer with options, presumably in DelimitedFiles (because it ships with Julia), and other versions that build on it.

Is it possible to round floats to one less significant digit when reading a CSV file using CSV.jl? If yes, how? Sorry if the question seems silly, as I am quite new to Julia.

Both JuliaDB.loadtable and CSVFiles.jl use TextParse.jl under the hood, so in terms of performance and what files these packages can read, they should all be pretty much equivalent. TextParse.jl is very fast on Julia 0.6, but the current version on Julia 1.0 has a major performance regression. I’m optimistic that it can be fixed and currently have a branch that improves things a fair bit, but until that is sorted out, you won’t see the great speed that we had on Julia 0.6.

Is CSV.jl’s codebase the same as CSVFiles.jl’s, or different?

If they are different, would it make sense to unify them?

No, they’re completely different.

Yes, I think the plan is to improve CSV.jl so that it can replace TextParse efficiently in all cases. Help welcome.

CSVFiles.jl doesn’t have any parser, it really just provides TableTraits.jl and FileIO.jl integration, all the parsing is done in TextParse.jl. TextParse.jl and CSV.jl are separate codebases.

CSV.jl and TextParse.jl have such distinct internal designs that I don’t think it would make sense to unify them. Also, CSV.jl was really just rewritten from scratch.

Yes, it wouldn’t be too hard to limit the number of digits read, but hopefully we could fix the performance issue without having to round.
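In the meantime, rounding after parsing is possible with Base alone. This is only a sketch of post-hoc rounding, not a CSV.jl option, and the sample values and the choice of 5 significant digits are assumptions for illustration.

```julia
# Strings as they might appear in a CSV column
raw = ["3.14159", "2.71828", "0.000123456"]

vals    = parse.(Float64, raw)           # parse at full precision first
rounded = round.(vals; sigdigits = 5)    # then keep 5 significant digits

println(rounded)
```

The same round.(col; sigdigits = n) call would apply to a Float64 column of a data frame once the file has been read.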