CSV Reading (rewrite in C?)

kevbonham · September 28, 2018, 10:56am

This has been discussed on Slack before. I think it would be useful, but it’s not always practical, since a lot of files contain proprietary data and it’s not always obvious which feature of the table is causing issues.

But this file in particular seems straightforward, and others may be as well. Couldn’t hurt I’d say.

kevbonham · September 28, 2018, 10:59am

BioJulia has a repo like this for all the weird file formats used in bioinformatics: BioJulia/BioFmtSpecimens on GitHub, a collection of bioinformatics file format specimens to test against.

tshort · September 28, 2018, 12:26pm

Rather than storing big CSV files for reference data, it might work better to have Julia code that generates CSV files with the desired characteristics. That would also make it easier to ramp the file size as desired.
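As a rough illustration of that idea, here is a minimal sketch that writes a synthetic CSV file of any requested size using only the standard library. The function name `generate_csv`, the column layout, and the `missing_frac` parameter are assumptions made up for this example; they are not part of any existing package.

```julia
using Random

# Hypothetical helper: write `nrows` rows of synthetic CSV data to `path`.
# Columns: an integer id, a random float, and a short random string.
# Roughly a fraction `missing_frac` of the float cells is left empty.
function generate_csv(path::AbstractString, nrows::Integer;
                      missing_frac::Float64 = 0.1, seed::Integer = 42)
    rng = MersenneTwister(seed)          # fixed seed keeps the file reproducible
    open(path, "w") do io
        println(io, "id,value,label")    # header row
        for i in 1:nrows
            value = rand(rng) < missing_frac ? "" : string(rand(rng))
            label = randstring(rng, 'a':'z', 5)
            println(io, i, ',', value, ',', label)
        end
    end
    return path
end

# Ramp the size by changing the row count.
path = generate_csv(joinpath(mktempdir(), "synthetic.csv"), 1_000)
```

Other troublesome features (quoted fields, odd delimiters, very wide tables) could be added as further keyword arguments in the same way.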

iwelch · September 28, 2018, 2:50pm

I have posted a test file “shuffled.csv.gz” at http://www.ivo-welch.info/unprofessional/shuffled.csv.gz , a very big 2GB file! It is to be used only for testing purposes, even though each column has been independently shuffled a few times beyond recognition. I will remove this file in a few days. Anyone who wants to use it for CSV.read testing is welcome to download it ASAP. Gunzip it first.

Julia

julia> using CSV

julia> @time d=CSV.read("shuffled.csv");
180.171313 seconds (2.17 G allocations: 47.589 GiB, 20.86% gc time)

julia> GC.gc()

julia> @time d=CSV.read("shuffled.csv");
336.656964 seconds (2.13 G allocations: 46.798 GiB, 57.26% gc time)

julia> varinfo(r"d")
name       size summary
–––– –––––––––– ––––––––––––––––––––––––––––––––
d    10.886 GiB 88915607×12 DataFrames.DataFrame

Warning: before you play with this file, make sure you have a lot of memory. The file is 7GB uncompressed (which turns into 11GB in memory as d)! Before running it a second time, it seems important to first remove the first-time d with d=nothing, or else your time will double. That is weird, because I have 25GB free. (Is it really using 46GB of memory at the same time, or just sequentially?) The second run should take just a teeny bit longer, but instead it takes almost twice as long. (Is it releasing and grabbing memory inefficiently?)
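The d=nothing workaround can be sketched as follows. This is only an illustration of the pattern: `load_table` is a made-up stand-in for the real CSV read, and the sizes are tiny so it runs anywhere.

```julia
# Hypothetical stand-in for a large read; swap in the real CSV call when testing.
load_table(n) = [rand(n) for _ in 1:12]   # 12 columns of Float64

d = load_table(10_000)    # first load
d = nothing               # drop the only reference to the first result
GC.gc()                   # ask the garbage collector to reclaim it now

# Second load, timed without the first copy still being referenced.
elapsed = @elapsed d = load_table(10_000)
```

Whether this explains the doubled time on the real file is not something this sketch can show; it only makes the second measurement independent of the first result.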

R

> library(data.table)

> t=Sys.time(); d=fread("shuffled.csv"); Sys.time() - t
Read 88915607 rows and 12 (of 12) columns from 6.202 GB file in 00:00:49
Time difference of 48.64 secs
> object.size(d)
6401926672 bytes

This is stable in terms of time on a second loading.

I hope this helps with CSV development. Thanks to the folks who have been working on it.

iwelch · September 30, 2018, 4:45pm

Incidentally, there are more CSV readers sprinkled in other places:

The DelimitedFiles standard library has readdlm and writedlm, but they do not work for data frames, only for arrays.
TextParse.csvread can read columns, with options for types.
There is also JuliaDB.loadtable.

And probably a few more that I have forgotten.
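For reference, the array-only behaviour of readdlm and writedlm looks roughly like this. It is a minimal sketch; the matrix contents and the temporary file name are made up for the example.

```julia
using DelimitedFiles

A = [1.0 2.5; 3.0 4.5]                       # a plain matrix, not a data frame
path = joinpath(mktempdir(), "matrix.csv")

writedlm(path, A, ',')                       # write with a comma delimiter
B = readdlm(path, ',', Float64)              # read back as a Matrix{Float64}
```

The result is a single homogeneous matrix, which is why these functions do not map directly onto tables with differently typed columns.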

CSV reading and writing seems ripe for refactoring, with one super-efficient column reader and writer with options, presumably in DelimitedFiles (because it ships with Julia), and other versions that build on it.

Ajaychat3 · September 30, 2018, 5:04pm

Is it possible to round floats to one fewer significant digit when reading a CSV file using CSV.jl? If yes, how? Sorry if the question seems silly, as I am quite new to Julia.

davidanthoff · September 30, 2018, 6:49pm

Both JuliaDB.loadtable and CSVFiles.jl use TextParse.jl under the hood, so in terms of performance and what files these packages can read, they should all be pretty equivalent. TextParse.jl is very fast on Julia 0.6, but the current version on Julia 1.0 has a major performance regression. I’m optimistic that it can be fixed and currently have a branch that improves things a fair bit, but until that is sorted out, you won’t see the great speed that we had on Julia 0.6.

iwelch · September 30, 2018, 7:31pm

Is CSV’s codebase different from or the same as that of CSVFiles?

If different, would it make sense to unify them?

nalimilan · September 30, 2018, 7:52pm

No, they’re completely different.

Yes, I think the plan is to improve CSV.jl so that it can replace TextParse efficiently in all cases. Help welcome.

davidanthoff · September 30, 2018, 8:11pm

CSVFiles.jl doesn’t have any parser; it really just provides TableTraits.jl and FileIO.jl integration, and all the parsing is done in TextParse.jl. TextParse.jl and CSV.jl are separate codebases.

CSV.jl and TextParse.jl have such distinct internal designs that I don’t think it would make sense to unify them. Also, CSV.jl was really just rewritten from scratch.

quinnj · October 1, 2018, 4:13am

Yes, it wouldn’t be too hard to limit the number of digits read, but hopefully we could fix the performance issue without having to round.
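For rounding after the file has been read, rather than during parsing, a minimal sketch using Base’s `round` with the `sigdigits` keyword might look like this. The sample values and the choice of three significant digits are made up for illustration; in practice `xs` would be a float column from the loaded table.

```julia
# Made-up stand-in for a float column read from a CSV file.
xs = [3.14159, 0.000123456, 98765.4321]

# Round every element to 3 significant digits (broadcast over the vector).
rounded = round.(xs; sigdigits = 3)
```

This changes only the stored values after loading, so it does not affect parsing speed.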
