CSV.jl's CSV write seems slow

xiaodai · December 9, 2017, 4:21am

updated code with suggestions by ScottPJones

The below took about 100~200 seconds but in R’s data.table’s fwrite the same is done in 2 seconds. It is quite a big discrepancy. I am curious as to what need to change before CSV.jl can be as fast?

using DataFrames
const N = 100_000_000; const K = 100
srand(1)
@time df = DataFrame(idstr = rand(["id"*dec(k,10) for k in 1:(N÷K)], N)
        , id = rand(1:K, N)
        , val = rand(1:5,N))
using CSV
@time CSV.write("df.csv", df);

Tamas_Papp · December 9, 2017, 7:41am

If you would like to help in speeding it up, it would be best to profile and open specific issues.

Also, the MWE you are showing seems to lump in compilation time. Perhaps try

kristoffer.carlsson · December 9, 2017, 10:30am

If we are taking about workloads of hundreds of seconds I don’t think compilation time neither BenchmarkTools.jl are very relevant.

ScottPJones · December 9, 2017, 11:47am

What were the values of N and K for your tests?
Did you do the CSV.write twice, to make sure you didn’t include compilation time?
Also, I think dec(val, 3) would be better than using @sprintf for creating the id strings.

xiaodai · December 9, 2017, 11:55am

updated the original code, N is 100 million and K = 100

ScottPJones · December 9, 2017, 12:31pm

OK, on my machine (MacBookPro, 2.7GHz), and making sure GC has been run and everything compiled, I got:

julia> @time df = DataFrame(idstr = rand([@sprintf "id%03d" k for k in 1:(N/K)], N)
                , id = rand(1:K, N)
                , val = rand(1:5,N))
14.233183 seconds (15.56 M allocations: 2.704 GiB, 38.10% gc time)

and using ‘rand([“id”*dec(k,3) for k in 1:(N÷K)]’ (note using integer division also!):

13.178580 seconds (3.01 M allocations: 2.397 GiB, 32.94% gc time)

(the times are pretty consistent, using @sprintf makes the whole operation about 8% slower)

Just comparing the comprehension, the difference is much worse:

julia> f() = ["id"*dec(k,3) for k in 1:(N÷K)]
f (generic function with 1 method)

julia> g() = [@sprintf "id%03d" k for k in 1:(N/K)]
g (generic function with 1 method)

julia> gc(); gc(); @time f();
  0.149172 seconds (3.00 M allocations: 129.700 MiB, 45.45% gc time)

julia> gc(); gc(); @time g();
  2.519681 seconds (15.55 M allocations: 443.213 MiB, 74.81% gc time)

Almost 17x slower to use @sprintf in the comprehension!

I got similar times as you did for the CSV.write, it looks like this should be looked into (compilation times didn’t show up here, CSV is precompiled):

julia> @time CSV.write("df.csv", df);
211.388286 seconds (1.70 G allocations: 64.953 GiB, 15.65% gc time)

nalimilan · December 9, 2017, 4:45pm

Does the slowdown come from string columns or from integer columns (or from both)? I assume that’s with DataFrames 0.11?

xiaodai · December 10, 2017, 8:11am

I learned something new and I have applied your suggestion optimsations to my code. Thanks

xiaodai · December 10, 2017, 8:25am

I tried to remove the string column and the write is still slow at 100 seconds, so I think the speed is proportional to the size of the file to write.

using DataFrames
const N = 100_000_000; const K = 100
srand(1)
@time df = DataFrame( id = rand(1:K, N)
        , val = rand(1:5,N))
using CSV
@time CSV.write("df.csv", df);

Liso · December 10, 2017, 8:56pm

Some kind of strange deja vu …

tk3369 · December 11, 2017, 12:01am

For reference, my performance test (K = 100, N = 10_000_000) shows roughly 15x difference between CSV.jl and data.table. Is there any way to set a larger buffer size in CSV.Sink? I don’t know how though…

Config:

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e (2017-10-24 22:15 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Julia CSV.jl:

julia> @time K=100; N=10_000_000; df = DataFrame( id = rand(1:K, N), val = rand(1:5,N));
  0.000001 seconds (4 allocations: 160 bytes)

julia> @time CSV.write("df.csv", df);
 13.409986 seconds (80.00 M allocations: 2.505 GiB, 9.66% gc time)

julia> @time CSV.write("df.csv", df);
 13.341583 seconds (80.00 M allocations: 2.507 GiB, 9.24% gc time)

julia> @time CSV.write("df.csv", df);
 11.323750 seconds (80.00 M allocations: 2.507 GiB, 10.06% gc time)

julia> @time CSV.write("df.csv", df);
 11.166980 seconds (80.00 M allocations: 2.507 GiB, 8.82% gc time)

R data.table:

> K <- 100
> N <- 10000000
> dt <- data.table(id = sample(1:K, N, replace=TRUE), value = sample(1:5, N, replace=TRUE))
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed                                                                                                              
  0.244   0.656   2.314 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed                                                                                                              
  0.245   0.500   3.363 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.260   0.046   0.491 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.301   0.086   0.717 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.252   0.140   0.729

File sizes are comparable:

shell> ls -l df*.csv
-rw-r--r--  1 tomkwong  staff  49200797 Dec 10 16:00 df.csv
-rw-r--r--  1 tomkwong  staff  49200288 Dec 10 15:42 df2.csv

y4lu · February 2, 2018, 5:40am

You may want to test this out, it’s usually pretty quick

function ch_reader(fil)
  fid = open(fil, "r");
  x = stat(fid).size;
  println("$fil chars $x");
  data = Array(Cuchar, x);
  data[:] = read(fid, Cuchar, x);
  close(fid);
  return(data);
  end;
 
  
function cleanload(fname)
  if(isfile("$(fname)"))
    passch = ch_reader("../input/$(fname)");
    else
    println("file not found!");
    return([]);
    end;
  passstr = String(passch);
  (xtrainx, xtrainh) = readdlm(IOBuffer(passstr), '\t', Any, '\n', header=true);
  println("$(size(xtrainx)) \n $xtrainh");
  return(xtrainx);
  end;

Or if csv is not so important and you have a fixed type

##-----------------
## speed loading with int32's

function xLoad_i32(fname)
  fid = open(fname, "r");
  x0 = read(fid, Int32, 1);
  x1 = read(fid, Int32, x0[1]);
  x2 = zeros(Int32, x1[1], x1[2]);
  x2[:] = read(fid, Int32, x1[1]*x1[2]);
  close(fid);
  return(x2);
  #xmat = reshape(x2, x1[1], x1[2]);
  end;
  
function xSave_i32(fname, mat)
  x = open(fname, "w");
  a = map(Int32, ndims(mat));
  write(x, a);
  for iter = 1:a
    b = size(mat, iter);
    write(x, map(Int32, b));
    end;
  write(x, map(Int32, mat[:]));
  close(x);
  end;

y4lu · February 2, 2018, 6:26am

Using the IOBuff pattern

function cleansave(data, fname)
  b = IOBuffer();
  writedlm(b, data); ##write to buffer first / while processing
  fid = open(fname, "w");
  seek(b, 0);
  write(fid, read(b)); ## then save
  close(fid);
  end;

seemed to be ~4x faster than a plain writecsv over a single rand(1000) vector, 0.2 sec v ~0.05

xiaodai · February 9, 2018, 5:33am

Need to remember to try this. But why doesn’t CSV.write just do this if it’s so much faster?

y4lu · February 9, 2018, 7:19am

Writecsv does seems to use a PipeBuffer(), so I probably didn’t test cleansave() well enough =)
Pretty sure cleanload() is/was a bit faster than readcsv though

xiaodai · February 10, 2018, 7:32am

The code doesn’t work, this line writedlm(b, data) givs error

ERROR: MethodError: no method matching start(::DataFrames.DataFrame)

y4lu · February 10, 2018, 1:08pm

Ah, i’m thinking of native arrays. It shouldn’t be too difficult to pull a dataframe into a type-Any matrix though?

##missing header
function df2mat(dataf)
  ncols = length(dataf.columns);
  nrows = length(dataf.columns[1]);
  ret = Array{Any}(nrows, ncols);
  rmap = x->ismissing(x)?"":x;
  for iter = 1:ncols
    ret[:,iter] = map(rmap, dataf.columns[iter]);
    end;
  return(ret);
  end;

y4lu · February 11, 2018, 2:28am

And for the header, i came up with

rhead = split(replace("$( keys(dataf) )", r"(Symbol)\[|\]|:", ""), ", ");
## return([rhead'; ret]);

bigfile.csv (480,471 KB) - 66.1 seconds with cleansave, size (10,000,000 x 2) - typo? should’ve been 100 mil x 2 i think

a datablobs prototype, using xSave_i32 (above)

testfile.dbb (~200Mb) - 23 sec with naive binary format, size (10 mil x 15) [5 columns of Int32s, Uint32s and strings]

davidanthoff · February 11, 2018, 3:13am

Have you tried CSVFiles.jl? I’d be curious how it stacks up, I have never benchmarked it

xiaodai · February 11, 2018, 3:15am

Is that the same as running FileIO.jl? See my benchmarks here

Topic		Replies	Views
First try seems a bit sluggish Performance	5	616	February 21, 2021
CSV parsing performance Data	4	1397	July 4, 2017
CSV read in is too slow than other language General Usage performance	13	1304	June 21, 2023
[ANN] Fread.jl - read CSVs faster with the help of R's {data.table} Package Announcements performance , data , csv	6	2042	October 9, 2019
Extremely slow CSV / IO? New to Julia csv , io	3	521	January 5, 2022

CSV.jl's CSV write seems slow

Related topics