CSV.jl's CSV write seems slow

Updated code with suggestions by ScottPJones.

The code below took about 100-200 seconds, but R's data.table fwrite does the same in about 2 seconds. That is quite a big discrepancy. I am curious: what needs to change before CSV.jl can be as fast?

using DataFrames
const N = 100_000_000; const K = 100
srand(1)
@time df = DataFrame(idstr = rand(["id"*dec(k,10) for k in 1:(N÷K)], N)
        , id = rand(1:K, N)
        , val = rand(1:5,N))
using CSV
@time CSV.write("df.csv", df);

If you would like to help in speeding it up, it would be best to profile and open specific issues.
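
For example, a minimal profiling sketch (on Julia 0.6 the Profile module lives in Base; on 1.x you would first do using Profile):

@profile CSV.write("df.csv", df)   # collect samples while writing
Profile.print(format = :flat)      # see where the time is spent
Profile.clear()                    # reset before the next run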

Also, the MWE you are showing seems to lump in compilation time. Perhaps try BenchmarkTools.jl.
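
Something like this (a sketch; @btime runs the expression several times and reports the minimum, so compilation time is excluded):

using BenchmarkTools
@btime CSV.write("df.csv", $df)   # $ interpolates the global df to avoid measurement overhead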


If we are talking about workloads of hundreds of seconds, I don’t think either compilation time or BenchmarkTools.jl is very relevant.


What were the values of N and K for your tests?
Did you do the CSV.write twice, to make sure you didn’t include compilation time?
Also, I think dec(val, 3) would be better than using @sprintf for creating the id strings.
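
For reference, dec pads with leading zeros (a quick illustration; dec was removed after 0.6, and string(k, pad = 3) is the 1.x equivalent):

julia> dec(7, 3)
"007"

julia> "id" * dec(7, 3)
"id007"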

I updated the original code; N is 100 million and K = 100.

OK, on my machine (MacBook Pro, 2.7 GHz), and making sure GC had been run and everything was compiled, I got:

julia> @time df = DataFrame(idstr = rand([@sprintf "id%03d" k for k in 1:(N/K)], N)
                , id = rand(1:K, N)
                , val = rand(1:5,N))
14.233183 seconds (15.56 M allocations: 2.704 GiB, 38.10% gc time)

and using rand(["id"*dec(k,3) for k in 1:(N÷K)], N) (note the use of integer division, too!):

13.178580 seconds (3.01 M allocations: 2.397 GiB, 32.94% gc time)

(the times are pretty consistent, using @sprintf makes the whole operation about 8% slower)

Just comparing the comprehension, the difference is much worse:

julia> f() = ["id"*dec(k,3) for k in 1:(N÷K)]
f (generic function with 1 method)

julia> g() = [@sprintf "id%03d" k for k in 1:(N/K)]
g (generic function with 1 method)

julia> gc(); gc(); @time f();
  0.149172 seconds (3.00 M allocations: 129.700 MiB, 45.45% gc time)

julia> gc(); gc(); @time g();
  2.519681 seconds (15.55 M allocations: 443.213 MiB, 74.81% gc time)

Almost 17x slower to use @sprintf in the comprehension!

I got similar times to yours for CSV.write; it looks like this should be looked into (compilation time didn’t show up here, since CSV is precompiled):

julia> @time CSV.write("df.csv", df);
211.388286 seconds (1.70 G allocations: 64.953 GiB, 15.65% gc time)

Does the slowdown come from string columns or from integer columns (or from both)? I assume that’s with DataFrames 0.11?
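
One way to check is to write each column type in isolation (a sketch; each call is run twice so compilation is excluded):

df_int = DataFrame(id = rand(1:K, N), val = rand(1:5, N))
df_str = DataFrame(idstr = rand(["id" * dec(k, 3) for k in 1:(N÷K)], N))
CSV.write("int.csv", df_int); @time CSV.write("int.csv", df_int)
CSV.write("str.csv", df_str); @time CSV.write("str.csv", df_str)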

I learned something new and have applied your suggested optimisations to my code. Thanks!

I tried removing the string column, and the write is still slow at 100 seconds, so I think the speed is proportional to the size of the file being written.

using DataFrames
const N = 100_000_000; const K = 100
srand(1)
@time df = DataFrame( id = rand(1:K, N)
        , val = rand(1:5,N))
using CSV
@time CSV.write("df.csv", df);

Some kind of strange déjà vu

For reference, my performance test (K = 100, N = 10_000_000) shows roughly 15x difference between CSV.jl and data.table. Is there any way to set a larger buffer size in CSV.Sink? I don’t know how though…

Config:

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e (2017-10-24 22:15 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Julia CSV.jl:

julia> @time K=100; N=10_000_000; df = DataFrame( id = rand(1:K, N), val = rand(1:5,N));
  0.000001 seconds (4 allocations: 160 bytes)

julia> @time CSV.write("df.csv", df);
 13.409986 seconds (80.00 M allocations: 2.505 GiB, 9.66% gc time)

julia> @time CSV.write("df.csv", df);
 13.341583 seconds (80.00 M allocations: 2.507 GiB, 9.24% gc time)

julia> @time CSV.write("df.csv", df);
 11.323750 seconds (80.00 M allocations: 2.507 GiB, 10.06% gc time)

julia> @time CSV.write("df.csv", df);
 11.166980 seconds (80.00 M allocations: 2.507 GiB, 8.82% gc time)

R data.table:

> K <- 100
> N <- 10000000
> dt <- data.table(id = sample(1:K, N, replace=TRUE), value = sample(1:5, N, replace=TRUE))
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed                                                                                                              
  0.244   0.656   2.314 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed                                                                                                              
  0.245   0.500   3.363 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.260   0.046   0.491 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.301   0.086   0.717 
> system.time(fwrite(dt, "df2.csv"))
   user  system elapsed 
  0.252   0.140   0.729 

File sizes are comparable:

shell> ls -l df*.csv
-rw-r--r--  1 tomkwong  staff  49200797 Dec 10 16:00 df.csv
-rw-r--r--  1 tomkwong  staff  49200288 Dec 10 15:42 df2.csv

You may want to test this out; it’s usually pretty quick:

function ch_reader(fil)
    fid = open(fil, "r")
    x = stat(fid).size
    println("$fil chars $x")
    data = read(fid, Cuchar, x)   # read the whole file as raw bytes
    close(fid)
    return data
end
 
  
function cleanload(fname)
    if !isfile("../input/$(fname)")   # check the same path we read from
        println("file not found!")
        return []
    end
    passch = ch_reader("../input/$(fname)")
    passstr = String(passch)
    (xtrainx, xtrainh) = readdlm(IOBuffer(passstr), '\t', Any, '\n', header = true)
    println("$(size(xtrainx)) \n $xtrainh")
    return xtrainx
end
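
Hypothetical usage (cleanload expects tab-delimited data with a header row, and looks for the file under ../input/):

xtrain = cleanload("train.tsv")   # prints the size and header, returns the data matrix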

Or, if CSV is not so important and you have a fixed element type:

##-----------------
## speed loading with Int32s

function xLoad_i32(fname)
    fid = open(fname, "r")
    x0 = read(fid, Int32, 1)          # number of dimensions
    x1 = read(fid, Int32, x0[1])      # the dimension sizes
    x2 = zeros(Int32, x1[1], x1[2])   # assumes a 2-d matrix
    x2[:] = read(fid, Int32, x1[1] * x1[2])
    close(fid)
    return x2
end

function xSave_i32(fname, mat)
    x = open(fname, "w")
    a = Int32(ndims(mat))             # Int32(...), not map, for a scalar
    write(x, a)                       # number of dimensions first
    for iter = 1:a
        write(x, Int32(size(mat, iter)))   # then each dimension size
    end
    write(x, map(Int32, mat[:]))      # then the data itself
    close(x)
end
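
A quick roundtrip to sanity-check the format (a sketch with a made-up test matrix):

m = rand(Int32(1):Int32(100), 1000, 15)   # hypothetical Int32 matrix
xSave_i32("test.dbb", m)
m2 = xLoad_i32("test.dbb")
m == m2                                   # should be true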

Using the IOBuffer pattern:

function cleansave(data, fname)
    b = IOBuffer()
    writedlm(b, data)     # write to the buffer first / while processing
    fid = open(fname, "w")
    seek(b, 0)
    write(fid, read(b))   # then save in one go
    close(fid)
end
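
Buffering in memory first turns many small file writes into one large sequential write, which is the point of this pattern. Hypothetical usage:

cleansave(rand(1000), "vec.tsv")   # a single column of Float64s, as in the timing below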

This seemed to be ~4x faster than a plain writecsv over a single rand(1000) vector: ~0.2 s vs ~0.05 s.


Need to remember to try this. But why doesn’t CSV.write just do this if it’s so much faster?

writecsv does seem to use a PipeBuffer(), so I probably didn’t test cleansave() well enough =)
Pretty sure cleanload() is/was a bit faster than readcsv, though.

The code doesn’t work; this line, writedlm(b, data), gives an error:

ERROR: MethodError: no method matching start(::DataFrames.DataFrame)

Ah, I’m thinking of native arrays. It shouldn’t be too difficult to pull a DataFrame into a type-Any matrix, though:

## missing header
function df2mat(dataf)
    ncols = length(dataf.columns)
    nrows = length(dataf.columns[1])
    ret = Array{Any}(nrows, ncols)
    rmap = x -> ismissing(x) ? "" : x   # replace missings with empty strings
    for iter = 1:ncols
        ret[:, iter] = map(rmap, dataf.columns[iter])
    end
    return ret
end

And for the header, I came up with:

rhead = split(replace("$(keys(dataf))", r"(Symbol)\[|\]|:", ""), ", ")
## return([rhead'; ret])
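
Putting the pieces together (hypothetical usage; df is the DataFrame from the earlier examples, with rhead built from it as above):

mat = df2mat(df)
cleansave([rhead'; mat], "df.tsv")   # writedlm output is tab-delimited by default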

bigfile.csv (480,471 KB) - 66.1 seconds with cleansave, size (10,000,000 x 2) - typo? it should have been 100 million x 2, I think

A datablobs prototype, using xSave_i32 (above):

testfile.dbb (~200 MB) - 23 seconds with a naive binary format, size (10 million x 15) [5 columns of Int32s, UInt32s, and strings]

Have you tried CSVFiles.jl? I’d be curious how it stacks up; I have never benchmarked it :slight_smile:

Is that the same as running FileIO.jl? See my benchmarks here