Yes, if you save a CSV file with FileIO.jl, it will use CSVFiles.jl under the hood. TextParse.jl is actually not involved in that case; it only handles the reading of files. I rolled the writing part of CSVFiles.jl myself.
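For anyone who hasn't used that path: a minimal sketch of what it looks like (assumes CSVFiles.jl is installed so that FileIO can dispatch on the `.csv` extension, and that `df` is any table-like value such as a DataFrame):

```julia
using FileIO, CSVFiles, DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])
save("out.csv", df)   # FileIO picks CSVFiles.jl based on the file extension
```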
Do I read the chart there correctly that the CSV writing code in CSVFiles.jl is currently the fastest way to write CSV files in Julia? Yay! I’m actually quite surprised that it is not way, way slower than the various binary options like Feather.jl, JLD.jl etc. (yes, it is slower, but not by orders of magnitude).
Now, the fwrite performance of course is crazy… How many cores do you have on your machine? I read the blog post about how they do it, and I don’t think we could implement that kind of strategy with current Julia; we would really need much stronger threading support…
I haven’t run the benchmarks. But fwrite makes use of your cores, so the more you have, the faster things should get. And yet, 4 cores is not that many, so it just seems really well done…
Another interesting test would be fwrite with the nThread=1 option. That would switch off the use of multiple cores and would give us an idea of how far we are from a really fast serial implementation.
Of course Julia’s threading isn’t as well developed, but from what I can see, it feels like Julia could implement some of it using IOBuffer. I have never done any of this, but here is what I have in mind:
The blog post mentions writing N independent buffers and then, once all buffers have been filled, writing them out to disk sequentially.
I think this can be simulated in Julia with the pseudo-code below (I don’t claim every bit of syntax is right):
vio = Vector{IOBuffer}(nthreads())
# break "work" into chunks so that each chunk contains `nthreads()` pieces of work
work_chunks = breakup(work)
csvfile = open("path/to/out.csv", "w")
for wc in work_chunks
    @threads for i = 1:nthreads()
        local_io_buffer = IOBuffer()
        # each thread writes into its own private buffer until that buffer is full
        write_to_buffer!(local_io_buffer, wc[threadid()])
        vio[threadid()] = local_io_buffer
    end
    # by here each thread should have done one piece of work; it could happen that
    # one thread did two pieces, but that should be extremely rare
    write2csv(csvfile, vio)
end
close(csvfile)
Each thread takes care of writing to its own buffer, and a serial part then writes everything out in order. This seems to be the approach mentioned in the post.
If the above were turned into proper Julia code, it might work. There is no obvious reason why it shouldn’t, I think; now it’s up to someone to spend the time to try…
I’d be surprised if that construct gave the same performance characteristics that are described in the blog post. The OpenMP ordered construct is quite different from what you suggest above, and I believe quite a bit more efficient. My understanding is that the @threads macro is really best used with loops that have far more elements than you have cores; it then distributes those iterations over the cores. I’d be surprised if @threads performed well with loops that have only as many elements as you have threads.
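To illustrate the usual pattern (a minimal sketch, not taken from any of the benchmarks above): @threads statically splits the iteration range across the available threads, so it pays off when the loop has many more iterations than there are threads:

```julia
using Base.Threads

xs = zeros(10^6)
# the 10^6 iterations are chunked across nthreads() threads
@threads for i in eachindex(xs)
    xs[i] = sqrt(i)
end
```

With only nthreads() iterations, as in the pseudo-code earlier in the thread, each thread gets exactly one iteration and the per-loop scheduling overhead is paid for very little work.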
I don’t think we need a thread safe IO system for that algorithm; the key point is that in that example OpenMP makes sure all IO is serialized. But we would need a richer threading story that supports more of the advanced OpenMP-style constructs.
I think the more worrisome part for writing out CSV data is that converting numbers to strings is not thread safe. The grisu code has a couple of globals that would need to be locked, or have per-thread copies:

    const DIGITS = Vector{UInt8}(uninitialized, 309 + 17)

and

    const BIGNUMS = [Bignums.Bignum(), Bignums.Bignum(), Bignums.Bignum(), Bignums.Bignum()]
The grisu DIGITS buffer also seems to be reused in the Base printf code.
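If locking turned out to be too costly, one way to get per-thread copies would be an array of scratch buffers indexed by threadid(). This is just a sketch of the idea, not how Base actually does it; the 309+17 size is copied from the DIGITS definition above:

```julia
using Base.Threads

# one scratch buffer per thread, allocated up front
const DIGITS_TLS = [Vector{UInt8}(uninitialized, 309 + 17) for _ in 1:nthreads()]

# each thread only ever touches its own buffer
digits_buffer() = DIGITS_TLS[threadid()]
```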