Questions about CSV (how to write to CSV faster)

I have a huge DataFrame with 26M rows. How do I write it to a CSV file faster while using less memory?

julia> pdpdpd=vcat(pd,pd,pd)
26590302×6 DataFrame
      Row │ chr     position   pdr      discordant  sum    allsum
          │ String  Int64      Float64  Int64       Int64  Int64
──────────┼───────────────────────────────────────────────────────
        1 │ chr1        10542      0.0           0      1       1
        2 │ chr1        10563      0.0           0      1       1
        3 │ chr1        10571      0.0           0      1       1
        4 │ chr1        10577      0.0           0      1       1
        5 │ chr1        10579      0.0           0      1       1
        6 │ chr1        10589      0.0           0      1       1
        7 │ chr1        10609      1.0           1      1       1
        8 │ chr1        10617      1.0           1      1       1
        9 │ chr1        10620      1.0           1      1       1
       10 │ chr1        10633      1.0           1      1       1
       11 │ chr1        10636      1.0           1      1       1
       12 │ chr1        10638      1.0           1      1       1
       13 │ chr1        10641      1.0           1      1       1
       14 │ chr1        10644      1.0           1      1       1
       15 │ chr1        10650      1.0           1      1       1
       16 │ chr1        10660      1.0           1      1       1
       17 │ chr1        10662      1.0           1      1       1
       18 │ chr1        10665      1.0           1      1       1
       19 │ chr1        10667      1.0           1      1       1
       20 │ chr1        10670      1.0           1      1       1
       21 │ chr1        13303    NaN             0      0       1
       22 │ chr1        13668    NaN             0      0       1
       23 │ chr1        13694    NaN             0      0       1
    ⋮     │   ⋮         ⋮         ⋮         ⋮         ⋮      ⋮
 26590281 │ chr1    248944853      1.0           1      1       1
 26590282 │ chr1    248944864      1.0           1      1       1
 26590283 │ chr1    248944868      1.0           1      1       1
 26590284 │ chr1    248944875      1.0           1      1       1
 26590285 │ chr1    248944889      1.0           1      1       1
 26590286 │ chr1    248944897      1.0           1      1       1
 26590287 │ chr1    248944900      1.0           1      1       1

julia> @time CSV.write("a.csv",pdpdpd)
 40.550157 seconds (718.14 M allocations: 17.845 GiB, 11.81% gc time, 0.28% compilation time)
"a.csv"

It takes a lot of time and consumes a lot of memory. What should I do? Thanks.

Does it have to be CSV (because you send it somewhere that needs a CSV), or do you just want to write the data and read it back quickly? If it can be something else, consider writing an Arrow file using Arrow.jl.
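For example, a minimal sketch (assuming the whole DataFrame fits in memory, as in your example; "a.arrow" is just a placeholder file name):

using Arrow, DataFrames

# write the DataFrame in the Arrow format (binary, column-oriented,
# typically much faster to write and to re-read than CSV)
Arrow.write("a.arrow", pdpdpd)

# read it back later and wrap it in a DataFrame
df = DataFrame(Arrow.Table("a.arrow"))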


Sorry, I just need to generate a CSV. How can I make it faster?

So allocations and GC were not a huge bottleneck here. 12% isn't nothing, but it's not a very attractive target for optimization.

A quick back-of-the-envelope estimate: 26M rows at roughly 100 bytes per row written in 40 seconds is around 65 MB/s, which, depending on your drive speed, could mean the hardware is the bottleneck. Probably not if you have an SSD, but spinning drives can top out in that range, or at least at around 2x that speed. So I think you should benchmark your hardware before trying to make the code faster.

Once you know your hardware limit, compare your throughput to it.

A useful test is to use dd on Linux.
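For example (a rough sketch; the file name and size here are arbitrary, and conv=fdatasync makes dd flush to disk so the result isn't dominated by the page cache):

# write ~2 GiB of zeros and force it to disk to estimate sequential write speed
run(`dd if=/dev/zero of=ddtest.bin bs=1M count=2048 conv=fdatasync`)
rm("ddtest.bin")  # clean up the test file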


If you need to produce a CSV because that’s what the consumer needs, then you have no choice; that’s why I asked.

You CAN read an individual CSV file in parallel (I believe that’s done already), but I’m not sure you can write one in parallel in general (or at least that Julia packages do).

Does Julia at least do that with Arrow.jl (or some other formats)? I believe some formats are designed for parallel writing. CSV wasn’t, but from a quick search it seems it can still be done (at least for a file like yours, judging by its data, assuming it’s typical):
how to write content of df into csv file using multiprocessing in python - splunktool
Writing to CSV-file from multiple threads
CSV / File writer conflict during parallel execution - KNIME Analytics Platform - KNIME Community Forum

If you have an SSD then writing (or reading) in parallel could help, but on a spinning disk it will likely kill performance.
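If you want to experiment with that yourself, here is a rough sketch of one chunked approach (the function name and the ".part" temporary files are made up for illustration; nothing in CSV.jl does this for you, and whether it helps at all depends on your disk): split the rows across threads, write each chunk to its own temporary file, then concatenate the pieces in order.

using CSV, DataFrames
using Base.Threads: @threads, nthreads

# hypothetical helper: write `df` to `path` by writing row chunks on
# separate threads and then concatenating the temporary files in order
function threaded_csv_write(path, df; nchunks = nthreads())
    ranges = collect(Iterators.partition(1:nrow(df), cld(nrow(df), nchunks)))
    parts = [string(path, ".part", i) for i in eachindex(ranges)]
    @threads for i in eachindex(ranges)
        # only the first chunk gets the header line; @view avoids copying the chunk
        CSV.write(parts[i], @view(df[ranges[i], :]); writeheader = i == 1)
    end
    open(path, "w") do io
        for p in parts
            write(io, read(p))   # append each piece
            rm(p)                # remove the temporary chunk file
        end
    end
    return path
end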

How are they gonna use it, Excel??

R, Python, Stata, SAS, some custom C code, COBOL legacy systems, whatever.


Except for SAS and COBOL, I’m sure all of them can read an .arrow file.

Also, consider not adding Julia into this mix; this sounds terrible LOL

When you control both the writer and the reader, you can choose the format. When the reader is out of your control… say it’s a different department that doesn’t want to change their code, or it’s a customer, or whatever… you provide what the reader needs.

I agree with you that Arrow files are better for fast reading and writing. But it’s not always an option to change formats.
