Questions about CSV (how to write to CSV faster)

I have a huge DataFrame with 26M rows. How do I write it to a CSV file faster while using less memory?

julia> pdpdpd=vcat(pd,pd,pd)
26590302×6 DataFrame
      Row │ chr     position   pdr      discordant  sum    allsum
          │ String  Int64      Float64  Int64       Int64  Int64
──────────┼───────────────────────────────────────────────────────
        1 │ chr1        10542      0.0           0      1       1
        2 │ chr1        10563      0.0           0      1       1
        3 │ chr1        10571      0.0           0      1       1
        4 │ chr1        10577      0.0           0      1       1
        5 │ chr1        10579      0.0           0      1       1
        6 │ chr1        10589      0.0           0      1       1
        7 │ chr1        10609      1.0           1      1       1
        8 │ chr1        10617      1.0           1      1       1
        9 │ chr1        10620      1.0           1      1       1
       10 │ chr1        10633      1.0           1      1       1
       11 │ chr1        10636      1.0           1      1       1
       12 │ chr1        10638      1.0           1      1       1
       13 │ chr1        10641      1.0           1      1       1
       14 │ chr1        10644      1.0           1      1       1
       15 │ chr1        10650      1.0           1      1       1
       16 │ chr1        10660      1.0           1      1       1
       17 │ chr1        10662      1.0           1      1       1
       18 │ chr1        10665      1.0           1      1       1
       19 │ chr1        10667      1.0           1      1       1
       20 │ chr1        10670      1.0           1      1       1
       21 │ chr1        13303    NaN             0      0       1
       22 │ chr1        13668    NaN             0      0       1
       23 │ chr1        13694    NaN             0      0       1
    ⋮     │   ⋮         ⋮         ⋮         ⋮         ⋮      ⋮
 26590281 │ chr1    248944853      1.0           1      1       1
 26590282 │ chr1    248944864      1.0           1      1       1
 26590283 │ chr1    248944868      1.0           1      1       1
 26590284 │ chr1    248944875      1.0           1      1       1
 26590285 │ chr1    248944889      1.0           1      1       1
 26590286 │ chr1    248944897      1.0           1      1       1
 26590287 │ chr1    248944900      1.0           1      1       1

julia> @time CSV.write("a.csv",pdpdpd)
 40.550157 seconds (718.14 M allocations: 17.845 GiB, 11.81% gc time, 0.28% compilation time)
"a.csv"

It takes a lot of time and consumes a lot of memory. What should I do? Thanks.

Does it have to be CSV (because you send it somewhere that needs a CSV), or do you just want to write the data and read it back quickly? If it can be something else, consider writing an Arrow file using Arrow.jl.
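For example, a minimal sketch (assuming the whole DataFrame fits in memory, as in your example; "a.arrow" is just a placeholder file name):

using Arrow, DataFrames

# write the DataFrame in the Arrow format (binary, column-oriented,
# typically much faster to write and to re-read than CSV)
Arrow.write("a.arrow", pdpdpd)

# read it back later and wrap it in a DataFrame
df = DataFrame(Arrow.Table("a.arrow"))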


Sorry, I just need to generate a CSV. How can I make it faster?

So allocations and GC were not a huge bottleneck here. 12% isn't nothing, but it's not a very attractive target for optimization.

A quick back-of-the-envelope estimate: 26M rows at roughly 100 bytes per row written in 40 seconds is around 65 MB/s, which, depending on your drive speed, could mean the hardware is the bottleneck. Probably not if you have an SSD, but spinning drives can top out in that range, or at least at around 2x that speed. So I think you should benchmark your hardware before trying to make the code faster.

Once you know your hardware limit, compare your throughput to it.

A useful test is to use dd on Linux.
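For example (a rough sketch; the file name and size here are arbitrary, and conv=fdatasync makes dd flush to disk so the result isn't dominated by the page cache):

# write ~2 GiB of zeros and force it to disk to estimate sequential write speed
run(`dd if=/dev/zero of=ddtest.bin bs=1M count=2048 conv=fdatasync`)
rm("ddtest.bin")  # clean up the test file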


If you need to produce a CSV because that’s what the consumer needs, then you have no choice; that’s why I asked.

You CAN read an individual CSV file in parallel (I believe that’s done already), but I’m not sure you can write one in parallel in general (or at least that Julia packages do).

Does Julia at least do that with Arrow.jl (or some other formats)? I believe some formats are designed for parallel writing. CSV wasn’t, but from a quick search it seems it can still be done (at least for a file like yours, judging by its data, assuming it’s typical):
how to write content of df into csv file using multiprocessing in python - splunktool
Writing to CSV-file from multiple threads
CSV / File writer conflict during parallel execution - KNIME Analytics Platform - KNIME Community Forum

If you have an SSD then writing (or reading) in parallel could help, but on a spinning disk it will likely kill performance.
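If you want to experiment with that yourself, here is a rough sketch of one chunked approach (the function name and the ".part" temporary files are made up for illustration; nothing in CSV.jl does this for you, and whether it helps at all depends on your disk): split the rows across threads, write each chunk to its own temporary file, then concatenate the pieces in order.

using CSV, DataFrames
using Base.Threads: @threads, nthreads

# hypothetical helper: write `df` to `path` by writing row chunks on
# separate threads and then concatenating the temporary files in order
function threaded_csv_write(path, df; nchunks = nthreads())
    ranges = collect(Iterators.partition(1:nrow(df), cld(nrow(df), nchunks)))
    parts = [string(path, ".part", i) for i in eachindex(ranges)]
    @threads for i in eachindex(ranges)
        # only the first chunk gets the header line; @view avoids copying the chunk
        CSV.write(parts[i], @view(df[ranges[i], :]); writeheader = i == 1)
    end
    open(path, "w") do io
        for p in parts
            write(io, read(p))   # append each piece
            rm(p)                # remove the temporary chunk file
        end
    end
    return path
end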

How are they gonna use it, Excel??

R, Python, Stata, SAS, some custom C code, COBOL legacy systems, whatever.


Except for SAS and COBOL, I’m sure all of them can read an .arrow file.

Also, consider not adding Julia into this mix; this sounds terrible LOL

When you control both the writer and the reader, you can choose the format. When the reader is out of your control… say it’s a different department that doesn’t want to change their code, or it’s a customer, or whatever… you provide what the reader needs.

I agree with you that Arrow files are better for fast reading and writing. But it’s not always an option to change formats.
