CSV: problem writing big dataframes

Hi,
I have to read and write some big dataframes (100–500 MB). For an 80 MB DataFrame, I need 220 seconds to write it. And for a 500 MB dataframe, after waiting one hour I gave up…

I use the following code:

julia> using CSV

julia> using DataFrames

julia> M = CSV.read("test.tsv"; delim='\t', header=true)
13893×1001 DataFrame. Omitted printing of 992 columns

julia> @time CSV.write("df.csv", M ; delim ='\t')
220.218516 seconds (13.83 M allocations: 880.227 MiB, 0.34% gc time)
"df.csv"

test.tsv

Hmmmm, that doesn’t seem great; is there any way you can share a sample of the DataFrame or at least the schema? I can try to figure out what’s causing the bottleneck.

What version of Julia are you using? I think CSV is being phased out in favor of DelimitedFiles. On my machine, using the file you have attached, I get:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)
Environment:
  JULIA = /project/soft/linux64/src/julia-1.0.0
julia> using DelimitedFiles
julia> @time x = readdlm("test.tsv");
 83.725932 seconds (396.76 M allocations: 9.211 GiB, 28.83% gc time)
julia> @time writedlm("test1.tsv", x)
  5.158386 seconds (1.54 M allocations: 152.232 MiB)

Not at all, DelimitedFiles is just the same readdlm function that was provided by Julia Base in previous versions. It’s intended mainly for reading homogeneous arrays. CSV.jl is a more powerful package which should be used for reading tabular data (like data frames).

Note that there is also the (AFAIU) independent CSVFiles.jl.

Sorry, I forgot to say that my system is up to date: Julia 1.0.1 with the latest packages.

@quinnj at the end of my post there is a link to try the 80 MB dataframe. The 500 MB dataframe has the same layout, it is just bigger.

Surprisingly, reading large dataframes is much faster than writing them.

@tanhevg, thank you. When I have simple CSV files, I use DelimitedFiles because I know it is very fast. But for DataFrames, I use CSV.

Hi!

Today I did a test with the new Julia 1.0.2 and compared CSV and CSVFiles with the “small” test.tsv file. Both are very slow. Just for fun I also did a test with Julia 0.6.2 and it is much faster, so there is roughly a 25x speed regression when writing the same data. So for big dataframes, I have no choice but to stay with the old Julia 0.6 version.

with Julia 1.0.2

julia> using DataFrames

julia> using CSV

julia> @time M = CSV.read("test.tsv"; delim='\t', header=true)
 11.363795 seconds (62.35 M allocations: 1.948 GiB, 7.47% gc time)
 13893×1001 DataFrame. Omitted printing of 992 columns

julia> @time CSV.write("df.csv", M ; delim ='\t')
206.409265 seconds (13.81 M allocations: 879.979 MiB, 0.17% gc time)
"df.csv"


julia> using DataFrames

julia> using CSVFiles

julia> @time M = DataFrame(load("test.tsv"))
 20.100596 seconds (67.96 M allocations: 3.183 GiB, 4.49% gc time)
13893×1001 DataFrame. Omitted printing of 992 columns

julia> @time save("df2.tsv", M)
220.085667 seconds (20.76 M allocations: 1.296 GiB, 0.24% gc time)

with Julia 0.6.2



               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames

julia> using CSV
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/fred/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> @time M = CSV.read("test.tsv"; delim='\t', header=true);
 14.923997 seconds (48.01 M allocations: 1.289 GiB, 1.79% gc time)
 
 julia> @time CSV.write("df.csv", M ; delim ='\t')
  9.370022 seconds (82.44 M allocations: 1.690 GiB, 3.53% gc time)

Looks like it could be limited by a missing read buffer in readdlm, and similarly by the write buffer in CSV.write.
Is this an OS thing, buffering one or the other?

readf = open("test.tsv", "r")
readbuff = IOBuffer(read(readf));  ##timing point read1 (from Disk)
close(readf);
x = readdlm(readbuff); ##timing point read2 (from memory)

x2 = DataFrame(x);

writebuff = IOBuffer();
CSV.write(writebuff, x2; delim = '\t'); ##timing point write 1 (to memory)
writef = open("testout.tsv" , "w");
write(writef, take!(writebuff)); ## timing point write 2 (to disk)
close(writef);

@fred
A couple of fixes, but either half should work on its own, e.g. writebuff and so on paired with a CSV.read table.
Try using your M in place of the x2 above.

Thanks @y4lu. I did not manage to test your code except for reading the file.

julia> using DelimitedFiles

julia> readf = open("test.tsv")
IOStream(<file test.tsv>)

julia> readbuff = IOBuffer(read(readf));

julia> close(readf);

julia> @time x = readdlm(readbuff);
  6.580607 seconds (64.12 M allocations: 1.832 GiB, 8.59% gc time)

julia> writebuff = IOBuffer();
julia> using CSV

julia> @time CSV.write(writebuff, x; delim = '\t');
ERROR: ArgumentError: no default `Tables.rows` implementation for type: Array{Any,2}

julia> writef = open("testout.tsv");
ERROR: SystemError: opening file testout.tsv: Aucun fichier ou dossier de ce type
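
For reference, here is a hedged sketch of how the write half could be made to run: CSV.write needs a table rather than a bare Array, so the matrix is wrapped in a DataFrame as in @y4lu’s snippet above, and the output file has to be opened in write mode ("w"), which the last error is about.

using DelimitedFiles, DataFrames, CSV

x  = readdlm(IOBuffer(read("test.tsv")))   # read the file into a matrix, as above
x2 = DataFrame(x)                          # wrap the matrix in a DataFrame so CSV.write accepts it

writebuff = IOBuffer()
CSV.write(writebuff, x2; delim = '\t')     # write to memory first

open("testout.tsv", "w") do writef         # "w" opens the file for writing
    write(writef, take!(writebuff))        # then dump the buffer to disk
end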

I wonder whether this is somehow related to the relatively large number of columns… Both CSVFiles.jl and CSV.jl will process this data as iterators of NamedTuple (i.e. as rows), and I’ve never been sure whether there is some cut-off at which a named tuple with a very large number of fields becomes inefficient.
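
As a rough illustration of that hypothesis (the column names below are made up), a single row of a 1001-column table materialised as a NamedTuple looks like this:

cols = Tuple(Symbol.("col", 1:1001))               # hypothetical column names
row  = NamedTuple{cols}(ntuple(i -> rand(), 1001)) # one row as a NamedTuple with 1001 fields
typeof(row)                                        # a very wide NamedTuple type, which could be what stresses the compiler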

To write my big DataFrames, I found this solution; it is not elegant but almost 10x faster:

julia> @time printDF(M)
write results
 31.063968 seconds (129.47 M allocations: 42.875 GiB, 23.42% gc time)

function printDF(df)
    println("write results")
    open("df.csv", "w") do f
        write(f, join(names(df), "\t") * "\n") # print header
        for row in 1:size(df, 1)
            line = df[row, 1]                  # assumes the first column holds strings
            for col in 2:size(df, 2)
                if typeof(df[row, col]) == String
                    c = df[row, col]
                else
                    c = string(df[row, col])
                end
                line = line * "\t" * c         # merge all cells of one row
            end
            write(f, "$line\n")                # print df line by line
        end
    end
end

@Fred I think the idea is that you shouldn’t have to do this to get speed. It should already be in a package. Maybe contribute this version back.

@xiaodai I completely agree with you, but my solution is not good enough to be in a package, because it assumes that the dataframe has only 2 types: strings and floats (it even assumes that the row names are strings).
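
For what it’s worth, here is a hedged sketch of a slightly more general variant: it calls string() on every cell, so it does not assume strings and floats (write_tsv is a made-up name, not part of any package).

using DataFrames

function write_tsv(path, df)
    open(path, "w") do f
        println(f, join(names(df), '\t'))                     # header
        for row in eachrow(df)
            println(f, join((string(x) for x in row), '\t'))  # one line per row, any cell type
        end
    end
end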

You can also do show(f, MIME("text/csv"), df). This calls a simple function similar to the custom one you showed above.
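
For example, a minimal sketch of that approach (df below is just a stand-in for the DataFrame to save):

using DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])  # stand-in for the real DataFrame
open("df.csv", "w") do f
    show(f, MIME("text/csv"), df)             # the text/csv show method from DataFrames
end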

Finally, what is the suggested package/function to read and write large datasets containing tabular data?

@Juan reading is not a problem, but writing can take too long (more than one hour in my case) if you have many columns. So the solution is to use a custom function such as mine or @nalimilan’s.

If you don’t need DataFrames, I recommend using readdlm / writedlm; it is very fast.

julia> using DelimitedFiles

julia> x = collect(1:10); y = collect(11:20);  # example columns, so the snippet is runnable

julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y])
       end

julia> readdlm("delim_file.txt", '\t', Int, '\n')

It’s strange; with R’s data.table it is the opposite: it needs much more time to read, because it first has to figure out the structure and what type of data is contained in each column, and it needs to deal with missings.

Hey,

This is a niche problem that not many would face, but I am facing this same problem. I was using CSV.jl to save dataframes and it was going fine. My scientific simulation has huge dataframes (a million rows). I was able to save them because each element was not that big. But my latest simulation produced a dataframe that was 3 GB in RAM. CSV.write just was not able to save it, and I waited for a couple of hours.

I have already optimised my data and I need all of it for the next step.

I was wondering if there is any way to speed this up or any new library that would be able to reduce my time.

I do know the alternative would be to write each row as it is created, but that would involve rewriting my code and I would like a shorter way.
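
For reference, a hedged sketch of that incremental approach using CSV.write’s append keyword (the chunks below are stand-ins for whatever the simulation produces):

using CSV, DataFrames

file = "results.csv"                                           # hypothetical output path
chunks = (DataFrame(a = rand(10), b = rand(10)) for _ in 1:4)  # stand-in for simulation output

for (i, chunk) in enumerate(chunks)
    # the first write creates the file and the header; later writes append rows only
    CSV.write(file, chunk; append = i > 1)
end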

Thanks.

Please make an MWE that reproduces the problem, and ideally report it as an issue in the CSV repository. I cannot reproduce it with a simple example, e.g.

using CSV, DataFrames
N = 10^8
df = DataFrame(; a = rand(Int, N), b = rand(Float64, N), c = rand('a':'z', N))
dst = "/tmp/data.csv"
@time CSV.write(dst, df)
@show filesize(dst) / 2^30 # file size in GiB

takes 45s on my laptop, creating an almost 4GB file (using an SSD). I find this reasonable.