CSV : problem to write big dataframes

csv

#1

Hi,
I have to read and write some big dataframes (100-500Mo). For a 80 Mo DataFrame, I need 220 sec to write it. And for a 500 Mo dataframe , after one hour waiting I gave up…

I use the following code :

julia> using CSV

julia> using DataFrames

julia> M = CSV.read("test.tsv"; delim='\t', header=true)
13893×1001 DataFrame. Omitted printing of 992 columns

julia> @time CSV.write("df.csv", M ; delim ='\t')
220.218516 seconds (13.83 M allocations: 880.227 MiB, 0.34% gc time)
"df.csv"

test.tsv


#2

Hmmmm, that doesn’t seem great; is there any way you can share a sample of the DataFrame or at least the schema? I can try to figure out what’s causing the bottleneck.


#3

What version of Julia are you using? I think CSV is being phased out in favor of DelimitedFiles. On my machine, using the file you have attached, I get:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)
Environment:
  JULIA = /project/soft/linux64/src/julia-1.0.0
julia> using DelimitedFiles
julia> @time x = readdlm("test.tsv");
 83.725932 seconds (396.76 M allocations: 9.211 GiB, 28.83% gc time)
julia> @time writedlm("test1.tsv", x)
  5.158386 seconds (1.54 M allocations: 152.232 MiB)

#4

Not at all, DelimitedFiles is just the same readdlm function which was provided by Julia Base in previous versions. It’s intended mainly at reading homogeneous arrays. CSV.jl is a more powerful package which should be used when reading tabular data (like data frames).


#5

Note that there is also the (AFAIU) independent CSVFiles.jl.


#6

Sorry, I forgot to say that I have my system up to date : Julia 1.0.1 with latest packages.

@quinnj at the end of my post there is a link to try the 80 Mo dataframe :slight_smile: The 500 Mo dataframe has the same outline, it is just bigger.

Surprisingly reading large dataframes is very much faster than writing them.


#7

@tanhevg, thank you, when I have simple csv files, I use DelimitedFiles because I do know that it is very fast. But, for DataFrames, I use CSV.


#8

Hi !

Today I did a test with the new Julia 1.0.2 and compared CSV and CSVFiles witht the “small” test.tsv file. Both are very slow. Just for fun I did a test with Julia 0.6.2 and it is much faster, so there is a 25x times regression speed to write the same data. So for big dataframes, I have no choice than staying with the old Julia 0.6 version :roll_eyes:

with Julia 1.0.2

julia> using DataFrames

julia> using CSV

julia> @time M = CSV.read("test.tsv"; delim='\t', header=true)
 11.363795 seconds (62.35 M allocations: 1.948 GiB, 7.47% gc time)
 13893×1001 DataFrame. Omitted printing of 992 columns

julia> @time CSV.write("df.csv", M ; delim ='\t')
206.409265 seconds (13.81 M allocations: 879.979 MiB, 0.17% gc time)
"df.csv"


julia> using DataFrames

julia> using CSVFiles

 @time M = DataFrame(load("test.tsv"))
 20.100596 seconds (67.96 M allocations: 3.183 GiB, 4.49% gc time)
13893×1001 DataFrame. Omitted printing of 992 columns

@time save("df2.tsv", M)
220.085667 seconds (20.76 M allocations: 1.296 GiB, 0.24% gc time)

with Julia 0.6.2



               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using DataFrames

julia> using CSV
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/fred/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> @time M = CSV.read("test.tsv"; delim='\t', header=true);
 14.923997 seconds (48.01 M allocations: 1.289 GiB, 1.79% gc time)
 
 julia> @time CSV.write("df.csv", M ; delim ='\t')
  9.370022 seconds (82.44 M allocations: 1.690 GiB, 3.53% gc time)

#9

Looks like it could be missing or limited by the read buffer in readdlm, similar for the write buffer in CSV write.
Is this an OS thing, either buffering one or the other?

readf = open("test.tsv", "r")
readbuff = IOBuffer(read(readf));  ##timing point read1 (from Disk)
close(readf);
x = readdlm(readbuff); ##timing point read2 (from memory)

x2 = DataFrame(x);

writebuff = IOBuffer();
CSV.write(writebuff, x2; delim = '\t'); ##timing point write 1 (to memory)
writef = open("testout.tsv" , "w");
write(writef, take!(writebuff)); ## ##timing point write 2 (to disk)
close(writef);

@fred
couple of fixes, but either half should work on it’s own - eg writebuff and so on paired with a CSV.read table
Try using your M in place of the x2 above


#10

Thanks @y4lu. I did not manage to test your code except for reading the file.

julia> using DelimitedFiles

julia> readf = open("test.tsv")
IOStream(<file test.tsv>)

julia> readbuff = IOBuffer(read(readf));

julia> close(readf);

julia> @time x = readdlm(readbuff);
  6.580607 seconds (64.12 M allocations: 1.832 GiB, 8.59% gc time)

julia> writebuff = IOBuffer();
julia> using CSV

julia> @time CSV.write(writebuff, x; delim = '\t');
ERROR: ArgumentError: no default `Tables.rows` implementation for type: Array{Any,2}

julia> writef = open("testout.tsv");
ERROR: SystemError: opening file testout.tsv: Aucun fichier ou dossier de ce type

#11

I wonder whether this is somehow related to the relatively large number of columns… Both CSVFiles.jl and CSV.jl will process this data as iterators of NamedTuple (i.e. as rows), and I’ve never been sure whether there is maybe some cut-off where a named tuple with a very large number of fields gets inefficient?


#12

To write my big DataFrames, I found this solution, it is not elegant but almots 10x faster :

julia> @time printDF(M)
write results
 31.063968 seconds (129.47 M allocations: 42.875 GiB, 23.42% gc time)

function printDF(df)
    println("write results")
    open("df.csv", "w") do f
        write(f, join(names(M), "\t") * "\n") # print header
        for row in 1:size(df)[1]
            line = df[row,1]
            for col in 2:size(df)[2]
                if typeof(df[row, col]) == String 
                    c = df[row, col] 
                else 
                    c = string(df[row, col])
                end
                line = line * "\t" * c  # merge all cells of one row
            end
            write(f, "$line\n") # print df line by line
        end
    end
end

#13

@Fred I think the idea should that you didn’t have to do this to get speed. It should already be in a package. Maybe contribute this version back


#14

@xiaodai I completely agree with you but my solution is not good enough to be in a package, because it assumes that the dataframe as only 2 types : strings and floats (it even assumes that the row names are strings)


#15

You can also do show(f, MIME("text/csv"), df). This calls a simple function similar to the custom one you showed above.