Hi,
I have to read and write some big DataFrames (100-500 MB). For an 80 MB DataFrame, I need 220 seconds to write it, and for a 500 MB DataFrame I gave up after waiting one hour…
I use the following code:
julia> using CSV
julia> using DataFrames
julia> M = CSV.read("test.tsv"; delim='\t', header=true)
13893×1001 DataFrame. Omitted printing of 992 columns
julia> @time CSV.write("df.csv", M ; delim ='\t')
220.218516 seconds (13.83 M allocations: 880.227 MiB, 0.34% gc time)
"df.csv"
Hmmmm, that doesn’t seem great; is there any way you can share a sample of the DataFrame or at least the schema? I can try to figure out what’s causing the bottleneck.
What version of Julia are you using? I think CSV is being phased out in favor of DelimitedFiles. On my machine, using the file you have attached, I get:
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, sandybridge)
Environment:
JULIA = /project/soft/linux64/src/julia-1.0.0
julia> using DelimitedFiles
julia> @time x = readdlm("test.tsv");
83.725932 seconds (396.76 M allocations: 9.211 GiB, 28.83% gc time)
julia> @time writedlm("test1.tsv", x)
5.158386 seconds (1.54 M allocations: 152.232 MiB)
Not at all, DelimitedFiles is just the same readdlm function that was provided by Julia Base in previous versions. It is intended mainly for reading homogeneous arrays. CSV.jl is a more powerful package which should be used when reading tabular data (like data frames).
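To make the distinction concrete, here is a minimal sketch using the same test.tsv file (the sink-less CSV.read form matches the CSV.jl versions used in this thread):
julia> using DelimitedFiles, CSV, DataFrames
julia> x = readdlm("test.tsv", '\t');   # plain matrix: one element type for the whole array, header row included as data unless you pass header=true
julia> df = CSV.read("test.tsv"; delim='\t', header=true);   # DataFrame: named columns, each with its own parsed type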
Today I did a test with the new Julia 1.0.2 and compared CSV and CSVFiles with the “small” test.tsv file. Both are very slow. Just for fun I did a test with Julia 0.6.2 and it is much faster, so there is a 25x speed regression when writing the same data. So for big DataFrames, I have no choice but to stay with the old Julia 0.6 version.
with Julia 1.0.2
julia> using DataFrames
julia> using CSV
julia> @time M = CSV.read("test.tsv"; delim='\t', header=true)
11.363795 seconds (62.35 M allocations: 1.948 GiB, 7.47% gc time)
13893×1001 DataFrame. Omitted printing of 992 columns
julia> @time CSV.write("df.csv", M ; delim ='\t')
206.409265 seconds (13.81 M allocations: 879.979 MiB, 0.17% gc time)
"df.csv"
julia> using DataFrames
julia> using CSVFiles
julia> @time M = DataFrame(load("test.tsv"))
20.100596 seconds (67.96 M allocations: 3.183 GiB, 4.49% gc time)
13893×1001 DataFrame. Omitted printing of 992 columns
julia> @time save("df2.tsv", M)
220.085667 seconds (20.76 M allocations: 1.296 GiB, 0.24% gc time)
with Julia 0.6.2
(Julia 0.6.2 startup banner: Version 0.6.2 (2017-12-13 18:08 UTC), Official http://julialang.org/ release, x86_64-pc-linux-gnu)
julia> using DataFrames
julia> using CSV
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/fred/.julia/v0.6/NullableArrays/src/operators.jl:99.
julia> @time M = CSV.read("test.tsv"; delim='\t', header=true);
14.923997 seconds (48.01 M allocations: 1.289 GiB, 1.79% gc time)
julia> @time CSV.write("df.csv", M ; delim ='\t')
9.370022 seconds (82.44 M allocations: 1.690 GiB, 3.53% gc time)
It looks like it could be limited by missing or insufficient read buffering in readdlm, and similarly by the write buffering in CSV.write.
Is this an OS thing, i.e. buffering affecting one direction or the other?
@fred
A couple of fixes, but either half should work on its own, e.g. writebuff and so on paired with a table from CSV.read.
Try using your M in place of the x2 above.
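For reference, the buffered-write idea could look roughly like the sketch below. This is my own reconstruction, not the writebuff code mentioned above: rows are accumulated in an IOBuffer and flushed to the file in large chunks instead of issuing many small writes.
function write_buffered(path, df; delim="\t", bufsize=2^20)
    buf = IOBuffer()
    open(path, "w") do f
        println(buf, join(names(df), delim))        # header
        for row in 1:size(df, 1)
            println(buf, join((string(df[row, col]) for col in 1:size(df, 2)), delim))
            if position(buf) >= bufsize             # flush once the buffer is large enough
                write(f, take!(buf))
            end
        end
        write(f, take!(buf))                        # flush the remainder
    end
end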
I wonder whether this is somehow related to the relatively large number of columns… Both CSVFiles.jl and CSV.jl process this data as iterators of NamedTuples (i.e. as rows), and I’ve never been sure whether there is some cut-off at which a named tuple with a very large number of fields becomes inefficient.
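One way to test that hypothesis would be a micro-benchmark that keeps the total number of cells fixed while varying the width. This is only a sketch; the two-argument DataFrame(columns, names) constructor is assumed to be available in your DataFrames version.
using CSV, DataFrames
for ncols in (10, 100, 1000)
    nrows = 1_000_000 ÷ ncols
    cols = [rand(nrows) for _ in 1:ncols]
    df = DataFrame(cols, Symbol.("x", 1:ncols))   # same number of cells in every iteration
    @time CSV.write("wide_$(ncols).tsv", df; delim='\t')
end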
To write my big DataFrames, I found this solution; it is not elegant but almost 10x faster:
julia> @time printDF(M)
write results
31.063968 seconds (129.47 M allocations: 42.875 GiB, 23.42% gc time)
function printDF(df)
    println("write results")
    open("df.csv", "w") do f
        write(f, join(names(df), "\t") * "\n")   # print the header
        for row in 1:size(df, 1)
            line = df[row, 1]                    # assumes the first column holds strings
            for col in 2:size(df, 2)
                if typeof(df[row, col]) == String
                    c = df[row, col]
                else
                    c = string(df[row, col])
                end
                line = line * "\t" * c           # merge all cells of one row
            end
            write(f, "$line\n")                  # write the DataFrame line by line
        end
    end
end
@xiaodai I completely agree with you, but my solution is not good enough to be in a package, because it assumes that the DataFrame has only 2 types, strings and floats (it even assumes that the row names are strings).
@Juan reading is not a problem, but writing can take too long (more than one hour in my case) if you have many columns. So the solution is to use a custom function such as mine or @nalimilan’s.
If you don’t need DataFrames, I recommend using readdlm / writedlm; it is very fast.
julia> using DelimitedFiles
julia> x = [1, 2, 3, 4]; y = [5, 6, 7, 8];   # example vectors so the snippet runs as-is
julia> open("delim_file.txt", "w") do io
           writedlm(io, [x y])
       end
julia> readdlm("delim_file.txt", '\t', Int, '\n')
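If the data already lives in a DataFrame (like the M loaded earlier in this thread), a hedged variant is to convert it to a plain matrix first; this drops the column types and writes everything as text, and it assumes a DataFrames version where Matrix(df) works:
julia> using DelimitedFiles, DataFrames
julia> writedlm("df_fast.tsv", [permutedims(string.(names(M))); Matrix(M)], '\t')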
It’s strange; with R’s data.table it’s the opposite: it needs much more time to read, because it first has to figure out the structure and what type of data each column contains, and it has to deal with missings.
This is a niche problem that not many would face, but I am facing it too. I was using CSV.jl to save DataFrames and it was going fine. My scientific simulation produces huge DataFrames (a million rows). I was able to save them because each element was not that big, but my latest simulation produced a DataFrame that was 3 GB in RAM. CSV.write was just not able to save it, and I waited for a couple of hours.
I have already optimised my data and I need all of it for the next step.
I was wondering if there is any way to speed this up, or any new library that would reduce my time.
I do know the alternative would be to write each row as it is created, but that would involve rewriting my code, and I would like a shorter way.
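For what it’s worth, the incremental approach does not have to mean a full rewrite. Here is a minimal sketch using CSV.write’s append keyword; simulate_chunk and n_steps are placeholders for whatever produces each batch of rows in your simulation.
using CSV, DataFrames
function run_and_save(n_steps)
    for step in 1:n_steps
        chunk = simulate_chunk(step)                        # placeholder: returns a DataFrame of new rows
        CSV.write("results.csv", chunk; append = step > 1)  # header is written only for the first chunk
    end
end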
Please make an MWE that reproduces the problem, and ideally report it as an issue in the CSV.jl repository. I cannot reproduce it with a simple example, e.g.
using CSV, DataFrames
N = 10^8
df = DataFrame(; a = rand(Int, N), b = rand(Float64, N), c = rand('a':'z', N))
dst = "/tmp/data.csv"
@time CSV.write(dst, df)
@show filesize(dst) / 2^30 # file size in GiB
takes 45s on my laptop, creating an almost 4GB file (using an SSD). I find this reasonable.