When I switched from readtable() to CSV.read to read tabular data, I was surprised to see an extreme slowdown. I attached a small benchmark which seems to indicate that the slowdown is in the append!() function.
Thank you for any advice!
using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x, 0)
    for i in 1:nrow
        append!(y, x[i, :])
    end
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x, 0)
    for i in 1:nrow
        append!(y, x[i, :])
    end
end

@time oldDF(3000)
@time newDF(3000)
@time oldDF(3000)
@time newDF(3000)
#=
Results:
readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
4.402112 seconds (3.77 M allocations: 257.848 MiB, 4.21% gc time)
CSV
240.959365 seconds (518.44 M allocations: 41.142 GiB, 4.02% gc time)
readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
0.947368 seconds (1.86 M allocations: 157.937 MiB, 15.46% gc time)
CSV
231.824696 seconds (513.78 M allocations: 40.898 GiB, 4.00% gc time)
=#
What makes you think it’s due to append!? You should move the append! calls to a separate function to see that. Anyway, calling append! like that is really not efficient. What’s the actual use case?
Is there a way you could reduce the file size while still showing the performance difference? Answers tend to be much better and accurate if it is possible to actually try out the code instead of guessing what is happening.
Here is a new version of the test code, with every step isolated in a separate function:
using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x, 0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x, 0)
    loop(y, x, nrow)
end

function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i, :])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df, row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)
#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
3.547139 seconds (2.17 M allocations: 113.254 MiB, 1.16% gc time)
12.916447 seconds (15.24 M allocations: 1.090 GiB, 1.35% gc time)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
0.118423 seconds (260.28 k allocations: 13.271 MiB)
5.810863 seconds (10.58 M allocations: 866.798 MiB, 1.64% gc time)
=#
Thanks, but that doesn’t address my question: is the time due to reading the TSV file, or to calling loop on the resulting data frame? You need to call @time separately to find this out.
The read time is even faster with the new DataFrame, so I think it is append!() that is much slower:
using DataFrames
using CSV

function newDF(nrow::Int64)
    @time x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x, 0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    @time x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x, 0)
    loop(y, x, nrow)
end

function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i, :])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df, row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)
#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
2.635364 seconds (1.42 M allocations: 74.783 MiB, 1.01% gc time)
3.699579 seconds (2.27 M allocations: 118.356 MiB, 1.30% gc time)
4.071875 seconds (1.59 M allocations: 84.596 MiB, 0.82% gc time)
12.948613 seconds (15.24 M allocations: 1.090 GiB, 1.45% gc time)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
0.017080 seconds (76.86 k allocations: 4.369 MiB)                    read time, old DF
0.122019 seconds (260.42 k allocations: 13.278 MiB)                  append time, old DF
0.007308 seconds (54.06 k allocations: 1.844 MiB)                    read time, new DF
5.860227 seconds (10.58 M allocations: 866.799 MiB, 1.70% gc time)   append time, new DF
=#
OK, so that must be due to the fact that columns are Vector{Union{T, Missing}} rather than DataVector{T}. I guess you’re on Julia 0.6? It would be worth trying on 0.7, where the memory layout should be much more efficient.
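For reference, a quick way to check which column representation each reader gives you (a minimal sketch; integer column indexing as shown here is the syntax of the DataFrames versions from that era, newer releases prefer x[!, 1]):

x_old = readtable("test.tsv", separator='\t', header=true)
x_new = CSV.read("test.tsv"; delim="\t", header=true)

typeof(x_old[1])   # container type of the first column as read by readtable
typeof(x_new[1])   # container type of the first column as read by CSV.read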
Anyway, do you really need to call append! repeatedly like that? It should be much faster to specify the final number of rows when calling similar, and then assign the values to each row. There are probably even more efficient approaches, depending on what you actually need to do.
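For example, something along these lines (a rough, untested sketch; it fills the preallocated frame cell by cell because whole-row assignment syntax varies between DataFrames versions):

function preallocDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x, nrow)              # allocate all nrow rows up front
    for col in names(x), i in 1:nrow
        y[i, col] = x[i, col]         # assign into the preallocated frame
    end
    y
end

@time preallocDF(1999)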
0.085393 seconds (199.48 k allocations: 11.631 MiB, 3.11% gc time)   old
5.846089 seconds (8.43 M allocations: 814.511 MiB, 1.37% gc time)    new

That is almost a factor-of-70 slowdown.
I admit that calling append!() repeatedly is not elegant, but I have to filter a large data file and I cannot know the final number of rows until everything is filtered, so a slowdown of that magnitude makes a real difference.
Hmm… Looks like some major optimizations are missing here. If you want to help make progress, it would be useful to check whether you can reproduce this without DataFrames, i.e. by calling append! directly on vectors, comparing Vector{T} and DataVector{T} with Vector{Union{T, Missing}}. Then it would be worth filing an issue in Julia.
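Something like the following could serve as a starting point (a minimal sketch; the DataVector{T} case would additionally need the DataArrays package, and on Julia 0.6 the Missing type comes from the Missings package):

using Missings    # not needed on Julia 0.7+, where Missing lives in Base

# Repeatedly append a single element, mimicking what append!(df, row)
# ends up doing for each column of the data frame.
function grow(v::AbstractVector, x, n::Int)
    for _ in 1:n
        append!(v, [x])
    end
    v
end

# Run each case twice so the second timing excludes compilation.
@time grow(Float64[], 1.0, 100_000)
@time grow(Float64[], 1.0, 100_000)
@time grow(Vector{Union{Float64, Missing}}(), 1.0, 100_000)
@time grow(Vector{Union{Float64, Missing}}(), 1.0, 100_000)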
If you need to filter a data set, maybe you can fill a boolean vector indicating whether each row should be kept or not, and then call getindex on the data frame with it? That would be dramatically faster. There's also an (unoptimized at the moment) filter function in the latest DataFrames version which could be used to get the same result.
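For instance (a minimal sketch; :score and the 0.5 threshold are made-up stand-ins for the real criterion, and the x[:score] column syntax is the old one, newer DataFrames versions use x[!, :score]):

x = CSV.read("test.tsv"; delim="\t", header=true)

# Build the Bool mask in one pass, then index once: the result is
# allocated a single time instead of being grown row by row.
# (Assumes the column has no missing values; handle those explicitly otherwise.)
keep = x[:score] .> 0.5
y = x[keep, :]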
I want to select, in a very large table (>30,000 rows), the rows that satisfy a test, and merge all these selected rows into a new table with exactly the same header as the initial table.
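A sketch of that use case along the lines suggested above (the column name and test are hypothetical; the point is that the mask is cheap to build and the selected rows are materialized in a single getindex call, keeping the original header):

x = CSV.read("test.tsv"; delim="\t", header=true)

# Hypothetical per-row test; replace :score and the condition with the real criterion.
rowpasses(df, i) = !ismissing(df[i, :score]) && df[i, :score] > 0.5

keep = [rowpasses(x, i) for i in 1:size(x, 1)]   # one Bool per row
y = x[keep, :]                                   # same columns as x, selected rows only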