Append!() function extremely slow in DataFrames + CSV

Fred · January 16, 2018, 9:57am

Hi,

When I switched to CSV instead of readtable() to read tabular data, I was surprised to see an extreme slowdown. I attached a small benchmark which seems to indicate that the slowdown is in the append!() function.

Thank you for any advice !

using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)

    for i in 1:nrow
        append!(y,x[i,:])
    end 
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)

    for i in 1:nrow
        append!(y,x[i,:])
    end
end

@time oldDF(3000)
@time newDF(3000)
@time oldDF(3000)
@time newDF(3000)

#=
Results :
readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
4.402112 seconds (3.77 M allocations: 257.848 MiB, 4.21% gc time)

CSV
240.959365 seconds (518.44 M allocations: 41.142 GiB, 4.02% gc time)

readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
0.947368 seconds (1.86 M allocations: 157.937 MiB, 15.46% gc time)

CSV
231.824696 seconds (513.78 M allocations: 40.898 GiB, 4.00% gc time)
=#

kristoffer.carlsson · January 16, 2018, 10:07am

Could you put the test.tsv file somewhere so it is easier to run your code.

nalimilan · January 16, 2018, 10:24am

What makes you think it’s due to append!? You should move the append! calls to a separate function to see that. Anyway, calling append! like that is really not efficient. What’s the actual use case?

Fred · January 16, 2018, 10:25am

@kristoffer.carlsson, the file is large (30Mo) is there a way to attach a file here ?

kristoffer.carlsson · January 16, 2018, 10:27am

Is there a way you could reduce the file size while still showing the performance difference? Answers tend to be much better and accurate if it is possible to actually try out the code instead of guessing what is happening.

Fred · January 16, 2018, 10:46am

@kristoffer.carlsson : a smaller test.tsv file is available here
test.tsv
(don’t forget to reduce the row number

@time oldDF(1999)
@time newDF(1999)

@nalimilan

This is a new version of the test code with every step isolated in separate functions

using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end


function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i,:])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df,row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)

#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
 3.547139 seconds (2.17 M allocations: 113.254 MiB, 1.16% gc time)
 12.916447 seconds (15.24 M allocations: 1.090 GiB, 1.35% gc time)


WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  0.118423 seconds (260.28 k allocations: 13.271 MiB)
  5.810863 seconds (10.58 M allocations: 866.798 MiB, 1.64% gc time)

=#

nalimilan · January 16, 2018, 10:51am

Thanks, but that doesn’t address my question: is the time due to reading the TSV file, or to calling loop on the resulting data frame? You need to call @time separately to find this out.

Fred · January 16, 2018, 10:59am

@nalimilan you are absolutely right

The read time is even faster with the new dataframe so I think it is the append() which is much slower

using DataFrames
using CSV


function newDF(nrow::Int64)
    @time  x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    @time x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end


function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i,:])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df,row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)

#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  2.635364 seconds (1.42 M allocations: 74.783 MiB, 1.01% gc time)
  3.699579 seconds (2.27 M allocations: 118.356 MiB, 1.30% gc time)
  4.071875 seconds (1.59 M allocations: 84.596 MiB, 0.82% gc time)
 12.948613 seconds (15.24 M allocations: 1.090 GiB, 1.45% gc time)


WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  0.017080 seconds (76.86 k allocations: 4.369 MiB)                     readtime old df
  0.122019 seconds (260.42 k allocations: 13.278 MiB)                   append time old df
  0.007308 seconds (54.06 k allocations: 1.844 MiB)                     readtime new df
  5.860227 seconds (10.58 M allocations: 866.799 MiB, 1.70% gc time)    append time new df

nalimilan · January 16, 2018, 11:03am

OK, so that must be due to the fact that columns are Vector{Union{T, Missing}} rather than DataVector{T}. I guess you’re on Julia 0.6? It would be worth trying on 0.7, where the memory layout should be much more efficient.

Anyway, do you really need to call append! repeatedly like that? It should be much faster to specify the final number of rows when calling similar, and then assign the values to each row. There are probably even more efficient approaches, depending on what you actually need to do.

Fred · January 16, 2018, 11:18am

@nalimilan here are the results with Julia 0.7

0.085393 seconds (199.48 k allocations: 11.631 MiB, 3.11% gc time) old
5.846089 seconds (8.43 M allocations: 814.511 MiB, 1.37% gc time) new

almost a factor 7 slowdown

I admit that calling append!() repeatedly is not elegant but I have to filter large datafile and I cannot know the final number of rows until everything is filtered so a speed factor of 7 can make a good difference.

rveltz · January 16, 2018, 11:52am

Maybe ElasticArrays could help your append?

nalimilan · January 16, 2018, 11:52am

Hmm… Looks like some major optimizations are missing here. If you want to help making progress, it would be useful to check whether you can reproduce this without DataFrames, i.e. by calling append! directly on vectors, comparing Vector{T}, DataVector{T} with Vector{Union{T, Missing}}. Then it would be worth filing an issue in Julia.

If you need to filter a data set, maybe you can fill a boolean vector indicating whether a row should be kept or not, and then call getindex on the data frame with it? That would be dramatically faster. There’s also an (unoptimized as the moment) filter function in the latest DataFrames version which could be used to get the same result.

Fred · January 16, 2018, 12:58pm

I think it is indeed possible to do this in two steps

quinnj · January 16, 2018, 2:10pm

CSV.jl also supports appending to an existing sink natively, so you can do something like:

sink = CSV.read("test.tsv"; delim='\t', header=true)
for i = 1:nrow
    CSV.read("test.tsv", sink; delim='\t', header=true, append=true)
end

though it’s unclear if that’s what you’re really trying to do here or not (is it just a one-row file?)

Fred · January 16, 2018, 2:21pm

Thank you for this remark @quinnj

it is not a one row file.

I want to select in a very large table (>30 000 rows) some rows that satisfy a test and merge all theses selected rows in a new table with exactly the same header than the initial table

Topic		Replies	Views
CSV.read extremely slow wrt readtable Data	14	3620	July 27, 2018
CSV read in is too slow than other language General Usage performance	13	1316	June 21, 2023
Push! is very slow on Vector{Union{...}} Internals & Design	2	834	July 12, 2018
Read csv files slow Performance filesystem	13	1681	July 28, 2020
Help with appending row in read in DataFrame (weird behavior) General Usage dataframes , csv	6	531	September 24, 2022

Append!() function extremely slow in DataFrames + CSV

Related topics