Append!() function extremely slow in DataFrames + CSV

package

#1

Hi,

When I switched to CSV instead of readtable() to read tabular data, I was surprised to see an extreme slowdown. I attached a small benchmark which seems to indicate that the slowdown is in the append!() function.

Thank you for any advice !

using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)

    for i in 1:nrow
        append!(y,x[i,:])
    end 
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)

    for i in 1:nrow
        append!(y,x[i,:])
    end
end

@time oldDF(3000)
@time newDF(3000)
@time oldDF(3000)
@time newDF(3000)

#=
Results :
readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
4.402112 seconds (3.77 M allocations: 257.848 MiB, 4.21% gc time)

CSV
240.959365 seconds (518.44 M allocations: 41.142 GiB, 4.02% gc time)

readtable()
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
0.947368 seconds (1.86 M allocations: 157.937 MiB, 15.46% gc time)

CSV
231.824696 seconds (513.78 M allocations: 40.898 GiB, 4.00% gc time)
=#

#2

Could you put the test.tsv file somewhere so it is easier to run your code.


#3

What makes you think it’s due to append!? You should move the append! calls to a separate function to see that. Anyway, calling append! like that is really not efficient. What’s the actual use case?


#4

@kristoffer.carlsson, the file is large (30Mo) is there a way to attach a file here ?


#5

Is there a way you could reduce the file size while still showing the performance difference? Answers tend to be much better and accurate if it is possible to actually try out the code instead of guessing what is happening.


#6

@kristoffer.carlsson : a smaller test.tsv file is available here
test.tsv
(don’t forget to reduce the row number

@time oldDF(1999)
@time newDF(1999)

@nalimilan

This is a new version of the test code with every step isolated in separate functions :wink:

using DataFrames
using CSV

function newDF(nrow::Int64)
    x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end


function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i,:])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df,row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)

#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
 3.547139 seconds (2.17 M allocations: 113.254 MiB, 1.16% gc time)
 12.916447 seconds (15.24 M allocations: 1.090 GiB, 1.35% gc time)


WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  0.118423 seconds (260.28 k allocations: 13.271 MiB)
  5.810863 seconds (10.58 M allocations: 866.798 MiB, 1.64% gc time)

=#

#7

Thanks, but that doesn’t address my question: is the time due to reading the TSV file, or to calling loop on the resulting data frame? You need to call @time separately to find this out.


#8

@nalimilan you are absolutely right :wink:

The read time is even faster with the new dataframe so I think it is the append() which is much slower :slight_smile:

using DataFrames
using CSV


function newDF(nrow::Int64)
    @time  x = CSV.read("test.tsv"; delim="\t", header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end

function oldDF(nrow::Int64)
    @time x = readtable("test.tsv", separator='\t', header=true)
    y = similar(x,0)
    loop(y, x, nrow)
end


function loop(y::DataFrame, x::DataFrame, nrow::Int64)
    for i in 1:nrow
        addRow(y, x[i,:])
    end
end

function addRow(df::DataFrame, row::DataFrame)
    append!(df,row)
end

@time oldDF(1999)
@time newDF(1999)
@time oldDF(1999)
@time newDF(1999)

#=
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  2.635364 seconds (1.42 M allocations: 74.783 MiB, 1.01% gc time)
  3.699579 seconds (2.27 M allocations: 118.356 MiB, 1.30% gc time)
  4.071875 seconds (1.59 M allocations: 84.596 MiB, 0.82% gc time)
 12.948613 seconds (15.24 M allocations: 1.090 GiB, 1.45% gc time)


WARNING: readtable is deprecated, use CSV.read from the CSV package instead
  0.017080 seconds (76.86 k allocations: 4.369 MiB)                     readtime old df
  0.122019 seconds (260.42 k allocations: 13.278 MiB)                   append time old df
  0.007308 seconds (54.06 k allocations: 1.844 MiB)                     readtime new df
  5.860227 seconds (10.58 M allocations: 866.799 MiB, 1.70% gc time)    append time new df

#9

OK, so that must be due to the fact that columns are Vector{Union{T, Missing}} rather than DataVector{T}. I guess you’re on Julia 0.6? It would be worth trying on 0.7, where the memory layout should be much more efficient.

Anyway, do you really need to call append! repeatedly like that? It should be much faster to specify the final number of rows when calling similar, and then assign the values to each row. There are probably even more efficient approaches, depending on what you actually need to do.


#10

@nalimilan here are the results with Julia 0.7

0.085393 seconds (199.48 k allocations: 11.631 MiB, 3.11% gc time) old
5.846089 seconds (8.43 M allocations: 814.511 MiB, 1.37% gc time) new

almost a factor 7 slowdown

I admit that calling append!() repeatedly is not elegant but I have to filter large datafile and I cannot know the final number of rows until everything is filtered so a speed factor of 7 can make a good difference.


#11

Maybe ElasticArrays could help your append?


#12

Hmm… Looks like some major optimizations are missing here. If you want to help making progress, it would be useful to check whether you can reproduce this without DataFrames, i.e. by calling append! directly on vectors, comparing Vector{T}, DataVector{T} with Vector{Union{T, Missing}}. Then it would be worth filing an issue in Julia.

If you need to filter a data set, maybe you can fill a boolean vector indicating whether a row should be kept or not, and then call getindex on the data frame with it? That would be dramatically faster. There’s also an (unoptimized as the moment) filter function in the latest DataFrames version which could be used to get the same result.


Push! is very slow on Vector{Union{...}}
#13

I think it is indeed possible to do this in two steps :slight_smile:


#14

CSV.jl also supports appending to an existing sink natively, so you can do something like:

sink = CSV.read("test.tsv"; delim='\t', header=true)
for i = 1:nrow
    CSV.read("test.tsv", sink; delim='\t', header=true, append=true)
end

though it’s unclear if that’s what you’re really trying to do here or not (is it just a one-row file?)


#15

Thank you for this remark @quinnj

it is not a one row file.

I want to select in a very large table (>30 000 rows) some rows that satisfy a test and merge all theses selected rows in a new table with exactly the same header than the initial table :slight_smile: