Hand-rolled filter faster than the filter function for a DataFrame?

From the MWE:

0.019227 seconds (313 allocations: 3.065 MiB)
0.037725 seconds (599.53 k allocations: 15.439 MiB, 8.73% gc time)

Faster, with far fewer allocations. This result is better than what I see in the actual program, where the hand-rolled filter (which doesn’t even pre-allocate the result) is 4x faster. However, my actual dataset is smaller, so there’s probably some one-time overhead in the filter function that dominates. I think the MWE is the more realistic result.

What’s really interesting is what happens if you replace my loop with something like:

class = df[!,:b] .> 5

The allocations go way up and the run time is much worse:

0.273326 seconds (277.86 k allocations: 17.071 MiB)
0.044197 seconds (599.55 k allocations: 15.438 MiB, 9.76% gc time)

Type instability?

using DataFrames

function test()
    n = 200000
    df = DataFrame([[rand(1:10) for _ in 1:n],
                    [rand(1:10) for _ in 1:n],
                    [rand(1:10) for _ in 1:n]], [:a, :b, :c])

    class = zeros(Bool, size(df,1))
    m = 0
    for i=1:size(df,1)
        class[i] = df[i,:b] > 5
        if class[i]
            m += 1
        end
    end
    println(m)
    
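    # Times only the indexing step; the mask was already built in the loop above.
    df1_ = @time df[class, :]

    # Times predicate evaluation on every row plus construction of the result.
    df1 = @time filter(r -> r[:b] > 5, df)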
end
test()
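
One possible explanation for the slow broadcast variant, which nobody in this thread confirms: `df[!,:b] .> 5` produces a `BitVector`, while the loop fills a `Vector{Bool}`. The first call to `df[class,:]` with a new mask type can include compilation time in the `@time` figure, so timing a second call would help separate the two effects. The sketch below only checks the mask types; the plain vector `b` is a stand-in (an assumption) for the column `df[!,:b]`.

```julia
# Assumption: a plain vector stands in for the column df[!, :b].
b = rand(1:10, 200_000)

# Mask built the way the hand-rolled loop builds it: a Vector{Bool}.
loop_mask = zeros(Bool, length(b))
for i in eachindex(b)
    loop_mask[i] = b[i] > 5
end

# Mask built by broadcasting: a BitVector (packed bits), a different type.
bcast_mask = b .> 5

println(typeof(loop_mask))
println(typeof(bcast_mask))
println(loop_mask == bcast_mask)   # compares contents, not container types
```

If the two masks hold the same values, any timing difference in `df[class,:]` comes from how the mask type is handled, not from which rows are selected.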


It doesn’t seem like you’re really timing the same things here:

df1_ = @time df[class, :]

This is just timing how long it takes to index a DataFrame with a Bool vector that has already been computed.

vs.

df1 = @time filter(r->r[:b] > 5, df)

This is applying the r->r[:b] > 5 function to each DataFrameRow and keeping only the rows where it returns true, so the cost of evaluating the predicate is included in the timing.
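
A like-for-like comparison would time the mask construction together with the indexing. The following is a minimal sketch of that, not the original poster's code: the helper names `hand_rolled`, `row_filter`, and `col_filter` are made up for illustration, and the column-pair form of `filter` is included as a third variant on the assumption that a reasonably recent DataFrames.jl is installed. No timings are claimed here; they will vary by machine and package version.

```julia
using DataFrames

n = 200_000
df = DataFrame(a = rand(1:10, n), b = rand(1:10, n), c = rand(1:10, n))

# Hypothetical helper: build the mask AND index, so predicate cost is included.
hand_rolled(df) = df[df.b .> 5, :]

# Hypothetical helper: row-wise filter, as in the original post.
row_filter(df) = filter(r -> r[:b] > 5, df)

# Hypothetical helper: column-pair form, which passes only :b to the predicate.
col_filter(df) = filter(:b => x -> x > 5, df)

# Call each once first so that compilation is excluded from the timings below.
hand_rolled(df); row_filter(df); col_filter(df)

r1 = @time hand_rolled(df)
r2 = @time row_filter(df)
r3 = @time col_filter(df)

println(r1 == r2 == r3)   # checks that all three select the same rows
```

Wrapping each variant in a function and warming it up keeps compilation and global-scope effects out of the measurement, so the three numbers reflect the same amount of work.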

You’re being very picky :wink:

LOL. Someday I’m going to ask a question on this forum that doesn’t embarrass me. But today is not that day.

Bad news: I’ve embarrassed myself again…

Good news: it shouldn’t be faster, and it’s not.

Thank you!
