From the MWE:
0.019227 seconds (313 allocations: 3.065 MiB)
0.037725 seconds (599.53 k allocations: 15.439 MiB, 8.73% gc time)
Indexing with the boolean mask (first line) is faster and makes far fewer allocations than filter (second line). Even so, filter fares better here than in my actual program, where the hand-rolled filter (which doesn’t even pre-allocate the result) is 4x faster. However, my dataset there is smaller, so there’s probably some one-time overhead in the filter function that dominates. I think the MWE is the more realistic result.
What’s really interesting is what happens if you replace my loop with something like
class = df[!,:b] .> 5
The allocations for the indexing step go way up and its run time gets much worse:
0.273326 seconds (277.86 k allocations: 17.071 MiB)
0.044197 seconds (599.55 k allocations: 15.438 MiB, 9.76% gc time)
Type instability?
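using DataFrames

function test()
    n = 200000
    df = DataFrame([[rand(1:10) for _ in 1:n],
                    [rand(1:10) for _ in 1:n],
                    [rand(1:10) for _ in 1:n]], [:a, :b, :c])
    # Hand-rolled boolean mask: true where column :b is greater than 5
    class = zeros(Bool, size(df,1))
    m = 0
    for i=1:size(df,1)
        class[i] = df[i,:b] > 5
        if class[i]
            m += 1
        end
    end
    println(m)  # number of rows that pass the test
    df1_ = @time df[class,:]                 # first timing: index with the mask
    df1 = @time filter(r->r[:b] > 5, df)     # second timing: built-in row-wise filter
end

test()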
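
One way to probe the type-instability question, and to check whether compilation time is leaking into the `.>` measurement, is to warm up every variant before timing it and to add the column-selector form of filter alongside the row-wise one. This is only a rough sketch, not a definitive benchmark: `mask_from_column` and `compare` are names made up for this example, and the `filter(:b => >(5), df)` variant is my addition rather than part of the MWE. No timings are claimed here; run it on your own data.

```julia
using DataFrames

# Hypothetical helper (not from the MWE). Acts as a function barrier: `b` is a
# concretely typed vector here, so the loop does not go through `df[i, :b]`.
function mask_from_column(b::AbstractVector)
    class = zeros(Bool, length(b))
    for i in eachindex(b)
        class[i] = b[i] > 5
    end
    return class
end

# Hypothetical driver for this sketch.
function compare(n = 200_000)
    df = DataFrame(a = rand(1:10, n), b = rand(1:10, n), c = rand(1:10, n))

    class_loop = mask_from_column(df.b)   # Vector{Bool}, as in the MWE loop
    class_bcast = df.b .> 5               # BitVector, as in the `.>` variant

    # Warm-up calls so the timings below do not include compilation.
    df[class_loop, :]
    df[class_bcast, :]
    filter(:b => >(5), df)
    filter(r -> r[:b] > 5, df)

    r1 = @time df[class_loop, :]            # index with Vector{Bool}
    r2 = @time df[class_bcast, :]           # index with BitVector
    r3 = @time filter(:b => >(5), df)       # column-selector form
    r4 = @time filter(r -> r[:b] > 5, df)   # row-wise form from the MWE
    return r1, r2, r3, r4
end

r1, r2, r3, r4 = compare()
```

If the `.>` variant is no longer slower after the warm-up, the extra time in the earlier measurement was probably compilation rather than the indexing itself. All four results should contain the same rows, so comparing them with `==` is a quick sanity check.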