If global scope were the only cause, then I'd expect a function barrier to make select and eachrow perform about the same.
However, even with the function barrier, eachrow is still more than 10x faster than select on the second (already compiled) run.
function use_select(df, cols)
    for i in 1:length(cols)
        select(df, cols[1:i] => ByRow((x...) -> f([x...])))
    end
end

function use_eachrow(df, cols)
    for i in 1:length(cols)
        map(eachrow(df)) do row
            row = row[cols[1:i]]
            result = row |> collect |> f
            result
        end
    end
end

julia> @time use_select(df, cols)
15.123364 seconds (19.28 M allocations: 1.144 GiB, 0.91% gc time, 99.72% compilation time)
julia> @time use_select(df, cols)
0.026519 seconds (295.76 k allocations: 13.219 MiB, 23.26% gc time, 10.57% compilation time)
julia> @time use_eachrow(df, cols)
0.065980 seconds (127.61 k allocations: 7.558 MiB, 97.26% compilation time)
julia> @time use_eachrow(df, cols)
0.001605 seconds (44.31 k allocations: 1.898 MiB)
I think the main difference must lie somewhere in the internals of select, which I'm not familiar with. I expected ByRow to perform the same as eachrow, so I'm a bit baffled.
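
For anyone who wants to try this, here is a minimal sketch of a setup. The df, cols, and f used for the timings above are not shown in this post, so the ones below are hypothetical stand-ins (random Float64 columns and a plain sum), and the numbers they produce may differ from mine. The sketch only checks that both approaches compute the same per-row result for the full column set.

```julia
using DataFrames

# Hypothetical stand-ins: the df, cols, and f from the timings are not shown.
df = DataFrame(rand(100, 5), :auto)   # 100 rows, columns x1..x5
cols = names(df)                      # column names as a Vector{String}
f(v) = sum(v)                         # any function that takes a vector

# Same per-row computation, written both ways over all of cols.
via_select = select(df, cols => ByRow((x...) -> f([x...])) => :result).result
via_eachrow = map(row -> f(collect(row[cols])), eachrow(df))

# Both approaches should agree row by row.
@assert via_select ≈ via_eachrow
```

With these definitions in place, use_select(df, cols) and use_eachrow(df, cols) from above can be timed directly.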