How to speed up the for-loop with dataframe access

Thanks for chiming in. @bkamins' comments are way over my head. This looks very good indeed. Is it sensible to discuss the generalized case (finding all duplicates), or should we leave it at this?

@goerch Your solution looks much better! I actually just wanted to point out that the calculation should be embedded in a function, and I didn't do any benchmarks or tests.

@pdeffebach DataFramesMeta.jl looks very impressive! I will definitely take a closer look at the package!

No problem here. My intention is to understand what can and can't be done when transferring common SQL knowledge to DataFrames, and for now I've failed to construct the necessary self join in this example (and I suspect that in a case like this we are on our own to construct indices).

To understand @bkamins' brief comments above, I found another post of his that provides the details.

Consider the following minimal example:

function f1(df)
    s = 0
    n = nrow(df)
    for i in 1:n
        for j in 1:n
            s += df.x[i] * df.y[j]
        end
    end
    return s
end

f2(df) = helper(df.x, df.y, nrow(df))

function helper(x, y, n)
    s = 0
    for i in 1:n
        for j in 1:n
            s += x[i] * y[j]
        end
    end
    return s
end

and now we benchmark it:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(x=1:10^3, y=1:10^3);

julia> @btime f1($df)
  114.867 ms (3952700 allocations: 60.31 MiB)
250500250000

julia> @btime f2($df)
  490.200 μs (3 allocations: 48 bytes)
250500250000

And as @pdeffebach commented, DataFramesMeta.jl provides convenience macros that simplify this process.
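To illustrate why f2 is so much faster than f1: a DataFrame does not carry its column types in its own type, so the compiler cannot infer what df.x returns inside the loop. Passing the columns to a helper (a "function barrier") lets Julia compile the loop for the concrete vector types. Here is a minimal sketch of that idea using only Base Julia. ToyFrame, getcol, slow_sum, fast_sum and kernel are hypothetical names made up for this illustration, and ToyFrame is only a toy stand-in, not how DataFrames.jl is actually implemented.

```julia
# Toy stand-in for a DataFrame (hypothetical type, NOT the DataFrames.jl
# implementation): columns are stored as AbstractVector, so the compiler
# cannot infer their concrete type.
struct ToyFrame
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
end

# Look up a column by name; the inferred return type is only AbstractVector.
getcol(tf::ToyFrame, name::Symbol) = tf.columns[findfirst(==(name), tf.names)]

# Type-unstable: the column lookups happen inside the loop, so every
# indexing and multiplication is dispatched at run time.
function slow_sum(tf)
    s = 0
    n = length(getcol(tf, :x))
    for i in 1:n, j in 1:n
        s += getcol(tf, :x)[i] * getcol(tf, :y)[j]
    end
    return s
end

# Function barrier: fetch the columns once, then hand them to a kernel
# that Julia specializes for their concrete types.
fast_sum(tf) = kernel(getcol(tf, :x), getcol(tf, :y))

function kernel(x, y)
    s = zero(eltype(x)) * zero(eltype(y))
    for xi in x, yj in y
        s += xi * yj
    end
    return s
end

tf = ToyFrame([collect(1:100), collect(1:100)], [:x, :y])
@assert slow_sum(tf) == fast_sum(tf) == sum(1:100)^2
```

Both functions compute the same result; only the place where the columns are extracted differs, which is the same difference as between f1 and f2 above.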


Another option is:

julia> function f3(df)
           s = 0
           n = length(df.x)
           for i in 1:n
               for j in 1:n
                   s += df.x[i] * df.y[j]
               end
           end
           return s
       end;

julia> @btime f3(Tables.columntable($df))
  571.700 μs (9 allocations: 320 bytes)
250500250000
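As far as I understand it, this works because Tables.columntable returns a NamedTuple of column vectors, and a NamedTuple encodes the type of every field in its own type, so df.x is type-stable inside f3. A small sketch with a hand-built NamedTuple in place of the converted data frame (sum_products is just a hypothetical copy of f3 so the snippet runs on its own, without DataFrames):

```julia
# A NamedTuple of vectors, the same kind of object Tables.columntable produces.
nt = (x = collect(1:1000), y = collect(1:1000))

# Same loop as f3: the field access nt.x is resolved at compile time
# because the column types are part of the NamedTuple's type.
function sum_products(tbl)
    s = 0
    n = length(tbl.x)
    for i in 1:n
        for j in 1:n
            s += tbl.x[i] * tbl.y[j]
        end
    end
    return s
end

@assert nt.x isa Vector{Int}
@assert sum_products(nt) == 250500250000
```

The trade-off is that the function gets compiled anew for every distinct set of column names and types, which can get expensive for very wide tables.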