How to speed up the for-loop with dataframe access

goerch · April 13, 2022, 10:47pm

Thanks for chiming in; @bkamins's comments are way over my head. This looks very good, indeed. Is it sensible to talk about the generalized case (finding all duplicates), or should we leave it at this?

kfrb · April 13, 2022, 11:19pm

@goerch Your solution looks much better! I actually just wanted to point out that the calculation should be embedded in a function and didn’t do any benchmarks or tests.

@pdeffebach DataFramesMeta.jl looks very impressive! I will definitely take a closer look at the package!

goerch · April 13, 2022, 11:28pm

No problem here. My intention is to understand what can and can’t be done when transferring common SQL knowledge to DataFrames. So far I’ve failed to construct the necessary self join in this example, and I suspect that in a case like this we are on our own to construct indices.

qsong · April 14, 2022, 6:17am

To understand @bkamins's brief comments above, I found another post from him that provides the details.

bkamins · April 14, 2022, 6:42am

Consider the following minimal example:

function f1(df)
    s = 0
    n = nrow(df)
    for i in 1:n
        for j in 1:n
            s += df.x[i] * df.y[j]
        end
    end
    return s
end

f2(df) = helper(df.x, df.y, nrow(df))

function helper(x, y, n)
    s = 0
    for i in 1:n
        for j in 1:n
            s += x[i] * y[j]
        end
    end
    return s
end

and now we benchmark it:

julia> using BenchmarkTools, DataFrames

julia> df = DataFrame(x=1:10^3, y=1:10^3);

julia> @btime f1($df)
  114.867 ms (3952700 allocations: 60.31 MiB)
250500250000

julia> @btime f2($df)
  490.200 μs (3 allocations: 48 bytes)
250500250000

And as @pdeffebach commented, DataFramesMeta.jl provides convenience macros that simplify this process.
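
As a rough sketch of why f2 wins: a DataFrame does not carry its column types in its own type, so inside f1 the compiler cannot infer what df.x[i] is, while helper is compiled for the concrete vectors it receives (a function barrier). The example below reproduces the pattern without DataFrames. UntypedTable, slow_sum, fast_sum and kernel are hypothetical names made up for illustration, not part of any package.

```julia
# Hypothetical stand-in for a DataFrame: the columns are stored without
# concrete type information, so the type of t.cols[1] cannot be inferred.
struct UntypedTable
    cols::Vector{AbstractVector}
end

# Slow pattern: every column access inside the loop is type-unstable.
function slow_sum(t::UntypedTable)
    s = 0
    n = length(t.cols[1])
    for i in 1:n, j in 1:n
        s += t.cols[1][i] * t.cols[2][j]
    end
    return s
end

# Function barrier: extract the columns once, then loop in a kernel that
# Julia specializes on the concrete vector types it is called with.
fast_sum(t::UntypedTable) = kernel(t.cols[1], t.cols[2])

function kernel(x, y)
    s = zero(eltype(x)) * zero(eltype(y))
    for xi in x, yj in y
        s += xi * yj
    end
    return s
end

t = UntypedTable(AbstractVector[collect(1:1000), collect(1:1000)])
@assert slow_sum(t) == fast_sum(t) == 250500250000
```

Both functions return the same value; only the place where the loop runs differs. Timing them with @btime should show the same kind of gap as f1 versus f2, though the exact numbers will depend on the machine.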

bkamins · April 14, 2022, 6:44am

Another option is:

julia> function f3(df)
           s = 0
           n = length(df.x)
           for i in 1:n
               for j in 1:n
                   s += df.x[i] * df.y[j]
               end
           end
           return s
       end;

julia> @btime f3(Tables.columntable($df))
  571.700 μs (9 allocations: 320 bytes)
250500250000
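
This works because Tables.columntable returns a NamedTuple of column vectors, and a NamedTuple records the type of each field in its own type. A minimal sketch with a hand-built NamedTuple of the same shape (nt and f3_like are illustrative names, and no DataFrame is involved):

```julia
# A NamedTuple of vectors, the same shape as the result of Tables.columntable.
nt = (x = collect(1:1000), y = collect(1:1000))

# Same loop as f3; cols.x and cols.y have inferable concrete types here.
function f3_like(cols)
    s = 0
    n = length(cols.x)
    for i in 1:n, j in 1:n
        s += cols.x[i] * cols.y[j]
    end
    return s
end

# The column types are visible to the compiler through the NamedTuple type.
@assert nt.x isa Vector{Int}
@assert f3_like(nt) == 250500250000
```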
