Thanks for chiming in. @bkamins' comments are way over my head. This looks very good indeed. Is it sensible to talk about the generalized case (finding all duplicates), or should we leave it at this?
@goerch Your solution looks much better! I actually just wanted to point out that the calculation should be embedded in a function, and I didn't do any benchmarks or tests.
@pdeffebach DataFramesMeta.jl looks very impressive! I will definitely take a closer look at the package!
No problem here. My intention is to understand what can and can't be done when transferring common SQL knowledge to DataFrames, and for now I've failed to construct the necessary self join in this example (and I suspect that in a case like this we are on our own to construct indices).
To understand @bkamins' brief comments above, I found another comment from him that provides the details.
Consider the following minimal example:
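using DataFrames

# Slow: df.x and df.y are looked up inside the hot loop,
# where the compiler does not know the column types.
function f1(df)
    s = 0
    n = nrow(df)
    for i in 1:n
        for j in 1:n
            s += df.x[i] * df.y[j]
        end
    end
    return s
end

# Fast: extract the columns once and pass them to a helper
# (a function barrier), which is compiled for the concrete column types.
f2(df) = helper(df.x, df.y, nrow(df))

function helper(x, y, n)
    s = 0
    for i in 1:n
        for j in 1:n
            s += x[i] * y[j]
        end
    end
    return s
end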
and now we benchmark it:
julia> using BenchmarkTools
julia> df = DataFrame(x=1:10^3, y=1:10^3);
julia> @btime f1($df)
114.867 ms (3952700 allocations: 60.31 MiB)
250500250000
julia> @btime f2($df)
490.200 μs (3 allocations: 48 bytes)
250500250000
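As far as I understand it, the gap between f1 and f2 comes from the fact that a DataFrame does not carry its column types in its own type, so the compiler cannot specialize the loop in f1. Here is a minimal sketch of the same effect without DataFrames. `UntypedTable`, `getcol`, `slow_sum`, `kernel` and `fast_sum` are hypothetical names made up for illustration; the struct is only a stand-in and not how DataFrames.jl is actually implemented.

```julia
# Hypothetical stand-in for a DataFrame: the container type says nothing
# about the concrete type of each column.
struct UntypedTable
    columns::Dict{Symbol, AbstractVector}
end

getcol(t::UntypedTable, name::Symbol) = t.columns[name]

# Columns are fetched inside the loop, so every element access works on
# a value whose concrete type is only known at run time.
function slow_sum(t)
    s = 0
    n = length(getcol(t, :x))
    for i in 1:n, j in 1:n
        s += getcol(t, :x)[i] * getcol(t, :y)[j]
    end
    return s
end

# Function barrier: `kernel` is compiled for the concrete types of x and y.
function kernel(x, y)
    s = 0
    for xi in x, yj in y
        s += xi * yj
    end
    return s
end

# The columns are fetched once, then handed over to the specialized kernel.
fast_sum(t) = kernel(getcol(t, :x), getcol(t, :y))

t = UntypedTable(Dict{Symbol, AbstractVector}(:x => collect(1:100),
                                              :y => collect(1:100)))
println(slow_sum(t) == fast_sum(t))
```

Both functions compute the same value; only the place where the columns are extracted differs. Timing them with @btime should show a gap similar in spirit to f1 versus f2, though I have not measured this stand-in.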
And, as @pdeffebach commented, in DataFramesMeta.jl there are convenience macros that simplify this process.
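For example, a sketch using the `@with` macro from DataFramesMeta.jl might look like this. It is modeled on the loop example in the DataFramesMeta.jl documentation; `f4` is a hypothetical name, and I have not benchmarked it against f2.

```julia
using DataFrames, DataFramesMeta

# `f4` is a made-up name for illustration. Inside @with, :x and :y refer
# to the columns of df; the macro wraps the body in a function that takes
# the columns as arguments, which acts as a function barrier.
function f4(df)
    @with df begin
        s = 0
        for i in 1:length(:x), j in 1:length(:y)
            s += :x[i] * :y[j]
        end
        s
    end
end

df = DataFrame(x = 1:10, y = 1:10)
println(f4(df))
```

The body is the same double loop as in f1, but the column lookup happens once, outside the loop.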
Another option is:
julia> function f3(df)
           s = 0
           n = length(df.x)
           for i in 1:n
               for j in 1:n
                   s += df.x[i] * df.y[j]
               end
           end
           return s
       end;
julia> @btime f3(Tables.columntable($df))
571.700 μs (9 allocations: 320 bytes)
250500250000
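The reason f3 can keep `df.x[i]` inside the loop is that `Tables.columntable` returns a NamedTuple of column vectors, and a NamedTuple carries the type of each field in its own type. A small sketch with a hand-built NamedTuple standing in for the result of `Tables.columntable(df)` (so no packages are needed; `cols` and `f3_sketch` are hypothetical names):

```julia
# Hand-built stand-in for the result of Tables.columntable(df)
cols = (x = collect(1:100), y = collect(1:100))

# The column types are part of the NamedTuple type itself
println(typeof(cols))

# Same loop as f3: cols.x and cols.y have known concrete types here,
# so the compiler can specialize the loop body.
function f3_sketch(cols)
    s = 0
    n = length(cols.x)
    for i in 1:n, j in 1:n
        s += cols.x[i] * cols.y[j]
    end
    return s
end

result = f3_sketch(cols)
println(result)
```

The trade-off, as far as I can tell, is that the function gets compiled separately for every distinct combination of column names and types.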