One way to avoid having broadcast allocate this matrix before summing is to give it temporary space to write into. If you do this many times, you can re-use the same space. This is `J3!` below.
I wrote a package for performing such reductions which just uses loops, without broadcasting to a matrix at all – `J4` below. You could probably get similar performance by writing out the for loops by hand. If the arrays are large enough, the multi-threading will also help.
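For reference, the hand-written loop version mentioned above might look something like this (a sketch, not from the thread — the name `J5` is mine). It accumulates each column's sum directly, so nothing the size of `dfdWdP` is ever allocated, and `Threads.@threads` parallelises over columns when Julia is started with multiple threads:

```julia
# Hypothetical hand-rolled equivalent of J/J2/J3!/J4: loop over columns,
# accumulate the weighted sum, never materialise the N×N intermediate.
function J5(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4)
    out = zeros(eltype(dfdWdP), size(dfdWdP, 2))
    Threads.@threads for j in axes(dfdWdP, 2)
        acc = zero(eltype(out))
        for i in axes(dfdWdP, 1)
            acc += dfdWdP[i, j] * (f[i] - f̄ - (f̃/g̃) * (g[i] - ḡ))
        end
        out[j] = acc
    end
    out
end
```

Note it returns a plain `Vector`, like `J4`, rather than the `N×1` adjoint the `sum(...; dims=1)'` versions produce; `≈` still compares them thanks to trailing singleton dimensions.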
julia> J(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP.*(f.- f̄ - ((f̃/g̃)*(g.-ḡ)))), dims=1)'; # original
# from DNF, more dots:
julia> J2(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';
# broadcast into pre-allocated matrix:
julia> J3!(tmp, dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum(tmp .= (dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';
julia> using Tullio, BenchmarkTools
# with @tullio, will multi-thread if large:
julia> J4(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = @tullio J[j] := dfdWdP[i,j] * (f[i] - f̄ - (f̃/g̃)*(g[i]-ḡ))
J4 (generic function with 5 methods)
julia> for N in [10, 1000]
           @show N
           dfdWdP = rand(N,N)
           f = rand(N)
           g = rand(N)
           tmp = similar(dfdWdP)
           res1 = @btime J($dfdWdP, $f, $g)
           res2 = @btime J2($dfdWdP, $f, $g)
           res3 = @btime J3!($tmp, $dfdWdP, $f, $g)
           res4 = @btime J4($dfdWdP, $f, $g)
           @assert res1 ≈ res2 ≈ res3 ≈ res4
       end
N = 10
min 174.988 ns, mean 258.013 ns (12 allocations, 1.64 KiB)
min 101.095 ns, mean 150.852 ns (4 allocations, 1.08 KiB)
min 89.367 ns, mean 99.166 ns (2 allocations, 160 bytes)
min 68.733 ns, mean 90.173 ns (3 allocations, 224 bytes)
N = 1000
min 490.875 μs, mean 1.372 ms (18 allocations, 7.67 MiB)
min 410.250 μs, mean 1.212 ms (6 allocations, 7.64 MiB)
min 341.000 μs, mean 419.709 μs (3 allocations, 7.95 KiB)
min 79.125 μs, mean 85.786 μs (57 allocations, 10.91 KiB)