Reducing Materialize in Profiling


I am profiling my code and roughly 75% of the following line’s time is spent in `materialize`. Any idea how to optimize this?

```julia
J = sum((dfdWdP.*(f.- f̄ - ((f̃/g̃)*(g.-ḡ)))), dims=1)'
```

`dfdWdP` is a `Matrix{Float64}`,
`f` and `g` are `Vector{Float64}`,
and `f̄`, `ḡ`, `f̃`, `g̃` are `Float64` scalars.

The issue is that the `sum` forces the materialization of the broadcast expression.

You might be able to use something like LazyArrays.jl to delay the broadcast expression and fuse the sum.
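Something along these lines might work (my own sketch, not from the thread: it assumes LazyArrays’ `@~` macro and `BroadcastArray` wrapper, and `J_lazy` is a hypothetical name):

```julia
using LazyArrays  # may need: ] add LazyArrays

# BroadcastArray wraps the broadcast expression lazily, so sum can
# reduce it element by element instead of allocating the full matrix.
J_lazy(dfdWdP, f, g, f̄, ḡ, f̃, g̃) =
    sum(BroadcastArray(@~ dfdWdP .* (f .- f̄ .- (f̃/g̃) .* (g .- ḡ))); dims=1)'
```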


It looks like there’s something off about your formatting. At first it seemed you had forgotten some multiplication signs, but in quotes `*` gets turned into italics, so they disappear. Use triple backticks instead.

Also, I think you have missed some dots, which will cause even more materialization and allocation. Try

```julia
J = sum((dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)
```

(There are probably too many parentheses in there, too.)


I think the quotes omitted the multiplication signs.

Your suggestion on the dots really helped. I still see a lot of materialization, but a little less than before. Thanks!

Yeah, quotes are for text, and backticks are for code. You can use single backticks for inline code.


One way to avoid having broadcast allocate this matrix before summing it is to give it temporary space to write into. If you do this many times, you can re-use the same space. This is `J3!` below.

I wrote a package for performing such reductions; it just uses loops instead of broadcasting into a matrix at all. This is `J4` below. You could probably get similar performance by writing out the for loops by hand. If the arrays are large enough, the multi-threading will also help.
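For reference, such a hand-written loop version might look like this (a sketch of my own, not from the thread; `J_loops` is a hypothetical name, using the same default scalars as the benchmark definitions below):

```julia
# Accumulate each column's reduction directly: no temporary matrix at all.
function J_loops(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4)
    n, m = size(dfdWdP)
    out = zeros(m)
    r = f̃ / g̃
    @inbounds for j in 1:m, i in 1:n
        out[j] += dfdWdP[i, j] * (f[i] - f̄ - r * (g[i] - ḡ))
    end
    return out
end
```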

```julia
julia> J(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP.*(f.- f̄ - ((f̃/g̃)*(g.-ḡ)))), dims=1)';  # original

# from DNF, more dots:
julia> J2(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';

# broadcast into pre-allocated matrix:
julia> J3!(tmp, dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum(tmp .= (dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';

julia> using Tullio, BenchmarkTools

# with @tullio, will multi-thread if large:
julia> J4(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = @tullio J[j] := dfdWdP[i,j] * (f[i] - f̄ - (f̃/g̃)*(g[i]-ḡ))
J4 (generic function with 5 methods)

julia> for N in [10, 1000]
         @show N
         dfdWdP = rand(N,N)
         f = rand(N)
         g = rand(N)
         tmp = similar(dfdWdP)
         res1 = @btime J($dfdWdP, $f, $g)
         res2 = @btime J2($dfdWdP, $f, $g)
         res3 = @btime J3!($tmp, $dfdWdP, $f, $g)
         res4 = @btime J4($dfdWdP, $f, $g)
         @assert res1 ≈ res3 ≈ res4
       end
N = 10
  min 174.988 ns, mean 258.013 ns (12 allocations, 1.64 KiB)
  min 101.095 ns, mean 150.852 ns (4 allocations, 1.08 KiB)
  min 89.367 ns, mean 99.166 ns (2 allocations, 160 bytes)
  min 68.733 ns, mean 90.173 ns (3 allocations, 224 bytes)
N = 1000
  min 490.875 μs, mean 1.372 ms (18 allocations, 7.67 MiB)
  min 410.250 μs, mean 1.212 ms (6 allocations, 7.64 MiB)
  min 341.000 μs, mean 419.709 μs (3 allocations, 7.95 KiB)
  min 79.125 μs, mean 85.786 μs (57 allocations, 10.91 KiB)
```