One way to avoid having broadcast allocate this matrix before summing is to give it temporary space to write into. If you do this many times, you can re-use the same space. This is `J3!` below.
I wrote a package for performing such reductions which just uses loops, without broadcasting to a matrix at all – `J4` below. You could probably get similar performance by writing out the for loops by hand. If the arrays are large enough, the multi-threading will also help.
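For reference, the hand-written loop version mentioned above might look something like this (a sketch, not from the thread — the name `J5` is mine). It accumulates each column's sum directly, so nothing the size of `dfdWdP` is ever allocated, and `Threads.@threads` parallelises over columns when Julia is started with multiple threads:

```julia
# Hypothetical hand-rolled equivalent of J/J2/J3!/J4: loop over columns,
# accumulate the weighted sum, never materialise the N×N intermediate.
function J5(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4)
    out = zeros(eltype(dfdWdP), size(dfdWdP, 2))
    Threads.@threads for j in axes(dfdWdP, 2)
        acc = zero(eltype(out))
        for i in axes(dfdWdP, 1)
            acc += dfdWdP[i, j] * (f[i] - f̄ - (f̃/g̃) * (g[i] - ḡ))
        end
        out[j] = acc
    end
    out
end
```

Note it returns a plain `Vector`, like `J4`, rather than the `N×1` adjoint the `sum(...; dims=1)'` versions produce; `≈` still compares them thanks to trailing singleton dimensions.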
julia> J(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP.*(f.- f̄ - ((f̃/g̃)*(g.-ḡ)))), dims=1)'; # original
# from DNF, more dots:
julia> J2(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum((dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';
# broadcast into pre-allocated matrix:
julia> J3!(tmp, dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = sum(tmp .= (dfdWdP .* (f .- f̄ .- ((f̃/g̃).*(g.-ḡ)))); dims=1)';
julia> using Tullio, BenchmarkTools
# with @tullio, will multi-thread if large:
julia> J4(dfdWdP, f, g, f̄=0.1, ḡ=0.2, f̃=0.3, g̃=0.4) = @tullio J[j] := dfdWdP[i,j] * (f[i] - f̄ - (f̃/g̃)*(g[i]-ḡ))
J4 (generic function with 5 methods)
julia> for N in [10, 1000]
           @show N
           dfdWdP = rand(N,N)
           f = rand(N)
           g = rand(N)
           tmp = similar(dfdWdP)
           res1 = @btime J($dfdWdP, $f, $g)
           res2 = @btime J2($dfdWdP, $f, $g)
           res3 = @btime J3!($tmp, $dfdWdP, $f, $g)
           res4 = @btime J4($dfdWdP, $f, $g)
           @assert res1 ≈ res2 ≈ res3 ≈ res4
       end
N = 10
min 174.988 ns, mean 258.013 ns (12 allocations, 1.64 KiB)
min 101.095 ns, mean 150.852 ns (4 allocations, 1.08 KiB)
min 89.367 ns, mean 99.166 ns (2 allocations, 160 bytes)
min 68.733 ns, mean 90.173 ns (3 allocations, 224 bytes)
N = 1000
min 490.875 μs, mean 1.372 ms (18 allocations, 7.67 MiB)
min 410.250 μs, mean 1.212 ms (6 allocations, 7.64 MiB)
min 341.000 μs, mean 419.709 μs (3 allocations, 7.95 KiB)
min 79.125 μs, mean 85.786 μs (57 allocations, 10.91 KiB)