How to avoid memory allocation with long expressions

const ⊕ = +
const ⊗ = *
    @. dTA_dH +=
        -KH2PO4 ⊗ KH3PO4 ⊗ TH3PO4 ⊗
        (2 ⊗ H^3 ⊕ H^2 ⊗ KH3PO4 - KH2PO4 ⊗ KH3PO4 ⊗ KHPO4) /
        (H^3 ⊕ H^2 ⊗ KH3PO4 ⊕ H ⊗ KH2PO4 ⊗ KH3PO4 ⊕ KH2PO4 ⊗ KH3PO4 ⊗ KHPO4)^2

All the variables here are either global constants or preallocated.
Earlier I learned that using + or * causes memory allocation in long expressions, so I started to use ⊕ and ⊗, which did the job until this expression. I don’t understand why this still allocates memory.

const ⊕ = +
const ⊗ = *

will not change anything regarding memory; it does nothing to change the way your code works.

If you could provide a full MWE and how you are measuring memory usage, that would be very useful.

I was told that + causes temporary allocations because it’s n-ary, and ⊕ makes it binary (left-associative), so there’s no temporary allocation. It has worked well for me.
Here is the MWE


dTA_dH = zeros(100)
H = zeros(100)

function test(dTA_dH, H)
    @. dTA_dH +=
        -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
        (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
end

@benchmark test($dTA_dH, $H)
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  10
  --------------
  minimum time:     331.416 ns (0.00% GC)
  median time:      353.097 ns (0.00% GC)
  mean time:        376.715 ns (2.13% GC)
  maximum time:     81.230 μs (98.97% GC)
  --------------
  samples:          10000
  evals/sample:     226

Like I said, using + allocates even more

function test2(dTA_dH, H)
    @. dTA_dH +=
        -1.0 * 1.0 * 1.0 * (2 * H^3 + H^2 * 1.0 - 1.0 * 1.0 * 1.0) /
        (H^3 + H^2 * 1.0 + H * 1.0 * 1.0 + 1.0 * 1.0 * 1.0)^2
end

@benchmark test2($dTA_dH, $H)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  15
  --------------
  minimum time:     392.040 ns (0.00% GC)
  median time:      423.881 ns (0.00% GC)
  mean time:        459.663 ns (4.76% GC)
  maximum time:     77.091 μs (99.23% GC)
  --------------
  samples:          10000
  evals/sample:     201

Huh, you are totally right

julia> t = :(a * b * c)
:(a * b * c)

julia> dump(t)
Expr
  head: Symbol call
  args: Array{Any}((4,))
    1: Symbol *
    2: Symbol a
    3: Symbol b
    4: Symbol c

julia> t = :(a ⊕ b ⊕ c)
:((a ⊕ b) ⊕ c)

julia> dump(t)
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ⊕
    2: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol ⊕
        2: Symbol a
        3: Symbol b
    3: Symbol c

I don’t know what’s going on. Hopefully someone more informed can comment.

It’s not the operators +, * or ⊕, ⊗ that give different allocations, but the parentheses:

julia> @macroexpand @. dTA_dH +=
               -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
               (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
:(dTA_dH .+= (/).((⊗).((⊗).((⊗).(-1.0, 1.0), 1.0), (-).((⊕).((⊗).(2, (^).(H, 3)), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0))), (^).((⊕).((⊕).((⊕).((^).(H, 3), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(H, 1.0), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0)), 2)))

julia> function test2(dTA_dH, H)
          dTA_dH .+= (/).((⊗).((⊗).((⊗).(-1.0, 1.0), 1.0), (-).((⊕).((⊗).(2, (^).(H, 3)), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0))), (^).((⊕).((⊕).((⊕).((^).(H, 3), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(H, 1.0), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0)), 2))
       end
test2 (generic function with 1 method)

julia> @btime test2($dTA_dH, $H);
  298.513 ns (10 allocations: 80 bytes)

## Now changing ⊗ to * and ⊕ to + in test2 above to define test:

julia> function test(dTA_dH, H)
          dTA_dH .+= (/).((*).((*).((*).(-1.0, 1.0), 1.0), (-).((+).((*).(2, (^).(H, 3)), (*).((^).(H, 2), 1.0)), (*).((*).(1.0, 1.0), 1.0))), (^).((+).((+).((+).((^).(H, 3), (*).((^).(H, 2), 1.0)), (*).((*).(H, 1.0), 1.0)), (*).((*).(1.0, 1.0), 1.0)), 2))
       end
test (generic function with 1 method)

julia> @btime test($dTA_dH, $H);
  298.092 ns (10 allocations: 80 bytes)

With the above example you can see that using exactly the same parentheses in test and test2, with only different symbols for the operators, results in the same allocations.

You can check the other way round starting with:

julia> @macroexpand @. dTA_dH +=
               -1.0 * 1.0 * 1.0 * (2 * H^3 + H^2 * 1.0 - 1.0 * 1.0 * 1.0) /
               (H^3 + H^2 * 1.0 + H * 1.0 * 1.0 + 1.0 * 1.0 * 1.0)^2
:(dTA_dH .+= (/).((*).(-1.0, 1.0, 1.0, (-).((+).((*).(2, (^).(H, 3)), (*).((^).(H, 2), 1.0)), (*).(1.0, 1.0, 1.0))), (^).((+).((^).(H, 3), (*).((^).(H, 2), 1.0), (*).(H, 1.0, 1.0), (*).(1.0, 1.0, 1.0)), 2)))

and you see that both functions now allocate the higher amount.

So the allocations depend on the order in which the operations are performed.

But how to optimize this is quite a bit over my horizon.

To be clear, the operator choice does matter because + and * are given special treatment by the parser. The fact that you’ve done const ⊕ = + doesn’t make those two symbols equivalent from the parser’s perspective, because the parser doesn’t care about what anything means, just what symbols and values comprise it. That’s why you see different dump outputs for the two symbols, @pdeffebach.

The relevant part of the parser is here: julia/julia-parser.scm at ecea238e5577c1b4a83de4c3ec4de1a13cdf56a1 · JuliaLang/julia · GitHub

It’s also correct that you can override the parser’s behavior by manually adding parentheses, in which case the behavior of both operators will be exactly the same.
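A minimal sketch of that equivalence (expression variables a, b, c here are just placeholders): with explicit parentheses, + parses into the same nested binary call shape that ⊕ gets by default.

```julia
# Sketch: compare how the parser shapes n-ary +, parenthesized +,
# and a user-defined operator aliased to +.
const ⊕ = +

t_nary   = :(a + b + c)     # one flat call: (:call, :+, :a, :b, :c)
t_parens = :((a + b) + c)   # nested binary calls
t_custom = :(a ⊕ b ⊕ c)     # ⊕ parses left-associatively, like t_parens

@assert length(t_nary.args) == 4    # flat, n-ary
@assert length(t_parens.args) == 3  # binary
@assert t_parens.args[2] isa Expr   # inner (a + b)
@assert t_custom.args[2] isa Expr   # inner (a ⊕ b): same nested shape
```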

Of course this does not solve the broadcasting issue, but writing out the loop does:

julia> function test2(dTA_dH, H)
           for i in eachindex(dTA_dH)
              dTA_dH[i] = -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H[i]^3 ⊕ H[i]^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
                      (H[i]^3 ⊕ H[i]^2 ⊗ 1.0 ⊕ H[i] ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
           end
       end
test2 (generic function with 1 method)

julia> @btime test2($dTA_dH,$H);
  175.491 ns (0 allocations: 0 bytes)


(and the result is the same with + and *, just to be explicit)

Note: adding @inbounds to that loop brings the time down to 71 ns.
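For reference, a sketch of that variant (same loop as test2 above, just annotated with @inbounds; the function name is made up). Skipping bounds checks is safe here because eachindex(dTA_dH, H) only yields valid shared indices and throws up front if the shapes differ.

```julia
const ⊕ = +
const ⊗ = *

function test2_inbounds(dTA_dH, H)
    # Same body as the loop above; @inbounds skips the bounds checks.
    @inbounds for i in eachindex(dTA_dH, H)  # errors early on a shape mismatch
        dTA_dH[i] = -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H[i]^3 ⊕ H[i]^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
                    (H[i]^3 ⊕ H[i]^2 ⊗ 1.0 ⊕ H[i] ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
    end
end
```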

I know writing a for loop would not allocate. But that kind of defeats the purpose, right?
Does anybody have a simple solution to this problem, with minimal code modification? Such expressions are really common in my code, I don’t want to rewrite everything.

Yes. But then with ⊕ it should be parsed “correctly”. Why are there still allocations?

⊕ is parsed differently from +, so when you dump it you see that, and it is equivalent to adding the parentheses you did.

julia> using FastBroadcast, BenchmarkHistograms

julia> dTA_dH = zeros(100);

julia> H = zeros(100);

julia> const ⊕ = +
+ (generic function with 378 methods)

julia> const ⊗ = *
* (generic function with 590 methods)

julia> function test(dTA_dH, H)
           @.. dTA_dH +=
               -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
               (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
       end
test (generic function with 1 method)

julia> @benchmark test($dTA_dH, $H)
samples: 10000; evals/sample: 967; memory estimate: 0 bytes; allocs estimate: 0
ns

 (81.75 - 82.04 ]  ██████████████████████████████ 9191
 (82.04 - 82.33 ]   0
 (82.33 - 82.62 ]  ▏1
 (82.62 - 82.91 ]  ▌124
 (82.91 - 83.2  ]  ██600
 (83.2  - 83.49 ]  ▎48
 (83.49 - 83.78 ]  ▏18
 (83.78 - 84.07 ]  ▏1
 (84.07 - 84.36 ]  ▏1
 (84.36 - 84.65 ]  ▏2
 (84.65 - 84.94 ]  ▏1
 (84.94 - 85.23 ]  ▏1
 (85.23 - 85.52 ]   0
 (85.52 - 85.81 ]  ▏2
 (85.81 - 124.26]  ▏10

                  Counts

min: 81.751 ns (0.00% GC); mean: 81.943 ns (0.00% GC); median: 81.840 ns (0.00% GC); max: 124.259 ns (0.00% GC).

Note the extra . in @.. from FastBroadcast.jl.

Great! Why isn’t this the DEFAULT/Base in Julia?!

I was quite sure that @Elrod would chime in :wink: :+1: :muscle:

Yingbo and I wrote it a few weeks ago.
It should (i.e., barring bugs) work wherever regular broadcasting works.

FastBroadcast.jl was designed to avoid these sorts of problems: it can scale up to much larger expressions without the allocations seen here, or the failures to SIMD, that plague base broadcasting.

However, FastBroadcast.jl only tries to be fast if you’re not actually broadcasting dynamically.
That is, it tries to make rand(4,5) .+ rand(4,5) .* rand(4) ./ rand(5)' fast (the two broadcasts here, rand(4) across columns and rand(5)' across rows, are both known to be broadcasting at compile time, since their size-1 axes are compile-time constants).

But it doesn’t try to make rand(4,5) .+ rand(1,5) fast. FastBroadcast is used mostly by the DifferentialEquations ecosystem, which doesn’t really have broadcasts like this, so treating these as a fairly cold case in favor of compile times made sense.
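A minimal sketch of the drop-in usage (array names here are made up; this assumes FastBroadcast.jl is installed and uses the same @.. form shown earlier in the thread):

```julia
using FastBroadcast

a = zeros(4, 5)
b = ones(4, 5)

# Shapes match exactly, so @.. can take its fast, loop-like path
# instead of the slower path that handles dynamic size-1 axes.
@.. a += 2.0 * b
```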

Nice @elrod, that fast broadcasting in this simple example is equivalent (in performance) to the explicit loop with @inbounds, at least here. Does that deserve any further comment? (Will it be faster or slower in other situations?)

It should generally nearly match @inbounds for ... loops.
It uses an @generated function to basically create:

if # check that all dynamic sizes match (so no 1s vs >1s)
    @inbounds for ... # fast loops indexed normally
        # all indexes look like:
        #  `x[i]`
    end
else
    # slow loops that handle broadcasting, e.g.
    for i in ...
        # all indexes look like:
        # `x[ifelse(size(x,1) == 1, firstindex(x,1), i)]`
    end
end

It doesn’t do anything fancy for optimizations, just relying on Julia/LLVM there.
So (for now) any explicitly written loops should be at least as fast, thanks to avoiding the need for the check.

We’ll probably add Polyester threading support, and we’ve also talked about maybe substituting in SIMD special functions, but FastBroadcast is so fast to load at the moment that it’d be a shame to make it heavy.
