How to avoid memory allocation with long expressions

const ⊕ = +
const ⊗ = *
    @. dTA_dH +=
        -KH2PO4 ⊗ KH3PO4 ⊗ TH3PO4 ⊗
        (2 ⊗ H^3 ⊕ H^2 ⊗ KH3PO4 - KH2PO4 ⊗ KH3PO4 ⊗ KHPO4) /
        (H^3 ⊕ H^2 ⊗ KH3PO4 ⊕ H ⊗ KH2PO4 ⊗ KH3PO4 ⊕ KH2PO4 ⊗ KH3PO4 ⊗ KHPO4)^2

All the variables here are either global constants or preallocated.
Earlier I learned that using + or * causes memory allocation in long expressions, so I started to use ⊕ and ⊗, which did the job until this expression. I don’t understand why this still allocates memory.

const ⊕ = +
const ⊗ = *

will not change anything regarding memory; it does nothing to change the way your code works.

If you could provide a full MWE and how you are measuring memory usage, that would be very useful.

I was told that + causes temporary allocations because it’s n-ary, and ⊕ makes it binary (left-associative), so there’s no temporary allocation. It has worked well for me.
Here is the MWE


dTA_dH = zeros(100)
H = zeros(100)

function test(dTA_dH, H)
    @. dTA_dH +=
        -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
        (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
end

@benchmark test($dTA_dH, $H)
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  10
  --------------
  minimum time:     331.416 ns (0.00% GC)
  median time:      353.097 ns (0.00% GC)
  mean time:        376.715 ns (2.13% GC)
  maximum time:     81.230 μs (98.97% GC)
  --------------
  samples:          10000
  evals/sample:     226

Like I said, using + allocates even more

function test2(dTA_dH, H)
    @. dTA_dH +=
        -1.0 * 1.0 * 1.0 * (2 * H^3 + H^2 * 1.0 - 1.0 * 1.0 * 1.0) /
        (H^3 + H^2 * 1.0 + H * 1.0 * 1.0 + 1.0 * 1.0 * 1.0)^2
end

@benchmark test2($dTA_dH, $H)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  15
  --------------
  minimum time:     392.040 ns (0.00% GC)
  median time:      423.881 ns (0.00% GC)
  mean time:        459.663 ns (4.76% GC)
  maximum time:     77.091 μs (99.23% GC)
  --------------
  samples:          10000
  evals/sample:     201

Huh, you are totally right

julia> t = :(a * b * c)
:(a * b * c)

julia> dump(t)
Expr
  head: Symbol call
  args: Array{Any}((4,))
    1: Symbol *
    2: Symbol a
    3: Symbol b
    4: Symbol c

julia> t = :(a ⊕ b ⊕ c)
:((a ⊕ b) ⊕ c)

julia> dump(t)
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ⊕
    2: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol ⊕
        2: Symbol a
        3: Symbol b
    3: Symbol c

I don’t know what’s going on. Hopefully someone more informed can comment.

It’s not the operators +, * or ⊕, ⊗ that give different allocations, but the parentheses:

julia> @macroexpand @. dTA_dH +=
               -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
               (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
:(dTA_dH .+= (/).((⊗).((⊗).((⊗).(-1.0, 1.0), 1.0), (-).((⊕).((⊗).(2, (^).(H, 3)), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0))), (^).((⊕).((⊕).((⊕).((^).(H, 3), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(H, 1.0), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0)), 2)))

julia> function test2(dTA_dH, H)
          dTA_dH .+= (/).((⊗).((⊗).((⊗).(-1.0, 1.0), 1.0), (-).((⊕).((⊗).(2, (^).(H, 3)), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0))), (^).((⊕).((⊕).((⊕).((^).(H, 3), (⊗).((^).(H, 2), 1.0)), (⊗).((⊗).(H, 1.0), 1.0)), (⊗).((⊗).(1.0, 1.0), 1.0)), 2))
       end
test2 (generic function with 1 method)

julia> @btime test2($dTA_dH, $H);
  298.513 ns (10 allocations: 80 bytes)

## Now changing ⊗ to * and ⊕ to + in test2 above to define test:

julia> function test(dTA_dH, H)
          dTA_dH .+= (/).((*).((*).((*).(-1.0, 1.0), 1.0), (-).((+).((*).(2, (^).(H, 3)), (*).((^).(H, 2), 1.0)), (*).((*).(1.0, 1.0), 1.0))), (^).((+).((+).((+).((^).(H, 3), (*).((^).(H, 2), 1.0)), (*).((*).(H, 1.0), 1.0)), (*).((*).(1.0, 1.0), 1.0)), 2))
       end
test (generic function with 1 method)

julia> @btime test($dTA_dH, $H);
  298.092 ns (10 allocations: 80 bytes)

With the above example you can see that using exactly the same parentheses in test and test2, with only different symbols for the operators, results in the same allocations.

You can check the other way round starting with:

julia> @macroexpand @. dTA_dH +=
               -1.0 * 1.0 * 1.0 * (2 * H^3 + H^2 * 1.0 - 1.0 * 1.0 * 1.0) /
               (H^3 + H^2 * 1.0 + H * 1.0 * 1.0 + 1.0 * 1.0 * 1.0)^2
:(dTA_dH .+= (/).((*).(-1.0, 1.0, 1.0, (-).((+).((*).(2, (^).(H, 3)), (*).((^).(H, 2), 1.0)), (*).(1.0, 1.0, 1.0))), (^).((+).((^).(H, 3), (*).((^).(H, 2), 1.0), (*).(H, 1.0, 1.0), (*).(1.0, 1.0, 1.0)), 2)))

and you see that both functions now allocate the higher amount.

So the allocations depend on the order in which the operations are performed.

But how to optimize this is quite a bit over my horizon.

To be clear, the operator choice does matter because + and * are given special treatment by the parser. The fact that you’ve done const ⊕ = + doesn’t make those two symbols equivalent from the parser’s perspective, because the parser doesn’t care about what anything means, just what symbols and values comprise it. That’s why you see different dump outputs for the two symbols, @pdeffebach.

The relevant part of the parser is here: julia/julia-parser.scm at ecea238e5577c1b4a83de4c3ec4de1a13cdf56a1 · JuliaLang/julia · GitHub

It’s also correct that you can override the parser’s behavior by manually adding parentheses, in which case the behavior of both operators will be exactly the same.
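A minimal sketch of that equivalence (expression variables a, b, c here are just placeholders): with explicit parentheses, + parses into the same nested binary call shape that ⊕ gets by default.

```julia
# Sketch: compare how the parser shapes n-ary +, parenthesized +,
# and a user-defined operator aliased to +.
const ⊕ = +

t_nary   = :(a + b + c)     # one flat call: (:call, :+, :a, :b, :c)
t_parens = :((a + b) + c)   # nested binary calls
t_custom = :(a ⊕ b ⊕ c)     # ⊕ parses left-associatively, like t_parens

@assert length(t_nary.args) == 4    # flat, n-ary
@assert length(t_parens.args) == 3  # binary
@assert t_parens.args[2] isa Expr   # inner (a + b)
@assert t_custom.args[2] isa Expr   # inner (a ⊕ b): same nested shape
```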

Of course this does not solve the broadcasting issue, but writing out the loop does:

julia> function test2(dTA_dH, H)
           for i in eachindex(dTA_dH)
              dTA_dH[i] = -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H[i]^3 ⊕ H[i]^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
                      (H[i]^3 ⊕ H[i]^2 ⊗ 1.0 ⊕ H[i] ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
           end
       end
test2 (generic function with 1 method)

julia> @btime test2($dTA_dH,$H);
  175.491 ns (0 allocations: 0 bytes)


(and the result is the same with + and *, just to be explicit)

Note: adding @inbounds to that loop brings the time down to 71 ns.
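For reference, a sketch of that variant (same loop as test2 above, just annotated with @inbounds; the function name is made up). Skipping bounds checks is safe here because eachindex(dTA_dH, H) only yields valid shared indices and throws up front if the shapes differ.

```julia
const ⊕ = +
const ⊗ = *

function test2_inbounds(dTA_dH, H)
    # Same body as the loop above; @inbounds skips the bounds checks.
    @inbounds for i in eachindex(dTA_dH, H)  # errors early on a shape mismatch
        dTA_dH[i] = -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H[i]^3 ⊕ H[i]^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
                    (H[i]^3 ⊕ H[i]^2 ⊗ 1.0 ⊕ H[i] ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
    end
end
```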

I know writing a for loop would not allocate. But that kind of defeats the purpose, right?
Does anybody have a simple solution to this problem, with minimal code modification? Such expressions are really common in my code, I don’t want to rewrite everything.

Yes. But then with ⊕ it should be parsed “correctly”. Why are there still allocations?

⊕ is parsed differently from +, so when you dump it you see that, and it is equivalent to adding the parentheses you did.

julia> using FastBroadcast, BenchmarkHistograms

julia> dTA_dH = zeros(100);

julia> H = zeros(100);

julia> const ⊕ = +
+ (generic function with 378 methods)

julia> const ⊗ = *
* (generic function with 590 methods)

julia> function test(dTA_dH, H)
           @.. dTA_dH +=
               -1.0 ⊗ 1.0 ⊗ 1.0 ⊗ (2 ⊗ H^3 ⊕ H^2 ⊗ 1.0 - 1.0 ⊗ 1.0 ⊗ 1.0) /
               (H^3 ⊕ H^2 ⊗ 1.0 ⊕ H ⊗ 1.0 ⊗ 1.0 ⊕ 1.0 ⊗ 1.0 ⊗ 1.0)^2
       end
test (generic function with 1 method)

julia> @benchmark test($dTA_dH, $H)
samples: 10000; evals/sample: 967; memory estimate: 0 bytes; allocs estimate: 0
ns

 (81.75 - 82.04 ]  ██████████████████████████████ 9191
 (82.04 - 82.33 ]   0
 (82.33 - 82.62 ]  ▏1
 (82.62 - 82.91 ]  ▌124
 (82.91 - 83.2  ]  ██600
 (83.2  - 83.49 ]  ▎48
 (83.49 - 83.78 ]  ▏18
 (83.78 - 84.07 ]  ▏1
 (84.07 - 84.36 ]  ▏1
 (84.36 - 84.65 ]  ▏2
 (84.65 - 84.94 ]  ▏1
 (84.94 - 85.23 ]  ▏1
 (85.23 - 85.52 ]   0
 (85.52 - 85.81 ]  ▏2
 (85.81 - 124.26]  ▏10

                  Counts

min: 81.751 ns (0.00% GC); mean: 81.943 ns (0.00% GC); median: 81.840 ns (0.00% GC); max: 124.259 ns (0.00% GC).

Note the extra . in @.. from FastBroadcast.jl.

Great! Why isn’t this the DEFAULT/Base in Julia?!

I was quite sure that @Elrod would chime in :wink: :+1: :muscle:

Yingbo and I wrote it a few weeks ago.
It should (i.e., barring bugs) work wherever regular broadcasting works.

FastBroadcast.jl was designed to avoid these sorts of problems: it can scale up to much larger expressions without the allocations seen here, or the failures to SIMD, that plague base broadcasting.

However, FastBroadcast.jl only tries to be fast if you’re not actually broadcasting dynamically.
That is, it tries to make rand(4,5) .+ rand(4,5) .* rand(4) ./ rand(5)' fast (the two broadcasts here, rand(4) across columns and rand(5)' across rows, are both known to be broadcasting at compile time, since their size-1 axes are compile-time constants).

But it doesn’t try to make rand(4,5) .+ rand(1,5) fast. FastBroadcast is used mostly by the DifferentialEquations ecosystem, which doesn’t really have broadcasts like this, so treating these as a fairly cold case in favor of compile times made sense.
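A minimal sketch of the drop-in usage (array names here are made up; this assumes FastBroadcast.jl is installed and uses the same @.. form shown earlier in the thread):

```julia
using FastBroadcast

a = zeros(4, 5)
b = ones(4, 5)

# Shapes match exactly, so @.. can take its fast, loop-like path
# instead of the slower path that handles dynamic size-1 axes.
@.. a += 2.0 * b
```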

Nice @elrod, that fast broadcasting in this simple example is equivalent (in performance) to the explicit loop with @inbounds, at least here. Does that deserve any further comment? (Will it be faster or slower in other situations?)

It should generally nearly match @inbounds for ... loops.
It uses an @generated function to basically create:

if # check that all dynamic sizes match (so no 1s vs >1s)
    @inbounds for ... # fast loops indexed normally
        # all indexes look like:
        #  `x[i]`
    end
else
    # slow loops that handle broadcasting, e.g.
    for i in ...
        # all indexes look like:
        # `x[ifelse(size(x,1) == 1, firstindex(x,1), i)]`
    end
end

It doesn’t do anything fancy for optimizations, just relying on Julia/LLVM there.
So (for now) any explicitly written loops should be at least as fast, thanks to avoiding the need for the check.

We’ll probably add Polyester threading support, and we’ve also talked about maybe substituting in SIMD special functions, but FastBroadcast is so fast to load at the moment that it’d be a shame to make it heavy.
