Preventing broadcast fusing

pengwyn · February 2, 2019, 3:25am

I have found in a particular function of mine that it is fastest to perform a calculation in 4 steps. I can logically write this down as one line of code, but the fusing syntax of broadcasting ends up being too greedy and slowing it down.

Is there a way to fence off expressions of code to indicate a boundary for the fusing syntax?

The code in my case is something very similar to:

n = 50
vector = rand(n)
a = reshape(vector, n,1,1)
b = reshape(vector, 1,n,1)
c = reshape(vector, 1,1,n)

# What I would like to do:
out = exp.(a) .* exp.(b) .* exp.(c)

# What is fastest
temp_a = exp.(a)
temp_b = exp.(b)
temp_c = exp.(c)
out = temp_a .* temp_b .* temp_c

The speed difference in benchmarking is roughly a factor of 200 for the above example.

I can guess that the single-line version currently unnecessarily recalculates the exponential for a, b and c in every position of the output array, instead of only once for each row, etc… Is there anyway I could write
out = (@barrier exp.(a)) .* (@barrier exp.(b)) .* ...
instead? I guess I should mention that a is actually a more complicated expression itself with several operations fused together.

c42f · February 2, 2019, 3:46am

A normal (non-dotted) function call acts as a barrier so you could insert a few calls to identity I suppose

out = identity(exp.(a)) .* identity(exp.(b)) .* identity(exp.(c))

pengwyn · February 2, 2019, 4:13am

That works brilliantly! Thanks!

DNF · February 2, 2019, 7:09am

Why are you calculating exp of the same vector three times, instead of

a = exp.(vector)
b = reshape(a, 1, :)
c = reshape(a, 1, 1, :) 
out = a .* b .* c

?

Also, out is quite sparse in terms of information. Do you really need to blow it up in full 3D? Depending on what you need to do later, might you be able to avoid constructing it at all? I just ask because I often see constructions like these that are really redundant.

bennedich · February 2, 2019, 7:36am

That was a clever solution! But seeing code like that, it wouldn’t be obvious to me what’s going on. I’d probably opt for the 4 line version for more readability, especially since you say that “a is actually a more complicated expression itself with several operations fused together”.

Alternatively, I’d add a comment like “identity is used to prevent too eager fusing” – to explain what’s going on, and prevent some too clever programmer seeing your code deciding to remove the “unnecessary” identity methods.

I’m guessing it’s a minimal example and not the actual code?

DNF · February 2, 2019, 7:43am

It’s not a minimal example if it leaves obvious improvements on the table, imho. Also, very obvious optimizations in a question about optimization are really distracting.

DNF · February 2, 2019, 7:57am

The broadcasting @. macro has a splicing notation that uses $. It doesn’t apply to this type of case, though. At least I think so. Maybe something similar could be made to work somehow.

@. $(exp(a)) * $(exp(b)) * $(exp(c)) or @. exp(a)) $* exp(b) $* exp(c) (neither of these any good, or make total sense). What could be a good notation for this?

bennedich · February 2, 2019, 9:03am

If indeed it’s one and the same vector, there’s no need for reshape and multiple variables:

a = exp.(rand(50))
out = [i*j*k for i=a, j=a, k=a]

pengwyn · February 2, 2019, 9:22am

Ah yes @DNF I simplified too far in my example. Mathematically I’m actually just calculating the Gaussian A^3 exp(-|r - grid|^2 / 2width^2). So a more minimal example would be:

one_dim = range(-5, 5, length=51)
x = reshape(one_dim, :,1,1)
y = reshape(one_dim, 1,:,1)
z = reshape(one_dim, 1,1,:)

pos = [0.1,2.2,-3.5]
width = 0.5
out = identity(@. exp(-(x - pos[1])^2 / width^2 / 2)) .* 
      identity(@. exp(-(y - pos[2])^2 / width^2 / 2)) .* 
      identity(@. exp(-(z - pos[3])^2 / width^2 / 2))

Actually I try and not work explicitly with the 3 coordinates separately but my first attempt at this function was very slow, because I was calling exp on all grid points (not due to fusing but by first calculating r^2 = x^2 + y^2 + z^2). So I set to going through a more conventional approach to try and speed it up and came across the reason for my original question.

My code is now using prod instead, and this also breaks the fusion in broadcasting. But I know now how to handle these issues in the future!

Thanks @DNF for the reshape(..., :, ...) tip. I didn’t know that was possible.

@bennedich, it wasn’t too hard to read originally, because I had put it in 4 lines anyway but as connected lines with .* at the end of them. That way it was readable but I fell into the trap of everything fusing.

Mathematically, I can’t see a way out of needing all these values on the full grid. This minimal example is actually looped over for many different positions, which are accumulated into an array. After this, convolutions + thresholding are performed on the full array.

bennedich · February 2, 2019, 9:46am

An alternative way, avoiding the repeated code, reshapes, and identity “hack”:

r = range(-5, 5, length=51) # or -5:0.2:5
pos = [0.1, 2.2, -3.5]
width = 0.5
f = (x,p) -> exp(-(x - p)^2 / width^2 / 2)
out = [f(x, pos[1]) * f(y, pos[2]) * f(z, pos[3]) for x=r, y=r, z=r]

improbable22 · February 2, 2019, 10:47am

I’ve frequently been caught by the fact that broadcasting evaluates a function at every output point, not every value of its argument.

One alternative to the identity trick is to use a vectorised function instead of broadcasting exp, something like this:

using Yeppp: exp!
pre = -1 / width^2 /2
out = exp!(@. pre * (x - pos[1])^2 ) .* 
      exp!(@. pre * (y - pos[2])^2 ) .* 
      exp!(@. pre * (z - pos[3])^2 )

This seems more readable, and since the argument of each exp! creates a new array, you are free to over-write it without touching x. Instead of Yeppp you could just define your own function.

stevengj · February 2, 2019, 2:09pm

LICM for pure functions

opened 12:58PM - 20 Sep 18 UTC

SimonDanisch

optimizer feature

Julia 1.0: I just realized, that this is a gotcha one easily runs into, espec…ially when using the `@.` macro: ```Julia a = rand(1000, 1000) c = 1.0 out = similar(a) julia> @btime $(out) .= $a .- sin.($c); 5.695 ms (0 allocations: 0 bytes) julia> @btime $(out) .= $a .- sin($c); 495.669 μs (0 allocations: 0 bytes) ``` This likely happens because the compiler can't infer that sin is pure. I realize, with having access to the call tree in the new lazy broadcast, we could solve this for a predefined set of functions. First trick could be to just overload `broadcasted` for known signatures: ```Julia Base.Broadcast.broadcasted(::typeof(sin), x::Number) = sin(x) @btime $out .= $a .+ sin.($c) ``` this solves the problem for a chosen set of functions. We could also consider, if we introduce a purity trait to make this easier for multiple argument functions: ```Julia broadcasted(f, args...) = broadcasted(IsPure(f), f, args...) broadcasted(::Pure{true}, f, args...) = f(args...) # should probably not get applied to arrays broadcasted(::Pure{false}, f, args...) = Broadcasted(f, args...) ``` I guess this has been discussed before, but I couldn't really find an issue about it...

pengwyn · February 2, 2019, 9:14pm

While that has made it a bit easier to read @bennedich, you’ve effectively written out an explicit broadcast, assuming that the f(x,p) call is inlined. So that version suffers from the same issue as the broadcast fusion.

In other words, that version requires 3*50^3 calls of exp whereas the identity version only has 3*50 calls of exp. They both should have the same number of multiplications though.

(Somehow I’ve created a post, deleted it, and created a new post. All while intending to edit the original… oopes!)

bennedich · February 2, 2019, 9:25pm

Ah, sorry about that I was a bit quick and careless in posting that (and didn’t benchmark it either). It’s easily fixed though:

r = range(-5, 5, length=51) # or -5:0.2:5
pos = [0.1, 2.2, -3.5]
width = 0.5
f = p -> @. exp(-(r - p)^2 / width^2 / 2)
out3 = [x * y * z for x=f(pos[1]), y=f(pos[2]), z=f(pos[3])]

Benchmarking the two functions:

function version1()
    one_dim = range(-5, 5, length=51)
    x = reshape(one_dim, :,1,1)
    y = reshape(one_dim, 1,:,1)
    z = reshape(one_dim, 1,1,:)

    pos = [0.1,2.2,-3.5]
    width = 0.5
    out = identity(@. exp(-(x - pos[1])^2 / width^2 / 2)) .*
          identity(@. exp(-(y - pos[2])^2 / width^2 / 2)) .*
          identity(@. exp(-(z - pos[3])^2 / width^2 / 2))
end

function version2()
    r = range(-5, 5, length=51) # or -5:0.2:5
    pos = [0.1, 2.2, -3.5]
    width = 0.5
    f = p -> @. exp(-(r - p)^2 / width^2 / 2)
    out = [x * y * z for x=f(pos[1]), y=f(pos[2]), z=f(pos[3])]
end

With result:

julia> @btime version1();
  138.188 μs (12 allocations: 1.01 MiB)

julia> @btime version2();
  98.516 μs (15 allocations: 1.01 MiB)

julia> version1() ≈ version2()
true

pengwyn · February 2, 2019, 9:26pm

That’s interesting @stevengj, I didn’t know of the term LICM before. So given some optimisations to take account of pure functions, this kind of broadcast fusing would be handled at the loop level anyway?

@improbable22 good to know of that package!

pengwyn · February 2, 2019, 9:27pm

Ah but then isn’t the comprehension redundant? It looks simpler like this:

out3 = f(pos[1]) .* f(pos[2]) .* f(pos[3])

Oopes - I forgot that this hasn’t handled the dimensions…

c42f · February 3, 2019, 2:39am

All excellent points and I’d probably do the same in production code.

The identity trick is manual loop “unfusing” and forces the allocation of a temporary array, exactly like making the temporary explicitly on a separate line. Deciding on which parts of a broadcast would better be materialized early, stored and reused in a separate loop is an optimization which depends on pretty high level knowledge of the cost of memory allocation, memory bandwidth vs the cost of redoing some computation in the inner loop. And these things also completely depend on the size of the arrays involved. I don’t expect the julia compiler to do this any time soon and it also seems at odds with the simple definition of broadcasting which we have now as a fully fused operation.

If I understand LICM correctly it’s a much more local optimization which given a loop structure decides which parts may be hoisted out as loop invariants. It would definitely help here (if it’s not done already), as it allows the following transformation

for i=1:n
    for j=1:n
        for k=1:n
            out[i,j,k] = exp(x[i]) * exp(y[j]) * exp(z[k])
        end
    end
end

to

for i=1:n
    ex = exp(x[i])
    for j=1:n
        ey = exp(y[j])
        for k=1:n
            out[i,j,k] = ex * ey * exp(z[k])
        end
    end
end

But even so, you’ve still got O(n^3) invocations of exp, whereas allocating temporary storage and splitting the loops brings you down to O(n).

improbable22 · February 3, 2019, 12:17pm

Thanks, that summarises what I was failing to think through clearly about this LICM idea.

For matrix .* exp.(vector) such a re-organisation could call exp exactly n times, without needing to create a temporary array. But it would still need to know whether doing so is more or less important than reading matrix in column order.

Only for vector .* exp(scalar) (as in the issue linked) is it a certain benefit.

Topic		Replies	Views
fusing not faster Performance broadcast	7	1253	January 30, 2018
Newbie question about loop fusion and broadcasting New to Julia	9	985	June 12, 2020
In-place/fused broadcasting General Usage	10	636	February 22, 2019
Broadcast macro @. "escape" from it New to Julia	2	788	February 10, 2023
Speed of broadcast operations Performance	4	766	December 22, 2020

Preventing broadcast fusing

Related topics