Preventing broadcast fusing

broadcast

#1

I have found in a particular function of mine that it is fastest to perform a calculation in 4 steps. I can logically write this down as one line of code, but the fusing syntax of broadcasting ends up being too greedy and slowing it down.

Is there a way to fence off expressions of code to indicate a boundary for the fusing syntax?

The code in my case is something very similar to:

n = 50
vector = rand(n)
a = reshape(vector, n,1,1)
b = reshape(vector, 1,n,1)
c = reshape(vector, 1,1,n)

# What I would like to do:
out = exp.(a) .* exp.(b) .* exp.(c)

# What is fastest
temp_a = exp.(a)
temp_b = exp.(b)
temp_c = exp.(c)
out = temp_a .* temp_b .* temp_c

The speed difference in benchmarking is roughly a factor of 200 for the above example.

I can guess that the single-line version currently unnecessarily recalculates the exponential for a, b and c in every position of the output array, instead of only once for each row, etc… Is there anyway I could write
out = (@barrier exp.(a)) .* (@barrier exp.(b)) .* ...
instead? I guess I should mention that a is actually a more complicated expression itself with several operations fused together.


#2

A normal (non-dotted) function call acts as a barrier so you could insert a few calls to identity I suppose :slight_smile:

out = identity(exp.(a)) .* identity(exp.(b)) .* identity(exp.(c))

#3

That works brilliantly! Thanks!


#4

Why are you calculating exp of the same vector three times, instead of

a = exp.(vector)
b = reshape(a, 1, :)
c = reshape(a, 1, 1, :) 
out = a .* b .* c

?

Also, out is quite sparse in terms of information. Do you really need to blow it up in full 3D? Depending on what you need to do later, might you be able to avoid constructing it at all? I just ask because I often see constructions like these that are really redundant.


#5

That was a clever solution! But seeing code like that, it wouldn’t be obvious to me what’s going on. I’d probably opt for the 4 line version for more readability, especially since you say that "a is actually a more complicated expression itself with several operations fused together".

Alternatively, I’d add a comment like “identity is used to prevent too eager fusing” – to explain what’s going on, and prevent some too clever programmer seeing your code deciding to remove the “unnecessary” identity methods.

I’m guessing it’s a minimal example and not the actual code?


#6

It’s not a minimal example if it leaves obvious improvements on the table, imho. Also, very obvious optimizations in a question about optimization are really distracting.


#7

The broadcasting @. macro has a splicing notation that uses $. It doesn’t apply to this type of case, though. At least I think so. Maybe something similar could be made to work somehow.

@. $(exp(a)) * $(exp(b)) * $(exp(c)) or @. exp(a)) $* exp(b) $* exp(c) (neither of these any good, or make total sense). What could be a good notation for this?


#8

If indeed it’s one and the same vector, there’s no need for reshape and multiple variables:

a = exp.(rand(50))
out = [i*j*k for i=a, j=a, k=a]

#9

Ah yes @DNF I simplified too far in my example. Mathematically I’m actually just calculating the Gaussian A^3 exp(-|r - grid|^2 / 2width^2). So a more minimal example would be:

one_dim = range(-5, 5, length=51)
x = reshape(one_dim, :,1,1)
y = reshape(one_dim, 1,:,1)
z = reshape(one_dim, 1,1,:)

pos = [0.1,2.2,-3.5]
width = 0.5
out = identity(@. exp(-(x - pos[1])^2 / width^2 / 2)) .* 
      identity(@. exp(-(y - pos[2])^2 / width^2 / 2)) .* 
      identity(@. exp(-(z - pos[3])^2 / width^2 / 2))

Actually I try and not work explicitly with the 3 coordinates separately but my first attempt at this function was very slow, because I was calling exp on all grid points (not due to fusing but by first calculating r^2 = x^2 + y^2 + z^2). So I set to going through a more conventional approach to try and speed it up and came across the reason for my original question.

My code is now using prod instead, and this also breaks the fusion in broadcasting. But I know now how to handle these issues in the future!

Thanks @DNF for the reshape(..., :, ...) tip. I didn’t know that was possible.

@bennedich, it wasn’t too hard to read originally, because I had put it in 4 lines anyway but as connected lines with .* at the end of them. That way it was readable but I fell into the trap of everything fusing.

Mathematically, I can’t see a way out of needing all these values on the full grid. This minimal example is actually looped over for many different positions, which are accumulated into an array. After this, convolutions + thresholding are performed on the full array.


#10

An alternative way, avoiding the repeated code, reshapes, and identity “hack”:

r = range(-5, 5, length=51) # or -5:0.2:5
pos = [0.1, 2.2, -3.5]
width = 0.5
f = (x,p) -> exp(-(x - p)^2 / width^2 / 2)
out = [f(x, pos[1]) * f(y, pos[2]) * f(z, pos[3]) for x=r, y=r, z=r]

#11

I’ve frequently been caught by the fact that broadcasting evaluates a function at every output point, not every value of its argument.

One alternative to the identity trick is to use a vectorised function instead of broadcasting exp, something like this:

using Yeppp: exp!
pre = -1 / width^2 /2
out = exp!(@. pre * (x - pos[1])^2 ) .* 
      exp!(@. pre * (y - pos[2])^2 ) .* 
      exp!(@. pre * (z - pos[3])^2 )

This seems more readable, and since the argument of each exp! creates a new array, you are free to over-write it without touching x. Instead of Yeppp you could just define your own function.


#12

See also the discussion at this issue:


#14

While that has made it a bit easier to read @bennedich, you’ve effectively written out an explicit broadcast, assuming that the f(x,p) call is inlined. So that version suffers from the same issue as the broadcast fusion.

In other words, that version requires 3*50^3 calls of exp whereas the identity version only has 3*50 calls of exp. They both should have the same number of multiplications though.

(Somehow I’ve created a post, deleted it, and created a new post. All while intending to edit the original… oopes!)


#15

Ah, sorry about that I was a bit quick and careless in posting that (and didn’t benchmark it either). It’s easily fixed though:

r = range(-5, 5, length=51) # or -5:0.2:5
pos = [0.1, 2.2, -3.5]
width = 0.5
f = p -> @. exp(-(r - p)^2 / width^2 / 2)
out3 = [x * y * z for x=f(pos[1]), y=f(pos[2]), z=f(pos[3])]

Benchmarking the two functions:

function version1()
    one_dim = range(-5, 5, length=51)
    x = reshape(one_dim, :,1,1)
    y = reshape(one_dim, 1,:,1)
    z = reshape(one_dim, 1,1,:)

    pos = [0.1,2.2,-3.5]
    width = 0.5
    out = identity(@. exp(-(x - pos[1])^2 / width^2 / 2)) .*
          identity(@. exp(-(y - pos[2])^2 / width^2 / 2)) .*
          identity(@. exp(-(z - pos[3])^2 / width^2 / 2))
end

function version2()
    r = range(-5, 5, length=51) # or -5:0.2:5
    pos = [0.1, 2.2, -3.5]
    width = 0.5
    f = p -> @. exp(-(r - p)^2 / width^2 / 2)
    out = [x * y * z for x=f(pos[1]), y=f(pos[2]), z=f(pos[3])]
end

With result:

julia> @btime version1();
  138.188 μs (12 allocations: 1.01 MiB)

julia> @btime version2();
  98.516 μs (15 allocations: 1.01 MiB)

julia> version1() ≈ version2()
true

#16

That’s interesting @stevengj, I didn’t know of the term LICM before. So given some optimisations to take account of pure functions, this kind of broadcast fusing would be handled at the loop level anyway?

@improbable22 good to know of that package!


#17

Ah but then isn’t the comprehension redundant? It looks simpler like this:

out3 = f(pos[1]) .* f(pos[2]) .* f(pos[3])

Oopes - I forgot that this hasn’t handled the dimensions…


#18

All excellent points and I’d probably do the same in production code.

The identity trick is manual loop “unfusing” and forces the allocation of a temporary array, exactly like making the temporary explicitly on a separate line. Deciding on which parts of a broadcast would better be materialized early, stored and reused in a separate loop is an optimization which depends on pretty high level knowledge of the cost of memory allocation, memory bandwidth vs the cost of redoing some computation in the inner loop. And these things also completely depend on the size of the arrays involved. I don’t expect the julia compiler to do this any time soon and it also seems at odds with the simple definition of broadcasting which we have now as a fully fused operation.

If I understand LICM correctly it’s a much more local optimization which given a loop structure decides which parts may be hoisted out as loop invariants. It would definitely help here (if it’s not done already), as it allows the following transformation

for i=1:n
    for j=1:n
        for k=1:n
            out[i,j,k] = exp(x[i]) * exp(y[j]) * exp(z[k])
        end
    end
end

to

for i=1:n
    ex = exp(x[i])
    for j=1:n
        ey = exp(y[j])
        for k=1:n
            out[i,j,k] = ex * ey * exp(z[k])
        end
    end
end

But even so, you’ve still got O(n^3) invocations of exp, whereas allocating temporary storage and splitting the loops brings you down to O(n).


#19

Thanks, that summarises what I was failing to think through clearly about this LICM idea.

For matrix .* exp.(vector) such a re-organisation could call exp exactly n times, without needing to create a temporary array. But it would still need to know whether doing so is more or less important than reading matrix in column order.

Only for vector .* exp(scalar) (as in the issue linked) is it a certain benefit.