 Bad performance in simple array broadcast operations

Hi,

I was rewriting some Matlab code into Julia, and I noticed that one function which used array broadcasting operations ran several times slower on Julia. I though this was really strange and I made a simple test function to test broadcasting operations in a few different ways and print the elapsed times. So, the test function is the following:

function test_broadcast(M::Int, N::Int, niter::Int=1000)
X = randn(M);

# Case A - simple broadcast
A = zeros(ComplexF64, M, N)
@time begin
for ii = 1:niter
A .+= exp.(im*X)
end
end

# Case B - broadcast with map
B = zeros(ComplexF64, M, N)
@time begin
for ii = 1:niter
B .+= map(x -> exp(x), im*X)
end
end

# Case C - broadcast using the broadcast! function
C = zeros(ComplexF64, M, N)
@time begin
for ii = 1:niter
broadcast!(+, C, C, exp.(im*X))
end
end

# Case D - broadcast using auxiliary variable
D = zeros(ComplexF64, M, N)
@time begin
for ii = 1:niter
Daux = exp.(im*X)
D .+= Daux
end
end

if !all((A .== B) .& (A .== C) .& (A .== D))
throw(ErrorException("Results mismatch"))
end
end

when executing this code using for M = 10000 and N = 10, I get the following results

1.826595 seconds (2.00 k allocations: 152.664 MiB, 0.13% gc time)
0.275579 seconds (5.00 k allocations: 305.344 MiB, 1.73% gc time)
0.268911 seconds (4.00 k allocations: 305.328 MiB, 1.74% gc time)
0.269613 seconds (4.00 k allocations: 305.328 MiB, 1.77% gc time)

So in this example the case A is roughly 7 times slower than all the other cases.

This code is really simple, I don’t know if I am missing anything. According to the documentation the case A should be equivalent to using the broadcast function. I tried this with Julia versions 1.1, 1.2, and 1.3-rc3, and the results are similar. To me this seems like a bug. I have searched open issues in github and there is this bug https://github.com/JuliaLang/julia/issues/28126 where it seems that there is a problem with broadcasting when using many arrays of the same size, but in my example I only have one array, so it seems that this is a different problem.

This is my output of versioninfo:

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = "/usr/share/code/code"

Any idea of why the case A is much worse?

You are breaking the broadcast fusion by not proadcasting the multiplication by im. Try to eother broadcast it or put the multiplication by im into the function you’re mapping with.

While true, the problem here is different. With the broadcasting, the exp is computed once for every element in A, that is, (M * N) times (it has to be, the += is fused with the call). In the other cases, exp is computed only M times and the result is then added to all the columns.

Just adding a print statement to exp, making the sizes a bit smaller gives:

julia> test_broadcast(3, 2, 1)
A
Calling exp
Calling exp
Calling exp
Calling exp
Calling exp
Calling exp

B
Calling exp
Calling exp
Calling exp

C
Calling exp
Calling exp
Calling exp

D
Calling exp
Calling exp
Calling exp
6 Likes

Ah, that was a really sneaky one! I could have stared at that for a long while before figuring that out

On an unrelated note, cis(x) should be a faster equivalent of exp(im * x) for real x

3 Likes

Ah, thank you very much!, I was really confused by this, now I understand how the dot operator and the fusing of operations works.

Also thanks to stevengj for suggesting the cis function, I didn’t know about it. I have tested it and it indeed runs a bit faster.