There does seem to be some problem with inference on these very complicated broadcasting expressions. On the CPU, I get a factor-of-3 speedup and zero allocations just by moving the expression into one inner function and broadcasting that.
I suppose these expressions are made up, but if your real ones have anything like this degree of repetition, removing it really helps. In all, a factor of 20:
julia> @btime for iter in 1:1
           math1!($C, $A, $B) # original functions, on Arrays of original size
           math2!($D, $C)
           math3!($E, $D)
       end
  69.521 ms (30 allocations: 864 bytes)
julia> function math1i!(C, A, B)
           inner(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           C .= inner.(A, B)
           return C
       end
# and the others similarly
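For reference, the matching rewrite of math2! would look like this (the expression is read off the two-pass math2lmiq! further down; math3i! is identical with (E, D) in place of (D, C)):

function math2i!(D, C)
    # one scalar function holding the whole formula, broadcast once
    inner(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    D .= inner.(C)
    return D
end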
julia> @btime for iter in 1:1  # same loop, calling math1i! etc.
       ...
  19.776 ms (0 allocations: 0 bytes)
julia> using CommonSubexpressions
julia> function math1c!(C, A, B)
           inner(A, B) = @cse A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           C .= inner.(A, B)
           return C
       end
julia> @btime for iter in 1:1  # same loop, calling math1c! etc.
       ...
  3.322 ms (0 allocations: 0 bytes)
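To see what @cse buys here, the hand-written equivalent computes each repeated subexpression once (a sketch of the idea, not the macro's actual expansion):

function math1h!(C, A, B)  # hypothetical hand-CSE version
    function inner(A, B)
        ab = A * B  # the product and the quotient each appear five times above
        ad = A / B
        return A^2 + B^2 + ab + ad - ab - ad + ab + ad - ab - ad
    end
    C .= inner.(A, B)
    return C
end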
Trying those on a GPU, the effect of avoiding the very complicated Broadcasted object is strong, but the effect of CSE is weak. (These kernels are memory-bound: doing more arithmetic while making one pass over the same memory is roughly free.)
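As a back-of-envelope check (the array size N is my assumption, the sizes aren't stated here; the nominal V100 figures are roughly 900 GB/s of bandwidth and 14 TFLOP/s of Float32 compute):

N = 10^7                         # assumed number of elements
bytes = 3 * sizeof(Float32) * N  # one fused pass: read A, read B, write C
bytes / 900e9                    # ≈ 0.00013 s, i.e. ~0.13 ms per pass at peak bandwidth

At that rate the memory traffic, not the handful of extra multiplies and divides, sets the time per pass.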
julia> @btime CUDA.@sync for iter in 1:1000
           math1!($C, $A, $B) # from above
           math2!($D, $C)
           math3!($E, $D)
       end
  697.650 ms (1698001 allocations: 26.64 MiB)
julia> @btime CUDA.@sync for iter in 1:1000
           math1i!($C, $A, $B) # simpler broadcast
           math2i!($D, $C)
           math3i!($E, $D)
       end
  186.972 ms (156001 allocations: 2.56 MiB)
julia> @btime CUDA.@sync for iter in 1:1000
           math1c!($C, $A, $B) # with CSE
           math2c!($D, $C)
           math3c!($E, $D)
       end
  179.948 ms (156001 allocations: 2.56 MiB)
julia> CUDA.device()
CuDevice(0): Tesla V100-PCIE-16GB
I think "short formulas" here means more separate broadcasts. That will need more memory (or more pre-allocated containers), and making multiple passes over the data is usually slower.
However, your "long formulas" appear to be long enough to make inference give up, or something like that. The fact that they do not run with zero allocations on the CPU is a bad sign. This is a separate problem, and it can probably be solved by asking less of the broadcasting machinery.
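If the fused one-liner still trips inference, the extreme form of "asking less" is to bypass broadcasting entirely with a plain loop (a sketch, CPU-only; on the GPU you'd stick with broadcasting or write a kernel):

function math1loop!(C, A, B)
    # no Broadcasted object for the compiler to digest, just a loop
    @inbounds @simd for i in eachindex(C, A, B)
        a, b = A[i], B[i]
        C[i] = a^2 + b^2 + a*b + a/b - a*b - a/b + a*b + a/b - a*b - a/b
    end
    return C
end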
Edit: I see line-breaking was discussed a bit above too.
Splitting in two avoids the inference/allocation problem, but doesn't add much speed, as we need to go over the data twice. (I didn't try this on the GPU though; CPU details below.)
julia> function math1lmiq!(C, A, B) # two-pass function, suggested by @lmiq above
           @. C = A^2 + B^2 + A * B
           # correction: the second pass needs to be += (and .+= has its own performance headaches)
           @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
           return C
       end
math1lmiq! (generic function with 1 method)
julia> function math2lmiq!(D, C) # same idea
           @. D = C^2 + C^2 + C * C + C / C - C * C - C / C
           @. D += C * C + C / C - C * C - C / C
           return D
       end
math2lmiq! (generic function with 1 method)
julia> function math3lmiq!(E, D) # same idea
           @. E = D^2 + D^2 + D * D + D / D - D * D - D / D
           @. E += D * D + D / D - D * D - D / D
           return E
       end
math3lmiq! (generic function with 1 method)
julia> @btime for iter = 1:1
           math1lmiq!($C, $A, $B)
           math2lmiq!($D, $C)
           math3lmiq!($E, $D)
       end
  55.559 ms (0 allocations: 0 bytes)