Hello, I found that when the result is assigned to an array, the point-wise loop introduces a lot of additional allocations (on Julia 1.0):
julia> function f1(a, b, c)
           @. t1 = a * b / c
       end
f1 (generic function with 1 method)
julia> function f2(a, b, c)
           for j = 1:size(a, 2), i = 1:size(a, 1)
               t1[i,j] = a[i,j] * b[i,j] / c[i,j]
           end
       end
f2 (generic function with 1 method)
julia> t1, x1, x2, x3 = [rand(10000, 5000) for _ in 1: 4];
julia> f1(x1, x2, x3); f2(x1, x2, x3);
julia> @time f1(x1, x2, x3);
0.095687 seconds (8 allocations: 256 bytes)
julia> @time f2(x1, x2, x3);
2.002025 seconds (142.34 M allocations: 2.121 GiB, 3.33% gc time)
By comparison, the point-wise loop beats broadcast fusion when there is no assignment:
julia> function g1(a, b, c)
           @. a * b / c
       end
g1 (generic function with 1 method)
julia> function g2(a, b, c)
           for j = 1:size(a, 2), i = 1:size(a, 1)
               a[i,j] * b[i,j] / c[i,j]
           end
       end
g2 (generic function with 1 method)
julia> g1(x1, x2, x3); g2(x1, x2, x3);
julia> @time g1(x1, x2, x3);
0.299063 seconds (6 allocations: 381.470 MiB, 9.51% gc time)
julia> @time g2(x1, x2, x3);
0.033901 seconds (4 allocations: 160 bytes)
Am I doing something wrong? I may need the point-wise loop to implement SharedArray-based parallelism. Thank you!
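
One variant I have not timed yet, as a minimal sketch: it assumes the extra allocations in f2 come from t1 being a non-constant global rather than from the loop itself. The name f2_arg! and the small array sizes are made up for illustration; the only change from f2 is that the output array is passed in as an argument.

```julia
# Hypothetical variant of f2: the output array is an explicit argument
# instead of the global `t1` (assumption: the global is the culprit).
function f2_arg!(out, a, b, c)
    for j in axes(a, 2), i in axes(a, 1)
        out[i, j] = a[i, j] * b[i, j] / c[i, j]
    end
    return out
end

# Small, arbitrary sizes so the check runs quickly.
out, y1, y2, y3 = [rand(100, 50) for _ in 1:4]
f2_arg!(out, y1, y2, y3)  # first call compiles the method

# The loop should agree with the fused broadcast on the same inputs.
@assert isapprox(out, y1 .* y2 ./ y3)
```

If that assumption holds, `@time f2_arg!(t1, x1, x2, x3)` on the original arrays should show whether the allocations go away.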