Consider the following code doing a simple computation:
function test_vectorized(a, a_old)
    @. a[:] = a_old[:] * 2
end

function test_vectorized2(a, a_old)
    @. a = a_old * 2
end

function test_devectorized(a, a_old)
    nx = size(a, 1)
    @inbounds @simd for i = 1:nx
        a[i] = a_old[i] * 2
    end
end
a = rand(100000000)

a1 = copy(a)
test_vectorized(a1, a)    # warm-up call, so that @time does not measure compilation
@time test_vectorized(a1, a)

a2 = copy(a)
test_vectorized2(a2, a)    # warm-up
@time test_vectorized2(a2, a)

a3 = copy(a)
test_devectorized(a3, a)    # warm-up
@time test_devectorized(a3, a)

@assert isapprox(a1, a2)
@assert isapprox(a1, a3)
In Julia 0.6, the first vectorized function is about 5 times slower than the other two.
It would seem that there is enough information in test_vectorized() for the compiler to turn it into something similar to test_devectorized(), but apparently that is not the case. test_vectorized2(), on the other hand, shows the same performance as the devectorized version, so I guess the explicit indexing in test_vectorized() is what slows it down. Is there some way to improve the performance of fused loops when explicit indexing is involved? In my actual program I am performing a stencil-type calculation, so I need a way to specify index ranges, i.e., I have something like the sketch below.
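(A made-up 1-D example just to show the shape of the expression; the real code is a 2-D stencil:)

# hypothetical 1-D stencil: every index on both sides is an explicit range
@. a[2:end-1] = (a_old[1:end-2] + a_old[3:end]) * 0.5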
Thanks, that did help a lot! The performance is the same for the simple operation from the opening post, and the performance of a stencil is close to the devectorized version, although still not exactly the same (~25% slower):
function test_vectorized(a, a_old)
    @views @. a[2:end-1, 2:end-1] = (
        (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * 0.1
        + (a_old[2:end-1, 1:end-2] + a_old[2:end-1, 3:end]) * 0.2)
end

function test_devectorized(a, a_old)
    nx, ny = size(a)
    # @simd belongs on the innermost loop
    @inbounds for j = 2:(ny-1)
        @simd for i = 2:(nx-1)
            a[i, j] = ((a_old[i-1, j] + a_old[i+1, j]) * 0.1
                       + (a_old[i, j-1] + a_old[i, j+1]) * 0.2)
        end
    end
end
a = rand(10000, 10000)

a1 = copy(a)
test_vectorized(a1, a)    # warm-up
@time test_vectorized(a1, a)

a2 = copy(a)
test_devectorized(a2, a)    # warm-up
@time test_devectorized(a2, a)

@assert isapprox(a1, a2)
It seems that your first @time call is also timing the compilation of @time itself (or, more precisely, of the code it expands to), hence the much larger number of allocations. In general, it is recommended to use the BenchmarkTools.jl package for benchmarking, as follows:
julia> using BenchmarkTools
julia> @btime test_vectorized($a1, $a);
215.568 ms (4 allocations: 256 bytes)
julia> @btime test_devectorized($a2, $a);
187.499 ms (0 allocations: 0 bytes)
So there is indeed some overhead to the vectorized/views version, which I think currently cannot be avoided, because each view needs an allocation (although a small one).
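If you want to check the view overhead on your own setup, a quick (hypothetical) snippet is below; the numbers are version-dependent, and on recent Julia releases views can be stack-allocated, so the count may even be zero:

check(a) = @allocated view(a, 2:size(a, 1)-1, 2:size(a, 2)-1)

check(a)    # first call also counts compilation
check(a)    # second call reports just the bytes allocated for the view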
@nalimilan, note the @. in front of the expression; it converts operators into broadcasted form automatically. The question was about the difference between a and a[:].
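You can see what it does with @macroexpand (the exact output depends on the Julia version, but roughly):

julia> @macroexpand @. a = a_old * 2
:(a .= (*).(a_old, 2))

julia> @macroexpand @. a[:] = a_old[:] * 2
:(a[:] .= (*).(a_old[:], 2))

Note that in the second form a_old[:] on the right-hand side stays an ordinary slice, i.e. a copy, which is where the extra allocations come from.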
@views @. is fully in-place. Asymptotically it’ll do as well as the devectorized form. Currently it has overhead because views are not stack-allocated, but that’s just a missing compiler optimization.
I agree there’s no reason to use a[:] anymore, but a .= @. ... seems weird and is no more efficient than @. a = ...
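That is, these two spellings produce the same fused in-place loop:

a .= @. a_old * 2    # broadcasted assignment with a dotted right-hand side
@. a = a_old * 2     # same thing: @. also turns the = into .=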
@. is useful if you have a lot of dotted operators in the expression (I have only two, so it’s not that evident, but this example was reduced from a bigger one; see my post with the stencil code).
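For example, with a longer expression (a made-up illustration, where b, c, and d stand in for arrays of the same size) the dots add up quickly:

a .= a_old .* 2 .+ b .* c .- sin.(d)    # explicit dots everywhere

@. a = a_old * 2 + b * c - sin(d)       # the same fused expression with @.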
[:] is crucial to the question: I was worried about the performance of slices. Of course, a and a[:] give the same result (for a 1-D array), but even a trivial slice like [:] is enough to reproduce the performance issue. It’s an MWE; I tried to make it as simple as possible. In my actual code I have non-trivial slices, of course.
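Even the trivial slice matters because on the right-hand side it is a copy unless it is turned into a view:

b = a_old[:]            # allocates a full copy of the data
b = @views a_old[:]     # a view: no copy
b = view(a_old, :)      # same thing, written explicitly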
Glad to hear it; I was worried that I had missed some kind of compiler hint. Although I do not really see how several small allocations can take so much time. I already have a 10^8-element array; when does it start to even out?
I honestly do not see the a .= @. ... construction used anywhere in this thread. I only have @. a = ....