# Performance drop for explicit indexing in fused loops

Consider the following code doing a simple computation:

``````julia
function test_vectorized(a, a_old)
    @. a[:] = a_old[:] * 2
end

function test_vectorized2(a, a_old)
    @. a = a_old * 2
end

function test_devectorized(a, a_old)
    nx = size(a, 1)
    @inbounds @simd for i = 1:nx
        a[i] = a_old[i] * 2
    end
end

a = rand(100000000)

a1 = copy(a)
test_vectorized(a1, a)   # warm-up call to trigger compilation
@time test_vectorized(a1, a)

a2 = copy(a)
test_vectorized2(a2, a)
@time test_vectorized2(a2, a)

a3 = copy(a)
test_devectorized(a3, a)
@time test_devectorized(a3, a)

@assert isapprox(a1, a2)
@assert isapprox(a1, a3)
``````

In Julia 0.6 the first vectorized function is about 5 times slower:

``````
> julia test.jl
0.617197 seconds (86 allocations: 762.946 MiB, 17.37% gc time)
0.127845 seconds (4 allocations: 160 bytes)
0.125778 seconds (4 allocations: 160 bytes)
``````

It would seem that `test_vectorized()` contains enough information for the compiler to turn it into something similar to `test_devectorized()`, but apparently that is not the case. `test_vectorized2()`, on the other hand, shows the same performance as the devectorized version, so I suspect the explicit indexing in `test_vectorized()` is the culprit. Is there some way to improve the performance of fused loops with explicit indexing? In my actual program I am performing a stencil-type calculation, so I need a way to specify ranges (i.e., I have something like

``````julia
@. a[2:end-1, 2:end-1] = (
    (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * dyd
    + (a_old[2:end-1, 1:end-2] + a_old[2:end-1, 3:end]) * dxd)
``````

)

``````julia
@views function test_vectorized3(a, a_old)
    @. a[:] = a_old[:] * 2
end

julia> @time test_vectorized3(a2, a);
0.126813 seconds (5 allocations: 208 bytes)
``````

I think views will also work for your stencils.

Thanks, that did help a lot! The performance is the same for the simple operation from the opening post, and the performance of a stencil is close to the devectorized version, although still not exactly the same (~25% slower):

``````julia
function test_vectorized(a, a_old)
    @views @. a[2:end-1, 2:end-1] = (
        (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * 0.1
        + (a_old[2:end-1, 1:end-2] + a_old[2:end-1, 3:end]) * 0.2)
end

function test_devectorized(a, a_old)
    nx, ny = size(a)
    @inbounds @simd for j = 2:(ny-1)
        for i = 2:(nx-1)
            a[i, j] = ((a_old[i-1, j] + a_old[i+1, j]) * 0.1
                       + (a_old[i, j-1] + a_old[i, j+1]) * 0.2)
        end
    end
end

a = rand(10000, 10000)

a1 = copy(a)
test_vectorized(a1, a)
@time test_vectorized(a1, a)

a2 = copy(a)
test_devectorized(a2, a)
@time test_devectorized(a2, a)

@assert isapprox(a1, a2)
``````

This gives

``````
> julia test.jl
0.218883 seconds (87 allocations: 6.623 KiB)
0.152084 seconds (4 allocations: 160 bytes)
``````

(without the `@views`, `test_vectorized()` is about 5 times slower).

It seems that your first `@time` call is also timing the compilation of `@time` itself (or, more precisely, of the code it expands to), hence the much larger number of allocations. In general, it is recommended to use the `BenchmarkTools.jl` package for benchmarking, as follows:

``````julia
julia> using BenchmarkTools

julia> @btime test_vectorized($a1, $a);
215.568 ms (4 allocations: 256 bytes)

julia> @btime test_devectorized($a2, $a);
187.499 ms (0 allocations: 0 bytes)
``````

So there is indeed some overhead to the vectorized/views version, which I think currently cannot be avoided, because each view needs an allocation (although a small one).
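To make the distinction concrete, here is a minimal sketch (array names are made up): a slice allocates a fresh copy of the data, while a view is only a small wrapper around the parent array, so writes through a view are visible in the original.

``````julia
a = rand(1000)

b = a[2:999]          # slice: allocates a new array and copies 998 elements
c = view(a, 2:999)    # view: lightweight wrapper, no copy of the data
c[1] = 0.0            # writes through to a[2]
``````

The wrapper itself is still heap-allocated in Julia 0.6, which is the small per-view overhead mentioned above.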

Don’t benchmark in global scope.

On Julia 0.6, you can also use the new syntax `a .= a_old .* 2`, which should perform quite well as a replacement for `a[:] = a_old[:] * 2`.

@dpsanders, could you elaborate on that a bit?

@nalimilan, note the `@.` in front of the expression: it converts operators into broadcasted form automatically. The question was about the difference between `a` and `a[:]`.

I know that, but in Julia 0.6 you can directly use `.=`, there’s no reason to use `@.` or `a[:]` for this kind of thing anymore.
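To spell out the suggestion with a tiny example (the variable names are hypothetical), both forms below broadcast in place without allocating a temporary array; `.=` alone is enough when there is only one dotted operation:

``````julia
a_old = collect(1.0:5.0)
a = similar(a_old)

a .= a_old .* 2    # in-place broadcast: no temporary array
@. a = a_old * 2   # equivalent; @. inserts the dots automatically
``````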

`@views @.` is fully in-place. Asymptotically it will do as well as the devectorized form. Currently it has overhead because views are not stack-allocated, but that’s just a missing compiler optimization.

I agree there’s no reason to use `a[:]` anymore, but `a .= @. ...` seems weird and is no more efficient than `@. a = ...`

`@.` is useful if you have a lot of dotted operators in the expression (I have only two, so it’s not that evident, but this example was reduced from a bigger one; see my post with the stencil code).
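As a small illustration (a made-up expression), `@.` saves writing every dot by hand once the expression has several calls and operators; the two lines below are equivalent:

``````julia
x = rand(10); y = rand(10); z = similar(x)

@. z = sqrt(x^2 + y^2) / 2       # @. dots every call, operator, and the assignment
z .= sqrt.(x.^2 .+ y.^2) ./ 2    # the same expression written out manually
``````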

`[:]` is crucial to the question; I was worried about the performance of slices. Of course, `a` and `a[:]` give the same result for a 1D array, but even the trivial slice `[:]` reproduces the performance issue. It’s a MWE; I tried to make it as simple as possible. In my actual code I have non-trivial slices, of course.

Glad to hear it; I was worried that I had missed some kind of compiler hint. Although I do not really see how several small allocations can take so much time: I already have a 10^8-element array, so when does it start to even out?

I honestly do not see this construction used anywhere in this thread. I only have `@. a = ...`.