Performance drop for explicit indexing in fused loops

Consider the following code performing a simple computation:

function test_vectorized(a, a_old)
    @. a[:] = a_old[:] * 2          # slices on both sides of the assignment
end


function test_vectorized2(a, a_old)
    @. a = a_old * 2                # whole arrays, no explicit indexing
end


function test_devectorized(a, a_old)
    nx = size(a, 1)
    @inbounds @simd for i = 1:nx    # explicit loop over the elements
        a[i] = a_old[i] * 2
    end
end


a = rand(100000000)

a1 = copy(a)
test_vectorized(a1, a)
@time test_vectorized(a1, a)


a2 = copy(a)
test_vectorized2(a2, a)
@time test_vectorized2(a2, a)

a3 = copy(a)
test_devectorized(a3, a)
@time test_devectorized(a3, a)

@assert isapprox(a1, a2)
@assert isapprox(a1, a3)

In Julia 0.6 the first vectorized function is about 5 times slower:

> julia test.jl
  0.617197 seconds (86 allocations: 762.946 MiB, 17.37% gc time)
  0.127845 seconds (4 allocations: 160 bytes)
  0.125778 seconds (4 allocations: 160 bytes)

It would seem that there is enough information in test_vectorized() for the compiler to turn it into something similar to test_devectorized(), but apparently that is not the case. test_vectorized2(), on the other hand, shows the same performance as the devectorized version. I guess the explicit indexing in test_vectorized() is the reason. Is there some way to improve the performance of fused loops in the case of explicit indexing? In my actual program I am performing a stencil-type calculation, so I need a way to specify ranges; i.e., I have something like:

@. a[2:end-1, 2:end-1] = (
    (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * dyd
    + (a_old[2:end-1, 1:end-2] + a_old[2:end-1, 3:end]) * dxd)

@views function test_vectorized3(a, a_old)
    @. a[:] = a_old[:] * 2    # @views turns both slices into non-copying views
end
julia> @time test_vectorized3(a2, a);
  0.126813 seconds (5 allocations: 208 bytes)

I think views will also work for your stencils.
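To see where the allocations in the original version go: the 762.946 MiB reported above is almost exactly one copy of the 10^8-element Float64 array (8 × 10^8 bytes ≈ 762.9 MiB). The slice a_old[:] on the right-hand side materializes that copy before the fused loop runs, while the indexed left-hand side of a .= assignment is already written in place. A minimal sketch illustrating the difference (the names copy_rhs! and view_rhs! are just for illustration):

using BenchmarkTools

copy_rhs!(a, a_old) = (@. a[:] = a_old[:] * 2; a)         # RHS slice allocates a full copy
view_rhs!(a, a_old) = (@views @. a[:] = a_old[:] * 2; a)  # @views makes the slices non-copying

a_old = rand(10^6)
a = similar(a_old)
@btime copy_rhs!($a, $a_old)   # allocates roughly 8 MB per call for the RHS copy
@btime view_rhs!($a, $a_old)   # allocates only a couple of small view wrappers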

Thanks, that helped a lot! The performance is now the same for the simple operation from the opening post, and the performance of the stencil version is close to the devectorized one, although still not exactly the same (~25% slower):

function test_vectorized(a, a_old)
    @views @. a[2:end-1, 2:end-1] = (
        (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * 0.1
        + (a_old[2:end-1, 1:end-2] + a_old[2:end-1, 3:end]) * 0.2)
end

function test_devectorized(a, a_old)
    nx, ny = size(a)
    @inbounds @simd for j = 2:(ny-1)
        for i = 2:(nx-1)
            a[i,j] = ((a_old[i-1, j] + a_old[i+1, j]) * 0.1
                + (a_old[i, j-1] + a_old[i, j+1]) * 0.2)
        end
    end
end


a = rand(10000, 10000)

a1 = copy(a)
test_vectorized(a1, a)
@time test_vectorized(a1, a)

a2 = copy(a)
test_devectorized(a2, a)
@time test_devectorized(a2, a)

@assert isapprox(a1, a2)

This gives

> julia test.jl
  0.218883 seconds (87 allocations: 6.623 KiB)
  0.152084 seconds (4 allocations: 160 bytes)

(without the @views, test_vectorized() is about 5 times slower).

It seems that your first @time call is also timing the compilation of @time itself (or, more precisely, of what it expands to), hence the much larger number of allocations. In general, it is recommended to use the BenchmarkTools.jl package for benchmarking, as follows:

julia> using BenchmarkTools

julia> @btime test_vectorized($a1, $a);
  215.568 ms (4 allocations: 256 bytes)

julia> @btime test_devectorized($a2, $a);
  187.499 ms (0 allocations: 0 bytes)

So there is indeed some overhead to the vectorized/views version, which I think currently cannot be avoided, because each view needs an allocation (although a small one).
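If the stencil is applied repeatedly (e.g. in a time-stepping loop), one way to amortize that per-call view overhead is to construct the views once, outside the hot loop, and reuse them. A sketch (the helper names are illustrative, and it assumes the two arrays themselves are reused between steps):

function make_stencil_views(a, a_old)
    nx, ny = size(a)
    # Each view(...) allocates a small SubArray wrapper; building them once
    # pays that cost a single time instead of on every stencil application.
    dst   = view(a,     2:nx-1, 2:ny-1)
    up    = view(a_old, 1:nx-2, 2:ny-1)
    down  = view(a_old, 3:nx,   2:ny-1)
    left  = view(a_old, 2:nx-1, 1:ny-2)
    right = view(a_old, 2:nx-1, 3:ny)
    return dst, up, down, left, right
end

# One stencil application: a fused, in-place broadcast over the stored views.
step!(dst, up, down, left, right) = @. dst = (up + down) * 0.1 + (left + right) * 0.2

a_old = rand(1000, 1000)
a = copy(a_old)
views = make_stencil_views(a, a_old)
for t = 1:100
    step!(views...)
end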

Don’t benchmark in global scope.
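That is, either interpolate the globals with $ as in the @btime calls above, or put the timed calls inside a function. A sketch of the latter (the wrapper name is illustrative):

function run_benchmarks()
    # Inside a function the arguments have concrete types known to the
    # compiler; at the top level of a script, a and a1 are non-constant
    # globals, which adds overhead and noise to the timing.
    a  = rand(10^6)
    a1 = similar(a)
    test_vectorized2(a1, a)       # warm-up call, so @time excludes compilation
    @time test_vectorized2(a1, a)
end

run_benchmarks()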

On Julia 0.6, you can also use the new syntax a .= a_old .* 2, which should perform quite well as a replacement for a[:] = a_old[:] * 2.
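For example (a minimal sketch; the function name is illustrative):

function scale2!(a, a_old)
    a .= a_old .* 2    # fused, in-place broadcast: no temporary arrays
    return a
end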

@dpsanders, could you elaborate on that a bit?

@nalimilan, note the @. in front of the expression; it converts operators into their broadcasted forms automatically. The question was about the difference between a and a[:].

I know that, but in Julia 0.6 you can use .= directly; there’s no reason to use @. or a[:] for this kind of thing anymore.

@views @. is fully in place. Asymptotically it’ll do as well as the devectorized form. Currently it has overhead because views are not stack-allocated, but that’s just a missing compiler optimization.

I agree there’s no reason to use a[:] anymore, but a .= @. ... seems weird and is no more efficient than @. a = ...

@. is useful if you have a lot of dotted operators in the expression (I have only two, so it’s not that evident, but this example was reduced from a bigger one; see my post with the stencil code).
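For example (a sketch mirroring the stencil expression above), the two spellings below are equivalent, and @. saves dots as the expression grows:

b, c, d, e = rand(10), rand(10), rand(10), rand(10)
a = similar(b)

# With explicit dots on every operator:
a .= (b .+ c) .* 0.1 .+ (d .+ e) .* 0.2

# With @. inserting the dots automatically:
@. a = (b + c) * 0.1 + (d + e) * 0.2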

[:] is crucial to the question; I was worried about the performance of slices. Of course, a and a[:] give the same result (for a 1D array), but even the trivial slice [:] still reproduces the performance issue. It’s an MWE; I tried to make it as simple as possible. In my actual code I have non-trivial slices, of course.

Glad to hear it; I was worried that I had missed some kind of compiler hint. Still, I do not really see how a few small allocations can take so much time. I already have a 10^8-element array; when does it start to even out?

I honestly do not see that construction used anywhere in this thread. I only have @. a = ....