Performance drop for explicit indexing in fused loops

fjarri · June 26, 2017, 5:23am

Consider the following code doing a simple computation:

function test_vectorized(a, a_old)
    @. a[:] = a_old[:] * 2
end


function test_vectorized2(a, a_old)
    @. a = a_old * 2
end


function test_devectorized(a, a_old)
    nx = size(a, 1)
    @inbounds @simd for i = 1:nx
        a[i] = a_old[i] * 2
    end
end


a = rand(100000000)

a1 = copy(a)
test_vectorized(a1, a)
@time test_vectorized(a1, a)


a2 = copy(a)
test_vectorized2(a2, a)
@time test_vectorized2(a2, a)

a3 = copy(a)
test_devectorized(a3, a)
@time test_devectorized(a3, a)

@assert isapprox(a1, a2)
@assert isapprox(a1, a3)

In Julia 0.6 the first vectorized function is about 5 times slower:

> julia test.jl
  0.617197 seconds (86 allocations: 762.946 MiB, 17.37% gc time)
  0.127845 seconds (4 allocations: 160 bytes)
  0.125778 seconds (4 allocations: 160 bytes)

It would seem that there is enough information in test_vectorized() for the compiler to turn it into something similar to test_devectorized(), but, apparently, it is not the case. test_vectorized2(), on the other hand, shows the same performance. I guess the explicit indexing in test_vectorized() is the reason. Is there some way to improve the performance of fused loops in case of explicit indexing? In my actual program I am performing a stencil-type calculation, so I need a way to specify ranges (i.e., I have something like

@. a[2:end-1, 2:end-1] = (
    (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * dyd
    + (a_old[2:end-1,1:end-2] + a_old[2:end-1, 3:end]) * dxd)

)

aaowens · June 26, 2017, 5:45am

@views function test_vectorized3(a, a_old)
    @. a[:] = a_old[:] * 2
end
julia> @time test_vectorized3(a2, a);
  0.126813 seconds (5 allocations: 208 bytes)

I think views will also work for your stencils.

fjarri · June 26, 2017, 6:51am

Thanks, that did help a lot! The performance is the same for the simple operation from the opening post, and the performance of a stencil is close to the devectorized version, although still not exactly the same (~25% slower):

function test_vectorized(a, a_old)
    @views @. a[2:end-1, 2:end-1] = (
        (a_old[1:end-2, 2:end-1] + a_old[3:end, 2:end-1]) * 0.1
        + (a_old[2:end-1,1:end-2] + a_old[2:end-1, 3:end]) * 0.2)
end

function test_devectorized(a, a_old)
    nx, ny = size(a)
    @inbounds @simd for j = 2:(ny-1)
        for i = 2:(nx-1)
            a[i,j] = ((a_old[i-1, j] + a_old[i+1, j]) * 0.1 
                + (a_old[i, j-1] + a_old[i, j+1]) * 0.2)
        end
    end
end


a = rand(10000, 10000)

a1 = copy(a)
test_vectorized(a1, a)
@time test_vectorized(a1, a)

a2 = copy(a)
test_devectorized(a2, a)
@time test_devectorized(a2, a)

@assert isapprox(a1, a2)

This gives

> julia test.jl
  0.218883 seconds (87 allocations: 6.623 KiB)
  0.152084 seconds (4 allocations: 160 bytes)

(without the @views, test_vectorized() is about 5 times slower).

traktofon · June 26, 2017, 9:22am

It seems that your first @time call is also timing the compilation of @time itself (or more precisely, of what it expands to), hence the much larger number of allocations. In general, it is recommended to use the package BenchmarkTools.jl for benchmarking, as follows:

julia> using BenchmarkTools

julia> @btime test_vectorized($a1, $a);
  215.568 ms (4 allocations: 256 bytes)

julia> @btime test_devectorized($a2, $a);
  187.499 ms (0 allocations: 0 bytes)

So there is indeed some overhead to the vectorized/views version, which I think currently cannot be avoided, because each view needs an allocation (although a small one).

dpsanders · June 26, 2017, 1:38pm

Don’t benchmark in global scope.

nalimilan · June 26, 2017, 1:41pm

On Julia 0.6, you can also use the new syntax a .= a_old .* 2, which should perform quite well as a replacement for a[:] = a_old[:] * 2.

fjarri · June 26, 2017, 1:52pm

@dpsanders, could you elaborate on that a bit?

@nalimilan, note the @. in front of the expression, it converts operators into broadcasted form automatically. The question was about the difference between a and a[:].

nalimilan · June 26, 2017, 2:09pm

I know that, but in Julia 0.6 you can directly use .=, there’s no reason to use @. or a[:] for this kind of thing anymore.

ChrisRackauckas · June 26, 2017, 2:17pm

@views @. is fully inplace. Asymtopically it’ll do as well as the devectorized form. Currently it has overhead because views are not stack-allocated, but that’s just a missing compiler optimization.

I agree there’s no reason to a[:] anymore, but a .= @. ... seems weird and is no more efficient than @. a = ...

fjarri · June 26, 2017, 2:24pm

@. is useful if you have a lot of dotted operators in the expression (I have only two, so it’s not that evident, but this example was reduced from a bigger one; see my post with the stencil code).

[:] is crucial to the question; I was worried about the performance of slices. Of course, a and a[:] give the same result (for a 1D array), but even a trivial slice [:] still reproduces the issue with performance. It’s a MWE, I tried to make it as simple as possible. In my actual code I have non-trivial slices, of course.

fjarri · June 26, 2017, 2:25pm

Glad to hear it, I was worried that I missed some kind of compiler hint. Although I do not really see how several small allocations can take so much time. I already have a 10^10-element array, when does it start to even out?

I honestly do not see this construction used anywhere in this thread. I only have @. a = ....

Topic		Replies	Views
Comprehensions and vectorization New to Julia	13	4917	September 20, 2017
How to optimize computation within vectorized list operation and large array? Performance	16	465	October 21, 2022
Large performance regression of vectorized functions on master General Usage	4	614	July 25, 2018
Unexpected poor FOR loops performance Performance question	8	398	June 28, 2025
Performance difference between two code New to Julia	3	782	January 18, 2017

Performance drop for explicit indexing in fused loops

Related topics