When should I write loops or vectorised calls?

Hi,

Coming from numpy and R, I am accustomed to the fact that vectorised expressions are faster than loops. I understand this is not the case in Julia, where vectorised expressions are usually expanded out to loops by the compiler anyway.

But is that always true? Are there cases where broadcasted dot calls will be faster than the corresponding loop? I assume there are, but are there heuristics somewhere for knowing when that will be the case?

1 Like

Most of the time there is no difference. I think the main exception is when dot-broadcasting hooks into some special machinery, say for GPU-based arrays.
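
For example, with CUDA.jl (a hedged sketch assuming a CUDA-capable GPU; the sizes are arbitrary), a dot call on a CuArray fuses into a single GPU kernel, while an element-by-element loop would fall back to slow scalar indexing:

using CUDA  # assumes CUDA.jl and a working GPU

x = CUDA.rand(10_000)
y = similar(x)

y .= sqrt.(x) .+ 1       # the whole dot expression fuses into one GPU kernel

CUDA.allowscalar(false)  # make accidental element-wise indexing an error
# A scalar loop like `for i in eachindex(x); y[i] = sqrt(x[i]) + 1; end`
# would now throw, since each y[i] access is a separate slow GPU transfer.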

So, for normal day-to-day programming you should just choose whichever reads better. If you need the last bit of performance, you should of course always profile (use BenchmarkTools.jl).

7 Likes

The most common mistake is the other way around, in which one unintentionally allocates something and the broadcast operation becomes slower. Dots and broadcasts must be written with care because of that.

julia> x = rand(1000);

julia> using BenchmarkTools

julia> @btime map(sqrt,x[1:500]);
  1.503 μs (3 allocations: 8.14 KiB)

julia> @btime map(sqrt,@view $x[1:500]);
  987.300 ns (1 allocation: 4.06 KiB)

julia> @btime sqrt.($x[1:500]);
  1.420 μs (2 allocations: 8.13 KiB)

julia> @btime sqrt.(@view $x[1:500]);
  988.600 ns (1 allocation: 4.06 KiB)

6 Likes

Expanding on the previous example, vectorized assignment operations can be slightly slower:

using BenchmarkTools

x = rand(100000)

f(x) = x[1:50000] .= sqrt.(@view x[1:50000])

function g(x)
    for i in 1:50000
        x[i] = sqrt(x[i])
    end
end

julia> @btime f(y) setup=(y=copy(x)) evals=1;
  79.260 μs (1 allocation: 48 bytes)

julia> @btime g(y) setup=(y=copy(x)) evals=1
  79.197 μs (0 allocations: 0 bytes)

But the difference is so small that it rarely matters.

1 Like

Yet knowing the loop versions and what can be done with them is a must:

julia> @btime g(y) setup=(y=copy(x)) evals=1
  79.283 μs (0 allocations: 0 bytes)

julia> using LoopVectorization

julia> function g(x)
           @avx for i in 1:50000
               x[i] = sqrt(x[i])
           end
       end
g (generic function with 1 method)

julia> @btime g(y) setup=(y=copy(x)) evals=1
  40.705 μs (0 allocations: 0 bytes)

5 Likes

I think we should distinguish the language and the ecosystem.

As far as the language is concerned, I don’t know of a case where a well-written for loop would be slower than broadcasting (with a caveat, see below).

Of course it’s possible to write loops that iterate on arrays using the “wrong” order for dimensions. Maybe broadcasting can help avoid this type of mistake. Example:

# Bad iteration
function f(A)
    B = similar(A)
    for i in axes(A, 1)
        for j in axes(A, 2)
            B[i,j] = abs(A[i,j])
        end
    end
    return B
end

# Good iteration
function g(A)
    B = similar(A)
    for j in axes(A, 2)
        for i in axes(A, 1)
            B[i,j] = abs(A[i,j])
        end
    end
    return B
end

# Broadcasting
function h(A)
    B = similar(A)
    B .= abs.(A)
    return B
end

julia> A = rand(3000, 3000);

julia> @btime f($A);
  269.833 ms (2 allocations: 68.66 MiB)

julia> @btime g($A);
  45.292 ms (2 allocations: 68.66 MiB)

julia> @btime h($A);
  33.961 ms (2 allocations: 68.66 MiB)

The wrong order of iteration gives a slow loop. Broadcasting uses the right order so it’s fast.

Wait, is broadcasting faster than the “good” loop? Well, these loops do almost no work; much of the time is spent on indexing, so bounds checking is a significant overhead. This can be fixed:

function g2(A)
    B = similar(A)
    @inbounds for j in axes(A, 2)
        for i in axes(A, 1)
            B[i,j] = abs(A[i,j])
        end
    end
    return B
end

julia> @btime g2($A);
  33.725 ms (2 allocations: 68.66 MiB)

It’s a nice thing that broadcasting does automatically, and you don’t have to worry about illegal uses of @inbounds. But maybe one day the compiler will make these particular @inbounds unnecessary.
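
As an aside, a pattern that often avoids the question entirely (a sketch of the same operation, assuming the usual bounds-check elision): iterating eachindex(B, A) asserts that both arrays share that index range, which typically lets the compiler drop the bounds checks without an explicit @inbounds.

function g3(A)
    B = similar(A)
    for I in eachindex(B, A)  # linear indices, column-major order
        B[I] = abs(A[I])      # bounds checks can typically be elided here
    end
    return B
end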

The ecosystem is a different story: there’s no guarantee that a particular package won’t have worse performance when using loops. At least Zygote is known to be slower with loops than with vectorized code, see e.g. Speed of vectorized vs for-loops using Zygote - #12 by Elrod. But it’s more the exception than the rule (at least it’s the only case I know of).
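
To make the Zygote case concrete, a sketch (the function names are made up for illustration, and the speed gap is the one reported in the linked thread, not measured here):

using Zygote

sum_sin_vec(x) = sum(sin.(x))      # broadcast: one vectorized adjoint

function sum_sin_loop(x)           # loop: differentiated step by step
    s = zero(eltype(x))
    for v in x
        s += sin(v)
    end
    return s
end

gradient(sum_sin_vec, rand(100))   # fast reverse pass
gradient(sum_sin_loop, rand(100))  # same gradient, much slower with Zygote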

8 Likes

I guess the most general answer is, then: if you are comfortable with the vectorized version and can write it without mistakenly allocating intermediate arrays, then vectorized operations have the best cost/benefit ratio. If one of these operations is the critical part of the code, it might be a good idea to expand it into loops and take advantage of special structure in the specific problem (this might include parallelizing the loop, for example).

At the same time, sometimes one or the other option reads better, and that is one reason to choose one or the other in Julia, because you can.

One type of problem where loops are preferable is when you can ‘bail out’ early: you check a condition and break if appropriate. That is frequently a source of significant speedups, as in the sketch below.
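
A minimal sketch (the names anyabove_loop / anyabove_bcast are invented for illustration):

function anyabove_loop(x, t)
    for v in x
        v > t && return true    # bail out at the first hit
    end
    return false
end

anyabove_bcast(x, t) = any(x .> t)  # evaluates the comparison for every
                                    # element and allocates a temporary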

6 Likes

Apart from @sijo’s answer, you took the time to show/argue that broadcasted calls are not so bad, although this is not what the OP (me) asked for: I wanted cases where the broadcasted calls will be faster than loops, not the other way around.

Apart from what @sijo said about “writing a good loop is sometimes harder than writing a good broadcast”, are there known cases where this happens, or is it structurally not a thing in Julia?

It is not a thing. A properly written loop will be as fast as a broadcasted call, or faster (if some structure of the problem is used).
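
As an illustration of “using some structure” (a hypothetical example; the names are invented): a loop can visit only the upper triangle of a symmetric computation, which a plain broadcast over all pairs cannot express.

# Sum of |x[i] - x[j]| over all ordered pairs i ≠ j.
function pairsum_loop(x)
    s = zero(eltype(x))
    for j in eachindex(x), i in firstindex(x):j-1
        s += abs(x[i] - x[j])   # visit each unordered pair once
    end
    return 2s                   # symmetry supplies the other half
end

pairsum_bcast(x) = sum(abs.(x .- x'))  # allocates and computes all N^2 entries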

3 Likes

Note that the optimal loop order can change based on the types of the arrays being accessed.

The LoopVectorization.@avx macro will try to pick the optimal loop order, taking into account Adjoint/Transpose/PermutedDimsArray wrappers.
This lets you write a single version of a loop, without needing to rewrite it to optimize for those cases.

@avx also works with broadcasts, but this has two limitations that loops don’t have:

  1. It will assume contiguous dimensions are not being broadcast. If they are being broadcast, you’ll probably get a segfault. This means you can’t write code like

     x = rand(1,10)
     A = rand(100,10);
     @avx @. exp(x) + A

     but using x = rand(10)' instead would be fine (see the sketch just after this list).
  2. The fact that other dimensions can be broadcast makes it currently use slightly less efficient indexing. Basically, you can’t increment the pointers and then use the pointer to check whether to terminate the loop. This is only relevant when you have multiple nested loops; with a single loop it makes more sense to use a loop induction variable than to increment the base pointer.
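
A small sketch of the workaround from item 1 (same arrays as in the snippet above): making the broadcast dimension an explicit adjoint keeps @avx safe.

using LoopVectorization

x = rand(10)'           # 1×10 Adjoint: the broadcast dimension is explicit
A = rand(100, 10)
B = @avx @. exp(x) + A  # safe, unlike the rand(1, 10) version above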

When not using @avx, you won’t get automatic loop-order switching for transposed arrays.
But when broadcasting, LLVM’s solution for avoiding problems 1) and 2) above is to create multiple loop versions, optimized for different broadcasting special cases. However, if you use too many arrays in your broadcast, it’ll give up on multiversioning and you’ll end up not getting SIMD code at all.

3 Likes

Does that suggest that it is a good idea to use @avx even if the SIMD operations are not advantageous, just in case a different data structure is given?

Note that it does add extra compiler latency, so it’d be your call as the author on whether the likelihood of alternate supported types is worth the trade-off.

But it should do better even in relatively simple cases:

using LoopVectorization, BenchmarkTools, Random
function mydotsimd(A, B)
    s = zero(promote_type(eltype(A), eltype(B)))
    @inbounds @simd for i in eachindex(A,B)
        s += A[i] * B[i]
    end
    s
end
function mydotavx(A, B)
    s = zero(promote_type(eltype(A), eltype(B)))
    @avx for i in eachindex(A,B)
        s += A[i] * B[i]
    end
    s
end

# benchmark random vector lengths from 1:512
N = 512;
x = rand(N); y = rand(N);
Ns = shuffle(1:N);
function testsizes(f::F, x::AbstractVector, y::AbstractVector, Ns) where {F}
    foreach(n -> @views(f(x[1:n], y[1:n])), Ns)
end
@btime testsizes(mydotsimd, $x, $y, $Ns)
@btime testsizes(mydotavx, $x, $y, $Ns)

For random-sized vectors, this yields:

julia> @btime testsizes(mydotsimd, $x, $y, $Ns)
  10.696 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotavx, $x, $y, $Ns)
  8.156 μs (0 allocations: 0 bytes)

Now, getting a bit more complicated:

M = 32;
A = rand(M, M);
B = rand(M, M);
Ms = shuffle(1:M);

function testsizes(f::F, x::AbstractMatrix, y::AbstractMatrix, Ns) where {F}
    foreach(n -> @views(f(x[1:n,1:n], y[1:n,1:n])), Ns)
end
@btime testsizes(mydotsimd, $A, $B, $Ms)
@btime testsizes(mydotavx, $A, $B, $Ms)

@btime testsizes(mydotsimd, $A', $B, $Ms)
@btime testsizes(mydotavx, $A', $B, $Ms)

@btime testsizes(mydotsimd, $A, $B', $Ms)
@btime testsizes(mydotavx, $A, $B', $Ms)

@btime testsizes(mydotsimd, $A', $B', $Ms)
@btime testsizes(mydotavx, $A', $B', $Ms)

And now we get:

julia> @btime testsizes(mydotsimd, $A, $B, $Ms)
  9.208 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotavx, $A, $B, $Ms)
  1.056 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotsimd, $A', $B, $Ms)
  9.630 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotavx, $A', $B, $Ms)
  2.344 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotsimd, $A, $B', $Ms)
  9.575 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotavx, $A, $B', $Ms)
  2.329 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotsimd, $A', $B', $Ms)
  9.628 μs (0 allocations: 0 bytes)

julia> @btime testsizes(mydotavx, $A', $B', $Ms)
  1.035 μs (0 allocations: 0 bytes)

It’s also helpful to look specifically at the 32×32 case, which is ideal for LLVM:

julia> @btime mydotsimd($A, $B)
  33.207 ns (0 allocations: 0 bytes)
255.8607178195125

julia> @btime mydotavx($A, $B)
  34.313 ns (0 allocations: 0 bytes)
255.8607178195125

julia> @btime mydotsimd($A', $B)
  852.364 ns (0 allocations: 0 bytes)
257.12741470722705

julia> @btime mydotavx($A', $B)
  157.423 ns (0 allocations: 0 bytes)
257.12741470722733

julia> @btime mydotsimd($A, $B')
  847.691 ns (0 allocations: 0 bytes)
257.12741470722705

julia> @btime mydotavx($A, $B')
  148.625 ns (0 allocations: 0 bytes)
257.12741470722733

julia> @btime mydotsimd($A', $B')
  853.938 ns (0 allocations: 0 bytes)
255.86071781951276

julia> @btime mydotavx($A', $B')
  51.592 ns (0 allocations: 0 bytes)
255.86071781951247

mydotavx(A', B') should be as fast as mydotavx(A, B), but that doesn’t seem to be the case.
I’m currently working on rewriting the library, and will try to address this then.
In the meantime, I’m adding 1.6 support to the old version of the library, and will hopefully have that done reasonably soon. The rewrite will probably take a few months.

3 Likes

Thanks again, very much, for the class. I just found a couple of videos where you talk about these things, I have fun for the weekend guaranteed.

1 Like

Ha, thanks. I should really make a better video than the one from JuliaCon.
I’m also always happy to answer performance questions.

One of LLVM’s problems in vectorizing code is that it is overly optimistic.
For the (A, B') case, for example, it also generates SIMD code. It just only does so for the most optimistic possible scenario: what if B is actually a 1 × N matrix (where N is a multiple of 4× the SIMD vector width), so that we can just do an ordinary dot product?

julia> A = rand(128,1);

julia> B = rand(1,128);

julia> @btime mydotsimd($A, $B')
  10.060 ns (0 allocations: 0 bytes)
33.02101475712813

julia> @btime mydotavx($A, $B')
  162.519 ns (0 allocations: 0 bytes)
33.02101475712813

The only optimized code path LLVM generates here assumes that B is a 1 × N matrix. If it is 2 × N instead, we basically get scalar code:

julia> A = rand(128, 2);

julia> B = rand(2, 128);

julia> @btime mydotsimd($A, $B')
  191.710 ns (0 allocations: 0 bytes)
64.02246340273062

julia> @btime mydotavx($A, $B')
  163.333 ns (0 allocations: 0 bytes)
64.02246340273062

LoopVectorization assumes B has enough rows to SIMD, so we see no performance drop when bumping the number of rows to 8:

julia> A = rand(128, 8);

julia> B = rand(8, 128);

julia> @btime mydotsimd($A, $B')
  849.567 ns (0 allocations: 0 bytes)
268.0950263729372

julia> @btime mydotavx($A, $B')
  150.715 ns (0 allocations: 0 bytes)
268.095026372937

So, LoopVectorization.jl and LLVM are both optimizing for different scenarios here. I just think LLVM’s are less representative of typical inputs.

3 Likes

Some good points here. One thing to note: whether writing vectorized or loopy code, Julia should theoretically meet or beat vectorized code in interpreted languages (numpy, R, APL), since the compiler provides interfunction optimizations (e.g. loop fusion).
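
As a concrete example of loop fusion (a minimal sketch): the chained dot call below compiles to a single loop, whereas the equivalent numpy expression would materialize each intermediate array.

x = rand(10_000)
y = similar(x)

# The entire right-hand side fuses into one loop over x; no temporary
# arrays for sin.(x) or x.^2 are ever allocated.
y .= sin.(x) .+ 2 .* x .^ 2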

1 Like

I think that, in cases where the interpreted language delegates the vectorized operation to a C library, Julia can end up a little slower. But generally speaking, yes.

1 Like

Libraries like numpy are often quite well optimized (as good as it can get with its Python API), e.g. with regard to SIMD, but in principle you can do the same in Julia. LoopVectorization.jl may help here to get the last bit of performance.