Loops vs Vectorization? When to use?

Hi everyone.
I am optimizing some older codes of mine. I find that sometimes vectorizing is faster than looping (which did not happen before) Could anyone explain when it is better to use one or the other in Julia.

Thanks :slight_smile:

In most cases the advantage of a vectorized operation comes from the guarantee that the operation is inbounds (the broadcast fails if the dimensions do not match), something that you might have to inform explicitly the compiler if writing a loop.

2 Likes

It can also help if things are type unstable by providing a function barrier.

3 Likes

Chris, could you please decode your important statement for the noobs with a simple example?

In the code sample below, looping via a comprehension performs better than vectorization (if I understand the term correctly) but I do not see a function barrier:

A = ["abcd", [1,2,10], π]
length.(A)                  # 661 ns (9 allocs: 416 bytes)
[length(a) for a in A]      # 242 ns (3 allocs: 112 bytes)
1 Like

I’m new to Julia. Can you please explain when vectorizing would be SLOWER than looping? Intuition says vectorizing (when possible) is always faster.

I’m sure you’re aware, that array comprehensions compile to a Generator (with an anonymous function), which is immediately collected into an array. (see the bottom six lines)

julia> Meta.@lower [length(a) for a in A]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#13#14"
│        const var"#13#14"
│   %3 = Core._structtype(Main, Symbol("#13#14"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        Core._setsuper!(%3, Core.Function)
│        var"#13#14" = %3
│        Core._typebody!(%3, Core.svec())
└──      return nothing
)))
│   %2 = Core.svec(var"#13#14", Core.Any)
│   %3 = Core.svec()
│   %4 = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│        $(Expr(:method, false, :(%4), CodeInfo(
    @ none within `none`
1 ─ %1 = length(a)
└──      return %1
)))
│        #13 = %new(var"#13#14")
│   %7 = #13
│   %8 = Base.Generator(%7, A)
│   %9 = Base.collect(%8)
└──      return %9
))))
1 Like

Vectorizing an operation is executing a loop under the hood. In theory a loop and a vectorized operation are doing the same thing.

Thank you and no, I am not. I only know Julia from the walks in the park, but not from seeing her under the operating table.

1 Like

Ah, then you might find this interesting :wink:

1 Like

Vectorized expressions sometimes create intermediate temporary arrays which are unnecessary. In many cases you can avoid this, but not always.

With a loop, you can sometimes get much better performance:

# vectorized
foo(A) = maximum(abs2.(A))

# loop
function bar(A)
    val = zero(eltype(A))
    for i in eachindex(A)
        @inbounds val = max(val, abs2(A[i]))
    end
    return val
end

Benchmarks:

julia> using BenchmarkTools

julia> A = rand(-100:100, 10_000);

julia> @btime foo($A);
  7.600 μs (2 allocations: 78.17 KiB)

julia> @btime bar($A);
  2.322 μs (0 allocations: 0 bytes)

As you see, the loop is much faster. But it’s not always easy to predict. If I supply a vector of Floats instead of a vector of Ints, the benchmarks are very different.

(Sorry, made an error in the first draft.)

2 Likes

You may also want to check the “performance tips” from the official doc.

Performance Tips · The Julia Language

Maybe you’ll find some good tips.

Maybe relevant for you, broadcast is also useful to build generic code running on both GPU and CPU.
For example, considering a n_s\times n_s mesh, a basic finite difference scheme can be written as

function update_force!(fx::AbstractArray{Float64,2},xc::AbstractArray{Float64,2},ks)
    ns=size(xc,1)
    r0=1:ns-2
    r1=2:ns-1
    r2=3:ns

    @views @. fx[r1,r1] = -ks*(4xc[r1,r1]
                            -xc[r1,r0]    # Down
                            -xc[r0,r1]    # Left
                            -xc[r2,r1]    # Right
                            -xc[r1,r2])   # Up
end

which corresponds to the loop version that will not directly run on GPUs.

function update_force!(fx::Array{Float64,2},xc::Array{Float64,2},ks)
    nsx,nsy=size(xc)
    @inbounds for j ∈ 2:nsy-1
        for i ∈ 2:nsx-1
            fx[i,j] = -ks*(4xc[i,j]
            -xc[i,j-1]
            -xc[i-1,j]
            -xc[i+1,j]
            -xc[i,j+1])
        end
    end
end
3 Likes

Even the version that does not artificially create an intermediate,

foo(A) = maximum(abs2,A)

turns out to be slower than the loop:

julia> @btime foo($A);
  1.695 μs (0 allocations: 0 bytes)

julia> @btime bar($A);
  860.459 ns (0 allocations: 0 bytes)

(why in this case, I don’t know, but it seems that the loop version is being able to SIMD better).

A function barrier helps if you’re doing a lot of work behind the barrier, but length is not a lot of work.

Here is a small real-world example that may come up frequently:

using DataFrames
df = DataFrame(x = rand(1024); y = rand(1024));

function loop!(df)
    (; x, y) = df
    z = similar(x)
    @inbounds for i = eachindex(x,y,z)
        z[i] = x[i] + y[i]
    end
    df.z = z
    nothing
end
function broadcast!(df)
    df.z = df.x .+ df.y
    nothing
end
@btime loop!($df)
@btime broadcast!($df)

I get

julia> @btime loop!($df)
  104.254 μs (5124 allocations: 104.16 KiB)

julia> @btime broadcast!($df)
  1.188 μs (4 allocations: 8.20 KiB)

My point is that the broadcast itself is a function barrier.
A loop is not a function barrier.

Thus, if the broadcast is type stable, it will be fast even if the inputs are not. There will be a slow dynamic dispatch to the fast ::Vector{Float64} .+ ::Vector{Float64}.
However, a loop doesn’t have this barrier, so it suffers from dynamic dispatches on every iteration.

2 Likes

“vectorizing” in terms of broadcasting will often fail to “vectorize” in the low level sense even when loops would, because broadcasting is fundamentally dynamic.
This is why FastBroadcast.jl has the option broadcast=false, which defaults to false, meaning it won’t actually broadcast.
This gives access to the nice syntax of broadcasting and the ability to still support GPUs, but without the occasionally-performance-killing dynamism of broadcast=true.
Even if you set broadcast=true, FastBroadcast.jl will still special case the common situation of not broadcasting, and generate specialized code (loops) for only that case, falling back to a slower implementation if necessary.

Anyway, my point is that vectorized syntax by default is bad for vectorization. You can use FastBroadcast.jl to remedy this, which the SciML ecosystem does.

1 Like

Just a meta comment: the use of the word “vectorization” in programming is pretty overloaded. To some people, it means something along the lines of “functions acting on an entire array” and to others it means SIMD.

In julia circles, people typically mean SIMD when they say ‘vectorization’ but this is different from say Python, Matlab, and R where typically the only way to access SIMD is through their ‘vectorized’ functions, whereas in julia it’s pretty easy to get a loop to use SIMD instructions (it’ll often just happen automatically).

This can sometimes cause people to get confused on both sides, so something to be aware of.

6 Likes

@Elrod, thanks for the very educational example using DataFrames, which are type-unstable, making your point very clear. Very appreciated.

As a side note, if we create a function barrier with the loop then it would be even faster than the vectorized version:

function sumxy!(z, x, y)
    @inbounds for i = eachindex(x,y,z)
        z[i] = x[i] + y[i]
    end
    return z
end
function loop!(df)
    (; x, y) = df
    z = similar(x)
    df.z = sumxy!(z, x, y)  # function barrier
    nothing
end
1 Like

Can anybody explain why the type instability? I’d naively think, df is passed as a function argument, eltype of the df columns is Float64

@Eben60, see @bkamins multiple posts and courses on this matter, including here, and his latest book that covers this issue nicely.

Basically, the columns of a DataFrame are AbstractVectors (for maximal user flexibility) and the compiler does not know the actual memory layout. However, inside the function barrier, everything becomes type stable for the compiler and faster code can be generated.

3 Likes