Loops vs Vectorization? When to use?

marianoarnaiz · February 16, 2023, 2:10pm

Hi everyone.
I am optimizing some older codes of mine. I find that sometimes vectorizing is faster than looping (which did not happen before) Could anyone explain when it is better to use one or the other in Julia.

Thanks

lmiq · February 16, 2023, 2:21pm

In most cases the advantage of a vectorized operation comes from the guarantee that the operation is inbounds (the broadcast fails if the dimensions do not match), something that you might have to inform explicitly the compiler if writing a loop.

Elrod · February 16, 2023, 2:23pm

It can also help if things are type unstable by providing a function barrier.

rafael.guerra · February 17, 2023, 12:18am

Chris, could you please decode your important statement for the noobs with a simple example?

In the code sample below, looping via a comprehension performs better than vectorization (if I understand the term correctly) but I do not see a function barrier:

A = ["abcd", [1,2,10], π]
length.(A)                  # 661 ns (9 allocs: 416 bytes)
[length(a) for a in A]      # 242 ns (3 allocs: 112 bytes)

zeroexcuses · February 17, 2023, 12:45am

I’m new to Julia. Can you please explain when vectorizing would be SLOWER than looping? Intuition says vectorizing (when possible) is always faster.

uniment · February 17, 2023, 1:16am

I’m sure you’re aware, that array comprehensions compile to a Generator (with an anonymous function), which is immediately collected into an array. (see the bottom six lines)

julia> Meta.@lower [length(a) for a in A]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#13#14"
│        const var"#13#14"
│   %3 = Core._structtype(Main, Symbol("#13#14"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        Core._setsuper!(%3, Core.Function)
│        var"#13#14" = %3
│        Core._typebody!(%3, Core.svec())
└──      return nothing
)))
│   %2 = Core.svec(var"#13#14", Core.Any)
│   %3 = Core.svec()
│   %4 = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│        $(Expr(:method, false, :(%4), CodeInfo(
    @ none within `none`
1 ─ %1 = length(a)
└──      return %1
)))
│        #13 = %new(var"#13#14")
│   %7 = #13
│   %8 = Base.Generator(%7, A)
│   %9 = Base.collect(%8)
└──      return %9
))))

fft · February 17, 2023, 2:47am

Vectorizing an operation is executing a loop under the hood. In theory a loop and a vectorized operation are doing the same thing.

rafael.guerra · February 17, 2023, 9:02am

Thank you and no, I am not. I only know Julia from the walks in the park, but not from seeing her under the operating table.

uniment · February 17, 2023, 9:14am

Ah, then you might find this interesting

DNF · February 17, 2023, 9:51am

Vectorized expressions sometimes create intermediate temporary arrays which are unnecessary. In many cases you can avoid this, but not always.

With a loop, you can sometimes get much better performance:

# vectorized
foo(A) = maximum(abs2.(A))

# loop
function bar(A)
    val = zero(eltype(A))
    for i in eachindex(A)
        @inbounds val = max(val, abs2(A[i]))
    end
    return val
end

Benchmarks:

julia> using BenchmarkTools

julia> A = rand(-100:100, 10_000);

julia> @btime foo($A);
  7.600 μs (2 allocations: 78.17 KiB)

julia> @btime bar($A);
  2.322 μs (0 allocations: 0 bytes)

As you see, the loop is much faster. But it’s not always easy to predict. If I supply a vector of Floats instead of a vector of Ints, the benchmarks are very different.

(Sorry, made an error in the first draft.)

mdogan · February 17, 2023, 9:56am

You may also want to check the “performance tips” from the official doc.

Performance Tips · The Julia Language

Maybe you’ll find some good tips.

LaurentPlagne · February 17, 2023, 10:10am

Maybe relevant for you, broadcast is also useful to build generic code running on both GPU and CPU.
For example, considering a n_s\times n_s mesh, a basic finite difference scheme can be written as

function update_force!(fx::AbstractArray{Float64,2},xc::AbstractArray{Float64,2},ks)
    ns=size(xc,1)
    r0=1:ns-2
    r1=2:ns-1
    r2=3:ns

    @views @. fx[r1,r1] = -ks*(4xc[r1,r1]
                            -xc[r1,r0]    # Down
                            -xc[r0,r1]    # Left
                            -xc[r2,r1]    # Right
                            -xc[r1,r2])   # Up
end

which corresponds to the loop version that will not directly run on GPUs.

function update_force!(fx::Array{Float64,2},xc::Array{Float64,2},ks)
    nsx,nsy=size(xc)
    @inbounds for j ∈ 2:nsy-1
        for i ∈ 2:nsx-1
            fx[i,j] = -ks*(4xc[i,j]
            -xc[i,j-1]
            -xc[i-1,j]
            -xc[i+1,j]
            -xc[i,j+1])
        end
    end
end

lmiq · February 17, 2023, 11:46am

Even the version that does not artificially create an intermediate,

foo(A) = maximum(abs2,A)

turns out to be slower than the loop:

julia> @btime foo($A);
  1.695 μs (0 allocations: 0 bytes)

julia> @btime bar($A);
  860.459 ns (0 allocations: 0 bytes)

(why in this case, I don’t know, but it seems that the loop version is being able to SIMD better).

Elrod · February 17, 2023, 6:55pm

rafael.guerra:

Chris, could you please decode your important statement for the noobs with a simple example?

In the code sample below, looping via a comprehension performs better than vectorization (if I understand the term correctly) but I do not see a function barrier:
A = ["abcd", [1,2,10], π]
length.(A)                  # 661 ns (9 allocs: 416 bytes)
[length(a) for a in A]      # 242 ns (3 allocs: 112 bytes)

A function barrier helps if you’re doing a lot of work behind the barrier, but length is not a lot of work.

Here is a small real-world example that may come up frequently:

using DataFrames
df = DataFrame(x = rand(1024); y = rand(1024));

function loop!(df)
    (; x, y) = df
    z = similar(x)
    @inbounds for i = eachindex(x,y,z)
        z[i] = x[i] + y[i]
    end
    df.z = z
    nothing
end
function broadcast!(df)
    df.z = df.x .+ df.y
    nothing
end
@btime loop!($df)
@btime broadcast!($df)

I get

julia> @btime loop!($df)
  104.254 μs (5124 allocations: 104.16 KiB)

julia> @btime broadcast!($df)
  1.188 μs (4 allocations: 8.20 KiB)

My point is that the broadcast itself is a function barrier.
A loop is not a function barrier.

Thus, if the broadcast is type stable, it will be fast even if the inputs are not. There will be a slow dynamic dispatch to the fast ::Vector{Float64} .+ ::Vector{Float64}.
However, a loop doesn’t have this barrier, so it suffers from dynamic dispatches on every iteration.

Elrod · February 17, 2023, 7:00pm

“vectorizing” in terms of broadcasting will often fail to “vectorize” in the low level sense even when loops would, because broadcasting is fundamentally dynamic.
This is why FastBroadcast.jl has the option broadcast=false, which defaults to false, meaning it won’t actually broadcast.
This gives access to the nice syntax of broadcasting and the ability to still support GPUs, but without the occasionally-performance-killing dynamism of broadcast=true.
Even if you set broadcast=true, FastBroadcast.jl will still special case the common situation of not broadcasting, and generate specialized code (loops) for only that case, falling back to a slower implementation if necessary.

Anyway, my point is that vectorized syntax by default is bad for vectorization. You can use FastBroadcast.jl to remedy this, which the SciML ecosystem does.

Mason · February 17, 2023, 7:01pm

Just a meta comment: the use of the word “vectorization” in programming is pretty overloaded. To some people, it means something along the lines of “functions acting on an entire array” and to others it means SIMD.

In julia circles, people typically mean SIMD when they say ‘vectorization’ but this is different from say Python, Matlab, and R where typically the only way to access SIMD is through their ‘vectorized’ functions, whereas in julia it’s pretty easy to get a loop to use SIMD instructions (it’ll often just happen automatically).

This can sometimes cause people to get confused on both sides, so something to be aware of.

rafael.guerra · February 17, 2023, 7:33pm

@Elrod, thanks for the very educational example using DataFrames, which are type-unstable, making your point very clear. Very appreciated.

As a side note, if we create a function barrier with the loop then it would be even faster than the vectorized version:

function sumxy!(z, x, y)
    @inbounds for i = eachindex(x,y,z)
        z[i] = x[i] + y[i]
    end
    return z
end
function loop!(df)
    (; x, y) = df
    z = similar(x)
    df.z = sumxy!(z, x, y)  # function barrier
    nothing
end

Eben60 · February 17, 2023, 8:26pm

Can anybody explain why the type instability? I’d naively think, df is passed as a function argument, eltype of the df columns is Float64

rafael.guerra · February 17, 2023, 9:29pm

@Eben60, see @bkamins multiple posts and courses on this matter, including here, and his latest book that covers this issue nicely.

Basically, the columns of a DataFrame are AbstractVectors (for maximal user flexibility) and the compiler does not know the actual memory layout. However, inside the function barrier, everything becomes type stable for the compiler and faster code can be generated.

Topic		Replies	Views
When should I write loops or vectorised calls? General Usage	17	1845	December 1, 2020
Broadcast vs. scalar loop, can Julia vectorize better? Internals & Design	8	1955	February 15, 2020
Comprehensions and vectorization New to Julia	13	4984	September 20, 2017
Simple loop won't vectorize New to Julia	12	1665	January 29, 2019
Loop vs vectorization Performance	4	1576	October 10, 2023

Loops vs Vectorization? When to use?

Related topics