Broadcast vs. scalar loop: can Julia vectorize better?

Or is there something wrong with my code?

Below is an MWE of my code. func1 and func2 do the same thing. Why is func2 so much faster than func1?

Why can’t the compiler figure it out? I think func1 more clearly expresses my intent of performing a vectorized operation. Is there something in func1 that prevents the compiler from generating good code?

I am using Julia Version 1.3.1, Commit 2d57411 (2019-12-30 21:36 UTC).

function func1(a, b, c)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end
function func2(a, b, c)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

With

a = zeros(3, 100)
b = randn(3, 100)
c = randn(100)

The results are:

@benchmark func1(a, b, c)
BenchmarkTools.Trial: 
  allocs estimate:  198
  --------------
  median time:      6.413 μs (0.00% GC)
  --------------

and

@benchmark func2(a, b, c)
BenchmarkTools.Trial: 
  allocs estimate:  0
  --------------
  median time:      407.980 ns (0.00% GC)
  --------------

For a comprehensive comparison (and an example of what a newbie tries), see below. (PS: interpolation, as in func1($a, $b, $c), does not seem to matter much, contrary to what we are told.)

Time in seconds

Function   No interp.   Interpolation   Interp. vs. no-interp
func1      6.41E-06     6.20E-06        3.50%
func2      4.08E-07     3.97E-07        2.79%
func1b     6.42E-06     6.19E-06        3.60%
func2b     4.08E-07     3.97E-07        2.85%
func1c     6.28E-06     6.15E-06        2.06%
func2c     3.99E-07     3.82E-07        4.53%
func2d     3.83E-07     3.72E-07        2.75%
func1e     4.84E-06     4.80E-06        0.77%
func2e     4.01E-07     3.89E-07        3.04%
function func1(a, b, c)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end
function func2(a, b, c)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
function func1b(a::Matrix, b::Matrix, c::Vector)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end
function func2b(a::Matrix, b::Matrix, c::Vector)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
function func1c(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end
function func2c(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
function func2d(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        @simd for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
function func1e(a, b, c)
    @inbounds @views for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end
function func2e(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

Interpolation of variables in benchmarking sometimes matters a lot, and sometimes hardly at all. But it’s not necessarily easy to predict when it matters.
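
The classic case where it matters is a non-const global: without interpolation the benchmark also measures dynamic dispatch on the untyped global, while $x splices the value in. A minimal illustration:

julia> using BenchmarkTools

julia> x = randn(1000);

julia> @btime sum(x);   # x is an untyped global, so dispatch overhead is included

julia> @btime sum($x);  # interpolation: only the sum itself is timed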

As for your code, the reason that func1 is slower than func2 is that slices produce copies, so that when you write

a[:, jdx - 1] .+ b[:, jdx]

a copy is made of one column of both a and b, which has a performance cost. In func2 there are no copies, and this is normally the fastest way.
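
You can see the copy directly with BenchmarkTools: slicing a column allocates a fresh vector, while a view reuses the parent’s memory (a quick check; note that on 1.3 the view wrapper itself may still allocate a little, as discussed below):

julia> x = randn(3, 100);

julia> @btime sum($x[:, 50]);        # the slice copies the column before summing

julia> @btime sum(@view $x[:, 50]);  # the view avoids copying the data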

You can get performance approaching that of func2 if you use views instead of copies, like this:

@views a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]

It should be faster, though perhaps not quite as fast as the double loop.

At least for me on 1.3, using @views is still more than 10x slower and allocates quite a bit. (It’s a bit better on master, but not much.) I was able to get a bit closer using UnsafeArrays.jl, though:

using UnsafeArrays: uview  # from UnsafeArrays.jl; avoids the heap allocation of a SubArray

function func3(a, b, c)
    for jdx = 2:size(a, 2)
        uview(a, :, jdx) .= uview(a, :, jdx - 1) .+ uview(b, :, jdx) .* c[jdx]
    end
    return a
end

At least this one doesn’t allocate anymore, but it’s still more than 2x slower than the handwritten loop.

Yeah, me too.

But here’s a weird one: if I add @inbounds to func2, it slows down by 40%(!) What’s that about?

For me it’s only 20%, but I see the same thing.

Sorry for not including the @views. I had that in my original code but forgot to post it. I updated the table. The question still stands: why do we have to write out the loop manually?

The interpolation is just a red herring.

AFAIK, views are just not sufficiently optimized yet. Removing the overhead is a work in progress, so some slowdown is expected for now. The surprise is how much slower it is.

There’s also a proposal to handle nonscalar indexing via broadcasting:

https://github.com/JuliaLang/julia/issues/30845

There are potential downsides, and if we decide to do it we probably can’t implement it until Julia 2.0. But it’s a fascinating writeup.


Bounds checks are extremely cheap on their own. @inbounds helps mostly because removing the bounds-check branches allows the code to vectorize.
If you add @inbounds, LLVM will vectorize the inner loop so that the vectorized body handles 4 * W elements per iteration, where W is the SIMD vector width. It then adds a scalar loop to catch the remainder.

However, the inner loop here runs only 3 iterations.
So regardless of W (which probably equals 2, 4, or 8 for double precision), 4W > 3, meaning the vectorized body never runs at all, and the primary difference between @inbounds and not-@inbounds is that one version generated a bunch of code that is never actually executed.
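
You can confirm this by comparing the generated IR with and without @inbounds; the exact output depends on your CPU and Julia version:

julia> @code_llvm debuginfo=:none func2(a, b, c)   # scalar loop with bounds-check branches

julia> @code_llvm debuginfo=:none func2c(a, b, c)  # vectorized 4*W body plus a scalar remainder loop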

However, on a computer where W = 8:

julia> using BenchmarkTools

julia> function func2!(a, b, c)
           for jdx = 2:size(a, 2)
               for idx in axes(a, 1)
                   a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
               end
           end
           return a
       end
func2! (generic function with 1 method)

julia> function func3!(a, b, c)
           @inbounds for jdx = 2:size(a, 2)
               @fastmath for idx in axes(a, 1)
                   a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
               end
           end
           return a
       end
func3! (generic function with 1 method)

julia> a3 = zeros(3, 100);

julia> b3 = randn(3, 100);

julia> a31 = zeros(31, 100);

julia> b31 = randn(31, 100);

julia> a32 = zeros(32, 100);

julia> b32 = randn(32, 100);

julia> c = randn(100);

julia> @benchmark func2!($a3, $b3, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     297.405 ns (0.00% GC)
  median time:      297.699 ns (0.00% GC)
  mean time:        301.657 ns (0.00% GC)
  maximum time:     449.135 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     259

julia> @benchmark func3!($a3, $b3, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     253.055 ns (0.00% GC)
  median time:      254.978 ns (0.00% GC)
  mean time:        258.785 ns (0.00% GC)
  maximum time:     396.309 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     362

julia> @benchmark func2!($a31, $b31, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.135 μs (0.00% GC)
  median time:      2.139 μs (0.00% GC)
  mean time:        2.175 μs (0.00% GC)
  maximum time:     4.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark func3!($a31, $b31, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.477 μs (0.00% GC)
  median time:      1.481 μs (0.00% GC)
  mean time:        1.516 μs (0.00% GC)
  maximum time:     4.467 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark func2!($a32, $b32, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.201 μs (0.00% GC)
  median time:      2.206 μs (0.00% GC)
  mean time:        2.309 μs (0.00% GC)
  maximum time:     4.743 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark func3!($a32, $b32, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     788.899 ns (0.00% GC)
  median time:      789.970 ns (0.00% GC)
  mean time:        803.397 ns (0.00% GC)
  maximum time:     1.266 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     99

I did not see the same performance hit from the extraneous @inbounds you did. But note the huge performance bump the @inbounds version saw as soon as there were 32 rows (4 * W): evaluating one extra row made the total time drop by half.

FWIW, the fastest version I tried:

julia> function func4!(a, b, c)
           @inbounds for jdx = 2:size(a, 2)
               @simd ivdep for idx in axes(a, 1)
                   a[idx, jdx] = muladd(b[idx, jdx], c[jdx], a[idx, jdx - 1])
               end
           end
           return a
       end
func4! (generic function with 1 method)

julia> @benchmark func4!($a32, $b32, $c)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     525.853 ns (0.00% GC)
  median time:      534.086 ns (0.00% GC)
  mean time:        544.551 ns (0.00% GC)
  maximum time:     903.990 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     191

muladd combines the separate multiplication and addition instructions into one, and @simd ivdep gets rid of a lot of run-time aliasing checks. This benchmark was with 32 rows.
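
On hardware with FMA support you can see the fused instruction in the native code (output is CPU-dependent):

julia> @code_native muladd(1.0, 2.0, 3.0)  # on x86 with FMA, look for a vfmadd instruction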

Is 3 representative of the number of rows your problem will have?
If it will always be 3, you could hard-code it. If it will always be small and won’t change much, you could consider using a Vector of StaticVectors instead of a Matrix (a sketch follows below).
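
A sketch of the StaticVector variant, assuming StaticArrays.jl (func_static! and the setup names are illustrative; I haven’t benchmarked this against func4!):

using StaticArrays

function func_static!(a::Vector{SVector{3,Float64}}, b, c)
    @inbounds for jdx in 2:length(a)
        # SVector arithmetic is unrolled at compile time for the fixed length 3
        a[jdx] = a[jdx - 1] + b[jdx] * c[jdx]
    end
    return a
end

a = [zero(SVector{3,Float64}) for _ in 1:100]  # 100 "columns" of length 3
b = [@SVector(randn(3)) for _ in 1:100]
c = randn(100)
func_static!(a, b, c)

Because the column length is encoded in the type, there is no vectorization remainder to worry about.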
