# Broadcast vs. scalar loop, can Julia vectorize better?

Or is there something wrong with my code?

Below is an MWE (minimal working example) of my code. `func1` and `func2` do the same thing. Why is `func2` faster than `func1`?

Why can’t the compiler figure it out? I think `func1` more clearly expresses my intent of performing a vectorized operation. Is there something in `func1` that prevents the compiler from generating good code?

I am using `Julia Version 1.3.1`, `Commit 2d57411 (2019-12-30 21:36 UTC)`.

```julia
function func1(a, b, c)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end

function func2(a, b, c)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
```

With

```julia
a = zeros(3, 100)
b = randn(3, 100)
c = randn(100)
```

The results are:

```julia
@benchmark func1(a, b, c)

BenchmarkTools.Trial:
  allocs estimate:  198
  --------------
  median time:      6.413 μs (0.00% GC)
  --------------
```

and

```julia
@benchmark func2(a, b, c)

BenchmarkTools.Trial:
  allocs estimate:  0
  --------------
  median time:      407.980 ns (0.00% GC)
  --------------
```

For a more comprehensive comparison (and an example of what a newbie tries), see below. (PS: interpolation, as in `func1($a, $b, $c)`, does not seem to matter much, contrary to what we are told.)

Time in seconds:

| Function | No interp. | Interpolation | Interp. vs. no-interp. |
|----------|------------|---------------|------------------------|
| `func1`  | 6.41E-06   | 6.20E-06      | 3.50% |
| `func2`  | 4.08E-07   | 3.97E-07      | 2.79% |
| `func1b` | 6.42E-06   | 6.19E-06      | 3.60% |
| `func2b` | 4.08E-07   | 3.97E-07      | 2.85% |
| `func1c` | 6.28E-06   | 6.15E-06      | 2.06% |
| `func2c` | 3.99E-07   | 3.82E-07      | 4.53% |
| `func2d` | 3.83E-07   | 3.72E-07      | 2.75% |
| `func1e` | 4.84E-06   | 4.80E-06      | 0.77% |
| `func2e` | 4.01E-07   | 3.89E-07      | 3.04% |
```julia
function func1(a, b, c)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end

function func2(a, b, c)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

function func1b(a::Matrix, b::Matrix, c::Vector)
    for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end

function func2b(a::Matrix, b::Matrix, c::Vector)
    for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

function func1c(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end

function func2c(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

function func2d(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        @simd for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

function func1e(a, b, c)
    @inbounds @views for jdx = 2:size(a, 2)
        a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
    end
    return a
end

function func2e(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end
```

Interpolation of variables in benchmarking sometimes matters a lot, and sometimes hardly at all. But it’s not necessarily easy to predict when it matters.
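For reference, a minimal sketch of the two benchmarking styles (reusing the `func2`, `a`, `b`, `c` definitions from above):

```julia
using BenchmarkTools

a = zeros(3, 100); b = randn(3, 100); c = randn(100)

# Without $, the globals are looked up during the benchmark itself, which can
# add dispatch overhead; with $, the values are spliced in as locals.
@benchmark func2(a, b, c)
@benchmark func2($a, $b, $c)
```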

As for your code, the reason that `func1` is slower than `func2` is that slices produce copies, so that when you write

```julia
a[:, jdx - 1] .+ b[:, jdx]
```

a copy is made of one column of both `a` and `b`, which has a performance cost. In `func2` there are no copies, and this is normally the fastest way.
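A quick way to see the copying for yourself (a small check, not from the original post):

```julia
a = zeros(3, 100)

s = a[:, 1]        # slicing on the right-hand side allocates a new Vector
v = @view a[:, 1]  # a view wraps the parent array; no copy is made

v[1] = 42.0
a[1, 1] == 42.0    # true: the view aliases `a`
s[2] = 1.0
a[2, 1] == 0.0     # true: the slice is an independent copy
```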

You can get performance approaching that of `func2` if you use `view`s instead of copies, like this:

```julia
@views a[:, jdx] .= a[:, jdx - 1] .+ b[:, jdx] .* c[jdx]
```

It should be faster, though perhaps not quite as fast as the double loop.

At least for me on 1.3, using `@views` is still more than a factor of 10 slower and allocates quite a bit. (It’s a bit better on master, but not much.) I was able to get a bit closer using `UnsafeArrays.jl`, though:

```julia
using UnsafeArrays

function func3(a, b, c)
    for jdx = 2:size(a, 2)
        uview(a, :, jdx) .= uview(a, :, jdx - 1) .+ uview(b, :, jdx) .* c[jdx]
    end
    return a
end
```

At least, this one doesn’t allocate anymore, but it’s still more than a factor of 2x slower than the handwritten loop.

Yeah, me too.

But here’s a weird one: If I add `@inbounds` in `func2` it slows down by 40%(!) What’s that about?

For me it’s only 20%, but I see the same thing.

Sorry for not including the `@view`. I had that in my original code but forgot to post it. I updated the table. The question still persists: why do we have to write the loop out manually?

The interpolation is just a red herring.

AFAIK, `view`s are just not sufficiently optimized yet. It’s a work in progress to remove the overhead, so that is expected, for now. The surprise is how much slower it is.

There’s also a proposal to handle nonscalar indexing via broadcasting:

There are potential downsides, and if we decide to do it we probably can’t implement it until Julia 2.0. But it’s a fascinating writeup.


Bounds checks themselves are extremely cheap. `@inbounds` helps mostly because removing the branches allows the code to vectorize. With `@inbounds`, LLVM vectorizes the inner loop so that each pass of the vectorized body handles 4 * W iterations, where W is the vector width, and adds a scalar loop to catch the remainder.

However, the inner loop here runs only 3 iterations. So regardless of `W` (probably 2, 4, or 8 for double precision), `4W > 3`, and the vectorized body never runs at all; the primary difference between `@inbounds` and no `@inbounds` is that one version generated a bunch of code that is never actually executed.
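One way to check this for your own sizes (a sketch; `func2_inbounds!` is a hypothetical name for the `@inbounds` variant of `func2`) is to inspect the generated IR and look for wide vector types such as `<4 x double>`:

```julia
function func2_inbounds!(a, b, c)
    @inbounds for jdx = 2:size(a, 2)
        for idx in axes(a, 1)
            a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
        end
    end
    return a
end

a = zeros(3, 100); b = randn(3, 100); c = randn(100)
# The IR should show a wide vectorized body plus a scalar remainder loop;
# with only 3 rows, only the remainder loop ever executes.
@code_llvm func2_inbounds!(a, b, c)
```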

However, on a computer where `W = 8`:

```julia
julia> using BenchmarkTools

julia> function func2!(a, b, c)
           for jdx = 2:size(a, 2)
               for idx in axes(a, 1)
                   a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
               end
           end
           return a
       end
func2! (generic function with 1 method)

julia> function func3!(a, b, c)
           @inbounds for jdx = 2:size(a, 2)
               @fastmath for idx in axes(a, 1)
                   a[idx, jdx] = a[idx, jdx - 1] + b[idx, jdx] * c[jdx]
               end
           end
           return a
       end
func3! (generic function with 1 method)

julia> a3 = zeros(3, 100);

julia> b3 = randn(3, 100);

julia> a31 = zeros(31, 100);

julia> b31 = randn(31, 100);

julia> a32 = zeros(32, 100);

julia> b32 = randn(32, 100);

julia> c = randn(100);

julia> @benchmark func2!($a3, $b3, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     297.405 ns (0.00% GC)
  median time:      297.699 ns (0.00% GC)
  mean time:        301.657 ns (0.00% GC)
  maximum time:     449.135 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     259

julia> @benchmark func3!($a3, $b3, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     253.055 ns (0.00% GC)
  median time:      254.978 ns (0.00% GC)
  mean time:        258.785 ns (0.00% GC)
  maximum time:     396.309 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     362

julia> @benchmark func2!($a31, $b31, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.135 μs (0.00% GC)
  median time:      2.139 μs (0.00% GC)
  mean time:        2.175 μs (0.00% GC)
  maximum time:     4.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark func3!($a31, $b31, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.477 μs (0.00% GC)
  median time:      1.481 μs (0.00% GC)
  mean time:        1.516 μs (0.00% GC)
  maximum time:     4.467 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark func2!($a32, $b32, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.201 μs (0.00% GC)
  median time:      2.206 μs (0.00% GC)
  mean time:        2.309 μs (0.00% GC)
  maximum time:     4.743 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark func3!($a32, $b32, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     788.899 ns (0.00% GC)
  median time:      789.970 ns (0.00% GC)
  mean time:        803.397 ns (0.00% GC)
  maximum time:     1.266 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     99
```

I did not see the same performance hit from the extraneous `@inbounds` that you did. But note the huge performance bump the `@inbounds` version saw as soon as there were 32 rows (4 * W): evaluating one extra row made the total time drop by half.

FWIW, the fastest version I tried:

```julia
julia> function func4!(a, b, c)
           @inbounds for jdx = 2:size(a, 2)
               @simd ivdep for idx in axes(a, 1)
                   a[idx, jdx] = muladd(b[idx, jdx], c[jdx], a[idx, jdx - 1])
               end
           end
           return a
       end
func4! (generic function with 1 method)

julia> @benchmark func4!($a32, $b32, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     525.853 ns (0.00% GC)
  median time:      534.086 ns (0.00% GC)
  mean time:        544.551 ns (0.00% GC)
  maximum time:     903.990 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     191
```

`muladd` combines the separate multiplication and addition into a single instruction, and `@simd ivdep` gets rid of a lot of run-time aliasing checks. This benchmark was with 32 rows.

Is `3` representative of the number of rows your problem will have?
If it will always be 3, you could hard-code it. If it will always be small and not change much, you could consider using a vector of `SVector`s (from StaticArrays.jl) instead of a matrix.
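For instance, a hedged sketch of the `SVector` approach (`func_sv!` is a hypothetical name; assumes StaticArrays.jl is installed):

```julia
using StaticArrays

# With the row count fixed at 3, each column becomes one SVector{3,Float64}.
# SVector arithmetic is unrolled at compile time, and each assignment builds
# a new immutable SVector, so the loop body does not allocate.
function func_sv!(a::Vector{SVector{3,Float64}}, b, c)
    for jdx = 2:length(a)
        a[jdx] = a[jdx - 1] + b[jdx] * c[jdx]
    end
    return a
end

a = [@SVector zeros(3) for _ in 1:100]
b = [@SVector randn(3) for _ in 1:100]
c = randn(100)
func_sv!(a, b, c)
```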
