# Array broadcasting slower than numpy?

``````@time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
v[1:i] .+= @view w[end-i+1:end]
v[i+1:end] .+= @view w[1:end-i]
end
end
``````
``````import time
import numpy as np
import random

t0 = time.time()
m = 10000019
v = np.array(*m)
w = np.copy(v)
for _ in range(100):
i = random.randint(0, m-1)
if i > 0:
v[:i] += w[-i:]
v[i:] += w[:-i]
else:
v += w
print(time.time() - t0)
``````

On my machine, the Julia version takes 4.39 seconds, and the Python version takes only 1.85 seconds.
Is there any improvement to the Julia code to match numpy’s performance?

P.S. This came up when I was trying to do the Euler Project No. 655, during which I learned a DP method involving array broadcasting.

change to

``````               @views v[1:i] .+= w[end-i+1:end]
@views v[i+1:end] .+= w[1:end-i]
``````
5 Likes

``````julia> @time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @. v[1:i] += w[end-i+1:end]
@views @. v[i+1:end] += w[1:end-i]
end
end
3.530138 seconds (4 allocations: 152.588 MiB, 0.82% gc time)

julia> @time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @.. v[1:i] += w[end-i+1:end]
@views @.. v[i+1:end] += w[1:end-i]
end
end
3.177666 seconds (4 allocations: 152.588 MiB, 0.91% gc time)

julia> using LoopVectorization

julia> @time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m);
@turbo view(v,1:i) .+= view(w,m-i+1:m)
@turbo view(v,i+1:m) .+= view(w,1:m-i)
end
end
1.564778 seconds (4 allocations: 152.588 MiB)
``````

Python:

``````In : import time
...: import numpy as np
...: import random
...:
...: t0 = time.time()
...: m = 10000019
...: v = np.array(*m)
...: w = np.copy(v)
...: for _ in range(100):
...:     i = random.randint(0, m-1)
...:     if i > 0:
...:         v[:i] += w[-i:]
...:         v[i:] += w[:-i]
...:     else:
...:         v += w
...: print(time.time() - t0)
1.9152345657348633
``````

``````julia> @time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @.. thread=true v[1:i] += w[end-i+1:end]
@views @.. thread=true v[i+1:end] += w[1:end-i]
end
end
0.361589 seconds (4 allocations: 152.588 MiB, 2.10% gc time)

julia> @time let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m);
@tturbo view(v,1:i) .+= view(w,m-i+1:m)
@tturbo view(v,i+1:m) .+= view(w,1:m-i)
end
end
0.372530 seconds (4 allocations: 152.588 MiB)
``````

FastBroadcast and base broadcast are failing to SIMD here, but this code is so memory bound when multithreaded that it doesn’t matter.

2 Likes

The problem, as it so often is, was with timing in the global scope.

``````julia> broadbench() = let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @. v[1:i] += w[end-i+1:end]
@views @. v[i+1:end] += w[1:end-i]
end
end
broadbench (generic function with 1 method)

julia> fastbroadbench() = let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @.. v[1:i] += w[end-i+1:end]
@views @.. v[i+1:end] += w[1:end-i]
end
end
fastbroadbench (generic function with 1 method)

julia> broadbench() = let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @. v[1:i] += w[end-i+1:end]
@views @. v[i+1:end] += w[1:end-i]
end
end
broadbench (generic function with 1 method)

julia> turbobroadbench() = let m = 10^7 + 19
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m);
@turbo view(v,1:i) .+= view(w,m-i+1:m)
@turbo view(v,i+1:m) .+= view(w,1:m-i)
end
end
turbobroadbench (generic function with 1 method)
``````

This yields:

``````julia> @time broadbench()
1.337521 seconds (4 allocations: 152.588 MiB, 1.81% gc time)

1.331893 seconds (4 allocations: 152.588 MiB, 1.49% gc time)

1.331875 seconds (4 allocations: 152.588 MiB, 1.35% gc time)
``````

All much better than Python.

3 Likes

Why? Shouldn’t the slices on the left always be views?

`+=` is the problem.

``````julia> Meta.@lower v[1:i] .+= @view w[end-i+1:end]
:(\$(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1  = 1:i
│   %2  = Base.dotview(v, %1)
│   %3  = Base.getindex(v, %1)
└──       goto #3 if not true
2 ─ %5  = (lastindex)(w)
│   %6  = %5 - i
│   %7  = %6 + 1
│   %8  = (lastindex)(w)
│   %9  = %7:%8
│         #s13 = (view)(w, %9)
└──       goto #4
3 ─       #s13 = false
4 ┄ %13 = #s13
│   %14 = Base.broadcasted(+, %3, %13)
│   %15 = Base.materialize!(%2, %14)
└──       return %15
))))
``````

We write into a view, but we add using a `getindex(v, 1:i)`.

2 Likes

Should this be chaged? CAN this be changed?

Wow it really makes a difference. Still I’m wondering why `@views` is different from `@view` here?

I’m not sure if this is a simpler way to look at it, but to quote the docs, "`X .+= Y` etcetera is equivalent to `X .= X .+ Y`". So `v[1:i] .+= @view w[end-i+1:end]` is equivalent to `v[1:i] .= v[1:i] .+ @view w[end-i+1:end]`. Just a hazard of the shorthand `+=`'s left side actually representing duplicate subexpressions on the left and right side of the actual assignment.

I am actually surprised here, because the code was in a local `let` block to begin with, and all the benchmarks are first-call `@time`. I can’t spot any globals, and the fact that it’s always 4 allocations and 152 MiB suggests to me there weren’t any type instabilities being fixed.

5 Likes

Perhaps it is more accurate to say the problem is that it isn’t compiled, and the overhead is high.
Dispatches often result in zero allocations.

This allocates a lot because it compiles:

``````julia> @time let m = 10^7 + 19; (m -> begin
v = fill(0, m)
w = copy(v)
for _ in 1:100
i = mod(rand(Int), m)
@views @. v[1:i] += w[end-i+1:end]
@views @. v[i+1:end] += w[1:end-i]
end
end)(m); end
1.451741 seconds (369.68 k allocations: 177.840 MiB, 1.52% gc time, 6.56% compilation time)
``````

but is still >2x faster, despite the compilation.

2 Likes

Ah that’s right, Julia compiles per first method call. The let block may have no type instabilities, but it’s not going to benefit from the optimizing compiler. Makes sense, why bother optimizing something that can’t run repeatedly?

The difference is that

``````v[1:i] .+= @view w[end-i+1:end]
``````

applies the view only to `w[end-i+1:end]`, instead

``````@views v[1:i] .+= w[end-i+1:end]
``````

applies the view also to `v[1:i]`.

1 Like

Note also that it’s a lot easier to just apply `@views` to the whole `let` block (or the whole function); there’s typically not much reason to apply it line-by-line.

5 Likes

Honestly should we change this? What’s the possible use to the current behavior?

I think the current behaviour makes some sense: `a += b` is lowered to `a = a + b`:

``````julia> Meta.@lower a += b
:(\$(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = a + b
│        a = %1
└──      return %1
))))
``````

and since a view wasn’t explicitly applied to `a` in the right-hand side, the `a` in left-hand side doesn’t have a view either.

The @view docs seem pretty comprehensive? `@view` for individual indexing subexpressions, `@views` to slap `@view` on every place in an expression that would be a `getindex`, and `@view` works on the left side of a broadcasted assignment (`.=`) even with updating operators (`.+=`).

Yeah I guess there isn’t any natural place where we can special case this huh…

Let’s just default to views in Julia 2.0 3 Likes

While we’re optimizing, you could try making `m` an `Int` by writing it the same way you did in Python (and can we maybe have `1E7` or something be an integer in Julia 2.0?)