# Array broadcasting slower than numpy?

``````
@time let m = 10^7 + 19
    v = fill(0, m)
    w = copy(v)
    for _ in 1:100
        i = mod(rand(Int), m)
        v[1:i] .+= @view w[end-i+1:end]
        v[i+1:end] .+= @view w[1:end-i]
    end
end
``````
``````
import time
import numpy as np
import random

t0 = time.time()
m = 10000019
v = np.array([0]*m)
w = np.copy(v)
for _ in range(100):
    i = random.randint(0, m-1)
    if i > 0:
        v[:i] += w[-i:]
        v[i:] += w[:-i]
    else:
        v += w
print(time.time() - t0)
``````

On my machine, the Julia version takes 4.39 seconds, while the Python version takes only 1.85 seconds.
Is there any way to improve the Julia code to match numpy's performance?

P.S. This came up while I was working on Project Euler problem 655, where I learned a DP method involving array broadcasting.
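(As an aside, the two slice updates together are just adding a cyclic shift of `w` to `v`. A minimal NumPy sketch of that equivalence, with illustrative values for `m` and `i`:)

```python
import numpy as np

m = 11
v = np.zeros(m, dtype=int)
w = np.arange(m)
i = 4

# two-slice update, as in the question
v1 = v.copy()
v1[:i] += w[-i:]
v1[i:] += w[:-i]

# equivalent single update using a cyclic shift of w
v2 = v + np.roll(w, i)

assert (v1 == v2).all()
```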

Change the two broadcast lines to:

``````
@views v[1:i] .+= w[end-i+1:end]
@views v[i+1:end] .+= w[1:end-i]
``````
5 Likes

``````
julia> @time let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @. v[1:i] += w[end-i+1:end]
               @views @. v[i+1:end] += w[1:end-i]
           end
       end
  3.530138 seconds (4 allocations: 152.588 MiB, 0.82% gc time)

julia> using FastBroadcast  # provides @..

julia> @time let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @.. v[1:i] += w[end-i+1:end]
               @views @.. v[i+1:end] += w[1:end-i]
           end
       end
  3.177666 seconds (4 allocations: 152.588 MiB, 0.91% gc time)

julia> using LoopVectorization

julia> @time let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @turbo view(v, 1:i) .+= view(w, m-i+1:m)
               @turbo view(v, i+1:m) .+= view(w, 1:m-i)
           end
       end
  1.564778 seconds (4 allocations: 152.588 MiB)
``````

Python:

``````
In [3]: import time
...: import numpy as np
...: import random
...:
...: t0 = time.time()
...: m = 10000019
...: v = np.array([0]*m)
...: w = np.copy(v)
...: for _ in range(100):
...:     i = random.randint(0, m-1)
...:     if i > 0:
...:         v[:i] += w[-i:]
...:         v[i:] += w[:-i]
...:     else:
...:         v += w
...: print(time.time() - t0)
1.9152345657348633
``````

``````
julia> @time let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @.. thread=true v[1:i] += w[end-i+1:end]
               @views @.. thread=true v[i+1:end] += w[1:end-i]
           end
       end
  0.361589 seconds (4 allocations: 152.588 MiB, 2.10% gc time)

julia> @time let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @tturbo view(v, 1:i) .+= view(w, m-i+1:m)
               @tturbo view(v, i+1:m) .+= view(w, 1:m-i)
           end
       end
  0.372530 seconds (4 allocations: 152.588 MiB)
``````

FastBroadcast and base broadcast are failing to SIMD here, but this code is so memory-bound when multithreaded that it doesn't matter.

2 Likes

The problem, as it so often is, was with timing in the global scope.

``````
julia> broadbench() = let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @. v[1:i] += w[end-i+1:end]
               @views @. v[i+1:end] += w[1:end-i]
           end
       end
broadbench (generic function with 1 method)

julia> fastbroadbench() = let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @.. v[1:i] += w[end-i+1:end]
               @views @.. v[i+1:end] += w[1:end-i]
           end
       end
fastbroadbench (generic function with 1 method)

julia> turbobroadbench() = let m = 10^7 + 19
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @turbo view(v, 1:i) .+= view(w, m-i+1:m)
               @turbo view(v, i+1:m) .+= view(w, 1:m-i)
           end
       end
turbobroadbench (generic function with 1 method)
``````

This yields:

``````
julia> @time broadbench()
  1.337521 seconds (4 allocations: 152.588 MiB, 1.81% gc time)

julia> @time fastbroadbench()
  1.331893 seconds (4 allocations: 152.588 MiB, 1.49% gc time)

julia> @time turbobroadbench()
  1.331875 seconds (4 allocations: 152.588 MiB, 1.35% gc time)
``````

All much better than Python.

3 Likes

Why? Shouldn't the slices on the left always be views?

`+=` is the problem.

``````
julia> Meta.@lower v[1:i] .+= @view w[end-i+1:end]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1  = 1:i
│   %2  = Base.dotview(v, %1)
│   %3  = Base.getindex(v, %1)
└──       goto #3 if not true
2 ─ %5  = (lastindex)(w)
│   %6  = %5 - i
│   %7  = %6 + 1
│   %8  = (lastindex)(w)
│   %9  = %7:%8
│         #s13 = (view)(w, %9)
└──       goto #4
3 ─       #s13 = false
4 ┄ %13 = #s13
│   %14 = Base.broadcasted(+, %3, %13)
│   %15 = Base.materialize!(%2, %14)
└──       return %15
))))
``````

We write into a view (the `dotview`), but the right-hand side reads with `getindex(v, 1:i)`, which allocates a copy of the slice.

2 Likes

Should this be changed? CAN this be changed?

Wow, it really makes a difference. Still, I'm wondering why `@views` is different from `@view` here?

I'm not sure if this is a simpler way to look at it, but to quote the docs, "`X .+= Y` etcetera is equivalent to `X .= X .+ Y`". So `v[1:i] .+= @view w[end-i+1:end]` is equivalent to `v[1:i] .= v[1:i] .+ @view w[end-i+1:end]`. Just a hazard of the shorthand `+=`'s left side actually representing duplicate subexpressions on the left and right side of the actual assignment.
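(For comparison, this pitfall doesn't arise on the NumPy side: basic slices like `v[:i]` are views into the original array, and `v[:i] += ...` dispatches to an in-place add, so no temporary copy of the left-hand side is made. A minimal sketch with illustrative values:)

```python
import numpy as np

v = np.arange(6)               # [0, 1, 2, 3, 4, 5]
w = np.ones(6, dtype=v.dtype)

s = v[:3]                      # basic slicing returns a view, not a copy
assert np.shares_memory(s, v)

v[:3] += w[3:]                 # in-place add on the view: no copy of v[:3]
assert v.tolist() == [1, 2, 3, 3, 4, 5]
```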

I am actually surprised here, because the code was in a local `let` block to begin with, and all the benchmarks are first-call `@time`. I can't spot any globals, and the fact that it's always 4 allocations and 152 MiB suggests to me there weren't any type instabilities being fixed.

5 Likes

Perhaps it is more accurate to say the problem is that it isn't compiled, and the overhead is high.
Dispatches often result in zero allocations.

This allocates a lot because it compiles:

``````
julia> @time let m = 10^7 + 19; (m -> begin
           v = fill(0, m)
           w = copy(v)
           for _ in 1:100
               i = mod(rand(Int), m)
               @views @. v[1:i] += w[end-i+1:end]
               @views @. v[i+1:end] += w[1:end-i]
           end
       end)(m); end
  1.451741 seconds (369.68 k allocations: 177.840 MiB, 1.52% gc time, 6.56% compilation time)
``````

but is still >2x faster, despite the compilation.

2 Likes

Ah, that's right: Julia compiles per first method call. The `let` block may have no type instabilities, but it's not going to benefit from the optimizing compiler. Makes sense: why bother optimizing something that can't run repeatedly?

The difference is that

``````v[1:i] .+= @view w[end-i+1:end]
``````

applies the view only to `w[end-i+1:end]`, whereas

``````@views v[1:i] .+= w[end-i+1:end]
``````

applies the view also to `v[1:i]`.

1 Like

Note also that it's a lot easier to just apply `@views` to the whole `let` block (or the whole function); there's typically not much reason to apply it line-by-line.

5 Likes

Honestly, should we change this? What's the possible use of the current behavior?

I think the current behaviour makes some sense: `a += b` is lowered to `a = a + b`:

``````
julia> Meta.@lower a += b
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a + b
│        a = %1
└──      return %1
))))
``````

and since a view wasn't explicitly applied to `a` on the right-hand side, the `a` on the left-hand side doesn't get a view either.
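(Python's `+=` illustrates the same distinction from the other direction: for objects that define an in-place `__iadd__`, like NumPy arrays, it mutates the object, while otherwise it falls back to rebinding the name, which is what Julia's `a += b` lowering always does. A small sketch:)

```python
import numpy as np

# ndarray defines __iadd__, so += mutates in place
a = np.array([1, 2, 3])
alias = a                  # second name for the same array
a += 1
assert alias[0] == 2       # mutation is visible through the alias

# plain ints have no __iadd__, so += rebinds the name instead
x = 1
y = x
x += 1
assert x == 2 and y == 1   # y still refers to the old object
```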

The `@view` docs seem pretty comprehensive? `@view` is for individual indexing subexpressions, `@views` puts `@view` on every place in an expression that would be a `getindex`, and `@view` works on the left side of a broadcasted assignment (`.=`) even with updating operators (`.+=`).

Yeah, I guess there isn't any natural place where we can special-case this, huh…

Let's just default to views in Julia 2.0

3 Likes

While we're optimizing, you could try making `m` an `Int` by writing it the same way you did in Python (and can we maybe have `1E7` or something be an integer in Julia 2.0?)