Array broadcasting slower than numpy?

chunjiw · June 3, 2022, 8:33pm

@time let m = 10^7 + 19
	v = fill(0, m)
	w = copy(v)
	for _ in 1:100
		i = mod(rand(Int), m)
		v[1:i] .+= @view w[end-i+1:end]
		v[i+1:end] .+= @view w[1:end-i]
	end
end

import time
import numpy as np
import random

t0 = time.time()
m = 10000019
v = np.array([0]*m)
w = np.copy(v)
for _ in range(100):
    i = random.randint(0, m-1)
    if i > 0:
        v[:i] += w[-i:]
        v[i:] += w[:-i]
    else:
        v += w
print(time.time() - t0)

On my machine, the Julia version takes 4.39 seconds, and the Python version takes only 1.85 seconds.
Is there any improvement to the Julia code to match numpy’s performance?

P.S. This came up when I was trying to do the Euler Project No. 655, during which I learned a DP method involving array broadcasting.

jling · June 3, 2022, 8:48pm

change to

               @views v[1:i] .+= w[end-i+1:end]
               @views v[i+1:end] .+= w[1:end-i]

Elrod · June 3, 2022, 9:24pm

Single threaded, I’m getting Base < FastBroadcast.jl < NumPy < LoopVectorization.jl:

julia> @time let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @. v[1:i] += w[end-i+1:end]
                       @views @. v[i+1:end] += w[1:end-i]
               end
       end
  3.530138 seconds (4 allocations: 152.588 MiB, 0.82% gc time)

julia> using FastBroadcast

julia> @time let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @.. v[1:i] += w[end-i+1:end]
                       @views @.. v[i+1:end] += w[1:end-i]
               end
       end
  3.177666 seconds (4 allocations: 152.588 MiB, 0.91% gc time)

julia> using LoopVectorization

julia> @time let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m);
                       @turbo view(v,1:i) .+= view(w,m-i+1:m)
                       @turbo view(v,i+1:m) .+= view(w,1:m-i)
               end
       end
  1.564778 seconds (4 allocations: 152.588 MiB)

Python:

In [3]: import time
   ...: import numpy as np
   ...: import random
   ...:
   ...: t0 = time.time()
   ...: m = 10000019
   ...: v = np.array([0]*m)
   ...: w = np.copy(v)
   ...: for _ in range(100):
   ...:     i = random.randint(0, m-1)
   ...:     if i > 0:
   ...:         v[:i] += w[-i:]
   ...:         v[i:] += w[:-i]
   ...:     else:
   ...:         v += w
   ...: print(time.time() - t0)
1.9152345657348633

FastBroadcast.jl and LoopVectorization.jl are easy to multithread:

julia> @time let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @.. thread=true v[1:i] += w[end-i+1:end]
                       @views @.. thread=true v[i+1:end] += w[1:end-i]
               end
       end
  0.361589 seconds (4 allocations: 152.588 MiB, 2.10% gc time)

julia> @time let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m);
                       @tturbo view(v,1:i) .+= view(w,m-i+1:m)
                       @tturbo view(v,i+1:m) .+= view(w,1:m-i)
               end
       end
  0.372530 seconds (4 allocations: 152.588 MiB)

FastBroadcast and base broadcast are failing to SIMD here, but this code is so memory bound when multithreaded that it doesn’t matter.

Elrod · June 3, 2022, 9:49pm

The problem, as it so often is, was with timing in the global scope.

julia> broadbench() = let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @. v[1:i] += w[end-i+1:end]
                       @views @. v[i+1:end] += w[1:end-i]
               end
       end
broadbench (generic function with 1 method)

julia> fastbroadbench() = let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @.. v[1:i] += w[end-i+1:end]
                       @views @.. v[i+1:end] += w[1:end-i]
               end
       end
fastbroadbench (generic function with 1 method)

julia> broadbench() = let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @. v[1:i] += w[end-i+1:end]
                       @views @. v[i+1:end] += w[1:end-i]
               end
       end
broadbench (generic function with 1 method)

julia> turbobroadbench() = let m = 10^7 + 19
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m);
                       @turbo view(v,1:i) .+= view(w,m-i+1:m)
                       @turbo view(v,i+1:m) .+= view(w,1:m-i)
               end
       end
turbobroadbench (generic function with 1 method)

This yields:

julia> @time broadbench()
  1.337521 seconds (4 allocations: 152.588 MiB, 1.81% gc time)

julia> @time fastbroadbench()
  1.331893 seconds (4 allocations: 152.588 MiB, 1.49% gc time)

julia> @time turbobroadbench()
  1.331875 seconds (4 allocations: 152.588 MiB, 1.35% gc time)

All much better than Python.

lmiq · June 3, 2022, 10:14pm

Why? Shouldn’t the slices on the left always be views?

Elrod · June 3, 2022, 10:23pm

+= is the problem.

julia> Meta.@lower v[1:i] .+= @view w[end-i+1:end]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1  = 1:i
│   %2  = Base.dotview(v, %1)
│   %3  = Base.getindex(v, %1)
└──       goto #3 if not true
2 ─ %5  = (lastindex)(w)
│   %6  = %5 - i
│   %7  = %6 + 1
│   %8  = (lastindex)(w)
│   %9  = %7:%8
│         #s13 = (view)(w, %9)
└──       goto #4
3 ─       #s13 = false
4 ┄ %13 = #s13
│   %14 = Base.broadcasted(+, %3, %13)
│   %15 = Base.materialize!(%2, %14)
└──       return %15
))))

We write into a view, but we add using a getindex(v, 1:i).

jling · June 3, 2022, 10:37pm

Should this be chaged? CAN this be changed?

chunjiw · June 3, 2022, 10:47pm

Wow it really makes a difference. Still I’m wondering why @views is different from @view here?

Benny · June 3, 2022, 10:47pm

I’m not sure if this is a simpler way to look at it, but to quote the docs, “X .+= Y etcetera is equivalent to X .= X .+ Y”. So v[1:i] .+= @view w[end-i+1:end] is equivalent to v[1:i] .= v[1:i] .+ @view w[end-i+1:end]. Just a hazard of the shorthand +='s left side actually representing duplicate subexpressions on the left and right side of the actual assignment.

I am actually surprised here, because the code was in a local let block to begin with, and all the benchmarks are first-call @time. I can’t spot any globals, and the fact that it’s always 4 allocations and 152 MiB suggests to me there weren’t any type instabilities being fixed.

chunjiw · June 3, 2022, 10:50pm

Immediately answered by Benny’s reply lol

Elrod · June 3, 2022, 11:09pm

Perhaps it is more accurate to say the problem is that it isn’t compiled, and the overhead is high.
Dispatches often result in zero allocations.

This allocates a lot because it compiles:

julia> @time let m = 10^7 + 19; (m -> begin
               v = fill(0, m)
               w = copy(v)
               for _ in 1:100
                       i = mod(rand(Int), m)
                       @views @. v[1:i] += w[end-i+1:end]
                       @views @. v[i+1:end] += w[1:end-i]
               end
       end)(m); end
  1.451741 seconds (369.68 k allocations: 177.840 MiB, 1.52% gc time, 6.56% compilation time)

but is still >2x faster, despite the compilation.

Benny · June 3, 2022, 11:33pm

Ah that’s right, Julia compiles per first method call. The let block may have no type instabilities, but it’s not going to benefit from the optimizing compiler. Makes sense, why bother optimizing something that can’t run repeatedly?

giordano · June 3, 2022, 11:37pm

The difference is that

v[1:i] .+= @view w[end-i+1:end]

applies the view only to w[end-i+1:end], instead

@views v[1:i] .+= w[end-i+1:end]

applies the view also to v[1:i].

stevengj · June 3, 2022, 11:48pm

Note also that it’s a lot easier to just apply @views to the whole let block (or the whole function); there’s typically not much reason to apply it line-by-line.

jling · June 4, 2022, 12:15am

Honestly should we change this? What’s the possible use to the current behavior?

giordano · June 4, 2022, 12:33am

I think the current behaviour makes some sense: a += b is lowered to a = a + b:

julia> Meta.@lower a += b
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a + b
│        a = %1
└──      return %1
))))

and since a view wasn’t explicitly applied to a in the right-hand side, the a in left-hand side doesn’t have a view either.

Benny · June 4, 2022, 12:33am

The @view docs seem pretty comprehensive? @view for individual indexing subexpressions, @views to slap @view on every place in an expression that would be a getindex, and @view works on the left side of a broadcasted assignment (.=) even with updating operators (.+=).

jling · June 4, 2022, 12:35am

Yeah I guess there isn’t any natural place where we can special case this huh…

carstenbauer · June 4, 2022, 5:26am

Let’s just default to views in Julia 2.0

gustaphe · June 4, 2022, 5:58am

While we’re optimizing, you could try making m an Int by writing it the same way you did in Python (and can we maybe have 1E7 or something be an integer in Julia 2.0?)

Topic		Replies	Views
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1005	June 21, 2021
Migrating code from Matlab: Accessing arrays without allocating Performance	20	1947	November 19, 2018
Performance of array broadcasting General Usage question	11	1190	July 8, 2017
Comparing Python, Julia, and C++ Performance broadcast , python	21	33235	November 1, 2018
Vector addition in Julia slower than numpy in Linux Performance	21	1297	May 1, 2020

Array broadcasting slower than numpy?

Related topics