Broadcasted assignment slower than for loop when indices are arbitrary

mehdi · August 20, 2021, 1:05pm

Is there a way to make a broadcasted values assignment faster than a for loop when the subset of indexes on which the assignment is done is arbitrary?

I explain with this simple example:

N = 10000;
a = ones(N); ##Original vector
b = zeros(N); ##New values to assign to a
w = collect(1:N)[rand(N).> 0.5]; ##random indices where a will be assigned the values of b

Now assignment with broadcasting:

 @benchmark $a[$w] .= $b[$w]

gives:

memory estimate:  3.81 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.728 ms (0.00% GC)
  median time:      1.961 ms (0.00% GC)
  mean time:        2.561 ms (12.68% GC)
  maximum time:     11.362 ms (70.52% GC)
  --------------
  samples:          1944
  evals/sample:     1

and a for loop assignment (without changing w):

 @benchmark for i in $w
    $a[i] = $b[i]
end

gives:

  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     880.235 μs (0.00% GC)
  median time:      918.504 μs (0.00% GC)
  mean time:        930.052 μs (0.00% GC)
  maximum time:     1.790 ms (0.00% GC)
  --------------
  samples:          5313
  evals/sample:     1

which is more than twice as fast and doesn’t allocate any memory.

Is this linked to the creation of a view of a when assigning using a broadcast?
Is there a way to make the assignment written as concisely as using dots and as fast as a for loop?

gbaraldi · August 20, 2021, 2:01pm

You can make the dot assignment a bit faster by putting @inbounds before it.

stevengj · August 20, 2021, 3:10pm

The right-hand side makes a copy here.

mehdi · August 20, 2021, 4:13pm

Yes, sorry, I forgot to put the benchmark, but it’s actually even slower with a view:

@benchmark $a[$w] .= view($b, $w)

gives:

memory estimate:  7.62 MiB
  allocs estimate:  4
  --------------
  minimum time:     2.302 ms (0.00% GC)
  median time:      2.837 ms (0.00% GC)
  mean time:        4.060 ms (17.25% GC)
  maximum time:     13.462 ms (57.92% GC)
  --------------
  samples:          1228
  evals/sample:     1

This seems to confirm that it is the view creation that takes time (because the indexing vector is arbitrary?).

jishnub · August 20, 2021, 5:11pm

From what I see

julia> @btime $b[$w];
  9.685 μs (2 allocations: 39.39 KiB)

julia> @btime Base.setindex!($a, $(b[w]), $w);
  6.235 μs (0 allocations: 0 bytes)

So it’s computing b[w] that takes time. We may skip this and use a view instead.

julia> @btime for i in $w
          $a[i] = $b[i]
       end;
  7.330 μs (0 allocations: 0 bytes)

julia> @btime $a[$w] = @view $b[$w];
  10.459 μs (0 allocations: 0 bytes)

julia> @btime $a[$w] .= $b[$w];
  20.103 μs (2 allocations: 39.39 KiB)

The broadcasted assignment is twice as slow compared to the non-broadcasted version. This might be the same issue as https://github.com/JuliaLang/julia/issues/40962

Turning off bounds checking helps a bit

julia> @btime @inbounds for i in $w
          $a[i] = $b[i]
       end;
  6.951 μs (0 allocations: 0 bytes)

julia> @btime @inbounds $a[$w] = @view $b[$w];
  7.671 μs (0 allocations: 0 bytes)

julia> @btime @inbounds $a[$w] .= $b[$w];
  15.624 μs (2 allocations: 39.39 KiB)

With this the loop and the non-broadcasted assignment are close in performance.

mehdi · August 20, 2021, 7:41pm

Thank you for the suggestion, but the non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b) since the resulting vector has to be computed before the assignment.

lmiq · August 21, 2021, 7:28pm

In that case you probably want something like this:

julia> N=10000;

julia> a = ones(N);

julia> b = zeros(N);

julia> w = collect(1:N)[rand(N).> 0.5];

julia> function f!(a,b,w) # mutating
         for i in w
           @inbounds a[i] = b[i]
         end
       end
f! (generic function with 2 methods)

julia> @btime f!($a,$b,$w)
  2.901 μs (0 allocations: 0 bytes)

julia> function f(b,w) # non-mutating
         a = similar(b,size(w))
         j = 0
         for i in w
           j += 1
           @inbounds a[j] = b[i]
         end
         return a
       end
f (generic function with 1 method)

julia> @btime f($b,$w)
  3.029 μs (2 allocations: 39.02 KiB)

I guess it is possible to a broadcasted operation be lowered to something like that, but making that work in general is possibly very hard. Using views probably don’t help much either, since the view has to be discontinuous in memory, thus it has to be an array of pointers of the same size.

anon56330260 · August 22, 2021, 5:16am

Another problem here is that, the broadcasted version uses two indexsets, while non-broadcasted version uses only one:

@benchmark $a[$w] .= $b[$w]

which lowers roughly to:

#`w1` is the w in $a[$w] , `w2` is the w in $b[$w]
for i in 1:length(w1)
    a[w1[i]] = b[w2[i]]
end

So you have two w here (two indexsets w1,w2). To get value from b and set the value to a, we need to perform two loads from these two w to get the indices. One load is redundant since w1 == w2, however compiler doesn’t know it since when you pass the w1, w2 into the broadcasted function this information is lost. So compiler has to assume that w1 is generally not equal to w2. In contrast, the non-broadcasted only performs one load.

mehdi · August 22, 2021, 2:16pm

Thanks! But this solution is still subject to the problem I mentioned in my reply:

The non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b ) since the resulting vector has to be computed before the assignment.

Because the arguments are evaluated before being passed to the functions, this will create a new array (the result of the operations, like 2 .* b). This doesn’t happen when using a for loop, where the result of an operation is calculated and then gets assigned element-wise, avoiding allocations.

lmiq · August 22, 2021, 2:25pm

The idea is that you change the loop such that it evaluates what you want. You can pass a function to be applied element wise as well.

mehdi · August 22, 2021, 3:14pm

Sorry if I am not understanding you well, but do you mean writing a specific function for each different assignment involving different operations, or writing something general like this (or something else?)?

function set_test!(func, a, w, b...)
    func_arg = zeros(length(b))
    for i in w
        for f_a=1:length(func_arg)
            func_arg[f_a] = b[f_a][i]
        end
        a[i] = func(func_arg...)
    end
end

I tried set_test! and it is actually is quite slow.

lmiq · August 22, 2021, 4:10pm

I mean that if you want to compute a = 2*b, you just do it explicitly in the loop:

julia> function f(b,w) # non-mutating
                a = similar(b,size(w))
                j = 0
                for i in w
                  j += 1
                  @inbounds a[j] = 2*b[i]
                end
                return a
              end
f (generic function with 1 method)

And, if you want this function to be somewhat more generic, pass the function to be evaluated for each b:

julia> function f(func,b,w) # non-mutating
                a = similar(b,size(w))
                j = 0
                for i in w
                  j += 1
                  @inbounds a[j] = func(b[i])
                end
                return a
              end
f (generic function with 2 methods)

julia> g(x) = 2*x + 1
g (generic function with 1 method)

julia> f(g,b,w)
4987-element Vector{Float64}:
 1.0
 1.0
...

This can be progressively generalized to receive more than one vector, etc. But one thing is to obtain a fast implementation of what you want to compute specifically, other thing is to implement something that is very generic. These are two different goals, which are sometimes compatible, sometimes less so (the more generic, the harder is to keep the code performant - and the good thing about Julia is that you can just solve your problem very efficiently, without having to rely on some generic approach someone implemented in a library that should fit all needs).

mehdi · August 22, 2021, 5:55pm

Thank you. My initial idea was to find something as fast as a for loop implementation and as concise as the dot notation, so I actually was hoping for a generic approach that would have the best of both worlds (so that I don’t have to write a different function every time the number of arguments on the rhs of the assignment changes). I thought that this question was so simple (because the function would boil down to a for loop) that I was simply missing an obvious function or macro, but I am learning that it’s not case

mehdi · August 22, 2021, 6:38pm

Ok this seems to be working fine (not as fast as writing a for loop but a “good” compromise):

function set_awfb!(func::F, a, w, b::Vararg{T, N}) where {F, T, N}
    for i in w
        @inbounds a[i] = func(map(x -> x[i], b)...)
    end
end

which is a generic version of the function suggested by @lmiq and I used splatting with map (suggestion found here: Work around splatting to avoid unecessary allocations) and type declaration (suggestion found here: Splatting arguments causes ~30x slow down) to avoid allocations.

An idea of the performance with a simple addition:

N = 10000;
a = ones(N); b = rand(N); c = rand(N);
w = collect(1:N)[rand(N) .< 0.5];

@btime set_awfb!(+, $a, $w, $b, $c)
  6.540 μs (0 allocations: 0 bytes)

@btime for i in $w
           @inbounds $a[i] = $b[i] + $c[i]
       end
  5.402 μs (0 allocations: 0 bytes)

@btime $a[$w] .= $b[$w] .+ $c[$w];
  17.878 μs (4 allocations: 78.03 KiB)

So still slower, but a good improvement compared to broadcasting!

dlakelan · August 23, 2021, 12:14am

I think you should be able to write a macro that works

anon56330260 · August 23, 2021, 8:10am

It’s slower because you need an additional @inbounds in the function:

function set_awfb!(func::F, a, w, b::Vararg{T, N}) where {F, T, N}
    for i in w
        @inbounds a[i] = func(map(x -> (@inbounds x[i]), b)...)
    end
end

And this proves what I have said before, the main cause of slowness comes from the redundant load. After removing the load (and bound checking), you restore the full performance.
I think you should just write a for-loop instead of designing these broken abstractions. It’s not quite easy for a normal user to do this. To do this generally, you either need to perform static analysis to decide whether two arrays use the same indexset (which is expensive and extremely hard), or you perform a runtime test, e.g., you perform a comparison of w1 and w2 which is just a cheap arithmetic, but then you need to generate optimistic codes depending on the comparison result at runtime, which is infeasible currently in Julia (we can only depend on types, not general values).
You can also use macros, but it’s just hack and not general. I don’t recommend you to do so, writing a loop is much more clear than using a macro.

Topic		Replies	Views
Understanding assignment without allocation New to Julia question	14	1707	November 12, 2019
Broadcasting slower than for-loop New to Julia	6	427	December 13, 2023
Performance of loops New to Julia broadcast , loops , aliasing	30	867	August 19, 2024
Array broadcasting slower than numpy? Performance	20	722	June 4, 2022
Migrating code from Matlab: Accessing arrays without allocating Performance	20	1947	November 19, 2018

Broadcasted assignment slower than for loop when indices are arbitrary

Related topics