Broadcasted assignment slower than for loop when indices are arbitrary

Is there a way to make a broadcasted values assignment faster than a for loop when the subset of indexes on which the assignment is done is arbitrary?

I explain with this simple example:

N = 10000;
a = ones(N); ##Original vector
b = zeros(N); ##New values to assign to a
w = collect(1:N)[rand(N).> 0.5]; ##random indices where a will be assigned the values of b

Now assignment with broadcasting:

 @benchmark $a[$w] .= $b[$w]

gives:

memory estimate:  3.81 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.728 ms (0.00% GC)
  median time:      1.961 ms (0.00% GC)
  mean time:        2.561 ms (12.68% GC)
  maximum time:     11.362 ms (70.52% GC)
  --------------
  samples:          1944
  evals/sample:     1

and a for loop assignment (without changing w):

 @benchmark for i in $w
    $a[i] = $b[i]
end

gives:

  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     880.235 ÎĽs (0.00% GC)
  median time:      918.504 ÎĽs (0.00% GC)
  mean time:        930.052 ÎĽs (0.00% GC)
  maximum time:     1.790 ms (0.00% GC)
  --------------
  samples:          5313
  evals/sample:     1

which is more than twice as fast and doesn’t allocate any memory.

  • Is this linked to the creation of a view of a when assigning using a broadcast?

  • Is there a way to make the assignment written as concisely as using dots and as fast as a for loop?

1 Like

You can make the dot assignment a bit faster by putting @inbounds before it.

1 Like

The right-hand side makes a copy here.

1 Like

Yes, sorry, I forgot to put the benchmark, but it’s actually even slower with a view:

@benchmark $a[$w] .= view($b, $w)

gives:

memory estimate:  7.62 MiB
  allocs estimate:  4
  --------------
  minimum time:     2.302 ms (0.00% GC)
  median time:      2.837 ms (0.00% GC)
  mean time:        4.060 ms (17.25% GC)
  maximum time:     13.462 ms (57.92% GC)
  --------------
  samples:          1228
  evals/sample:     1

This seems to confirm that it is the view creation that takes time (because the indexing vector is arbitrary?).

From what I see

julia> @btime $b[$w];
  9.685 ÎĽs (2 allocations: 39.39 KiB)

julia> @btime Base.setindex!($a, $(b[w]), $w);
  6.235 ÎĽs (0 allocations: 0 bytes)

So it’s computing b[w] that takes time. We may skip this and use a view instead.

julia> @btime for i in $w
          $a[i] = $b[i]
       end;
  7.330 ÎĽs (0 allocations: 0 bytes)

julia> @btime $a[$w] = @view $b[$w];
  10.459 ÎĽs (0 allocations: 0 bytes)

julia> @btime $a[$w] .= $b[$w];
  20.103 ÎĽs (2 allocations: 39.39 KiB)

The broadcasted assignment is twice as slow compared to the non-broadcasted version. This might be the same issue as https://github.com/JuliaLang/julia/issues/40962

Turning off bounds checking helps a bit

julia> @btime @inbounds for i in $w
          $a[i] = $b[i]
       end;
  6.951 ÎĽs (0 allocations: 0 bytes)

julia> @btime @inbounds $a[$w] = @view $b[$w];
  7.671 ÎĽs (0 allocations: 0 bytes)

julia> @btime @inbounds $a[$w] .= $b[$w];
  15.624 ÎĽs (2 allocations: 39.39 KiB)

With this the loop and the non-broadcasted assignment are close in performance.

3 Likes

Thank you for the suggestion, but the non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b) since the resulting vector has to be computed before the assignment.

In that case you probably want something like this:

julia> N=10000;

julia> a = ones(N);

julia> b = zeros(N);

julia> w = collect(1:N)[rand(N).> 0.5];

julia> function f!(a,b,w) # mutating
         for i in w
           @inbounds a[i] = b[i]
         end
       end
f! (generic function with 2 methods)

julia> @btime f!($a,$b,$w)
  2.901 ÎĽs (0 allocations: 0 bytes)

julia> function f(b,w) # non-mutating
         a = similar(b,size(w))
         j = 0
         for i in w
           j += 1
           @inbounds a[j] = b[i]
         end
         return a
       end
f (generic function with 1 method)

julia> @btime f($b,$w)
  3.029 ÎĽs (2 allocations: 39.02 KiB)

I guess it is possible to a broadcasted operation be lowered to something like that, but making that work in general is possibly very hard. Using views probably don’t help much either, since the view has to be discontinuous in memory, thus it has to be an array of pointers of the same size.

1 Like

Another problem here is that, the broadcasted version uses two indexsets, while non-broadcasted version uses only one:

@benchmark $a[$w] .= $b[$w]

which lowers roughly to:

#`w1` is the w in $a[$w] , `w2` is the w in $b[$w]
for i in 1:length(w1)
    a[w1[i]] = b[w2[i]]
end

So you have two w here (two indexsets w1,w2). To get value from b and set the value to a, we need to perform two loads from these two w to get the indices. One load is redundant since w1 == w2, however compiler doesn’t know it since when you pass the w1, w2 into the broadcasted function this information is lost. So compiler has to assume that w1 is generally not equal to w2. In contrast, the non-broadcasted only performs one load.

3 Likes

Thanks! But this solution is still subject to the problem I mentioned in my reply:

The non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b ) since the resulting vector has to be computed before the assignment.

Because the arguments are evaluated before being passed to the functions, this will create a new array (the result of the operations, like 2 .* b). This doesn’t happen when using a for loop, where the result of an operation is calculated and then gets assigned element-wise, avoiding allocations.

The idea is that you change the loop such that it evaluates what you want. You can pass a function to be applied element wise as well.

Sorry if I am not understanding you well, but do you mean writing a specific function for each different assignment involving different operations, or writing something general like this (or something else?)?

function set_test!(func, a, w, b...)
    func_arg = zeros(length(b))
    for i in w
        for f_a=1:length(func_arg)
            func_arg[f_a] = b[f_a][i]
        end
        a[i] = func(func_arg...)
    end
end

I tried set_test! and it is actually is quite slow.

I mean that if you want to compute a = 2*b, you just do it explicitly in the loop:

julia> function f(b,w) # non-mutating
                a = similar(b,size(w))
                j = 0
                for i in w
                  j += 1
                  @inbounds a[j] = 2*b[i]
                end
                return a
              end
f (generic function with 1 method)

And, if you want this function to be somewhat more generic, pass the function to be evaluated for each b:

julia> function f(func,b,w) # non-mutating
                a = similar(b,size(w))
                j = 0
                for i in w
                  j += 1
                  @inbounds a[j] = func(b[i])
                end
                return a
              end
f (generic function with 2 methods)

julia> g(x) = 2*x + 1
g (generic function with 1 method)

julia> f(g,b,w)
4987-element Vector{Float64}:
 1.0
 1.0
...

This can be progressively generalized to receive more than one vector, etc. But one thing is to obtain a fast implementation of what you want to compute specifically, other thing is to implement something that is very generic. These are two different goals, which are sometimes compatible, sometimes less so (the more generic, the harder is to keep the code performant - and the good thing about Julia is that you can just solve your problem very efficiently, without having to rely on some generic approach someone implemented in a library that should fit all needs).

Thank you. My initial idea was to find something as fast as a for loop implementation and as concise as the dot notation, so I actually was hoping for a generic approach that would have the best of both worlds (so that I don’t have to write a different function every time the number of arguments on the rhs of the assignment changes). I thought that this question was so simple (because the function would boil down to a for loop) that I was simply missing an obvious function or macro, but I am learning that it’s not case :grin:

Ok this seems to be working fine (not as fast as writing a for loop but a “good” compromise):

function set_awfb!(func::F, a, w, b::Vararg{T, N}) where {F, T, N}
    for i in w
        @inbounds a[i] = func(map(x -> x[i], b)...)
    end
end

which is a generic version of the function suggested by @lmiq and I used splatting with map (suggestion found here: Work around splatting to avoid unecessary allocations) and type declaration (suggestion found here: Splatting arguments causes ~30x slow down) to avoid allocations.

An idea of the performance with a simple addition:

N = 10000;
a = ones(N); b = rand(N); c = rand(N);
w = collect(1:N)[rand(N) .< 0.5];

@btime set_awfb!(+, $a, $w, $b, $c)
  6.540 ÎĽs (0 allocations: 0 bytes)

@btime for i in $w
           @inbounds $a[i] = $b[i] + $c[i]
       end
  5.402 ÎĽs (0 allocations: 0 bytes)

@btime $a[$w] .= $b[$w] .+ $c[$w];
  17.878 ÎĽs (4 allocations: 78.03 KiB)

So still slower, but a good improvement compared to broadcasting!

I think you should be able to write a macro that works

It’s slower because you need an additional @inbounds in the function:

function set_awfb!(func::F, a, w, b::Vararg{T, N}) where {F, T, N}
    for i in w
        @inbounds a[i] = func(map(x -> (@inbounds x[i]), b)...)
    end
end

And this proves what I have said before, the main cause of slowness comes from the redundant load. After removing the load (and bound checking), you restore the full performance.
I think you should just write a for-loop instead of designing these broken abstractions. It’s not quite easy for a normal user to do this. To do this generally, you either need to perform static analysis to decide whether two arrays use the same indexset (which is expensive and extremely hard), or you perform a runtime test, e.g., you perform a comparison of w1 and w2 which is just a cheap arithmetic, but then you need to generate optimistic codes depending on the comparison result at runtime, which is infeasible currently in Julia (we can only depend on types, not general values).
You can also use macros, but it’s just hack and not general. I don’t recommend you to do so, writing a loop is much more clear than using a macro.

1 Like