Is there a way to make a broadcasted values assignment faster than a for loop when the subset of indexes on which the assignment is done is arbitrary?
I explain with this simple example:
N = 10000;
a = ones(N); ##Original vector
b = zeros(N); ##New values to assign to a
w = collect(1:N)[rand(N).> 0.5]; ##random indices where a will be assigned the values of b
Now assignment with broadcasting:
@benchmark $a[$w] .= $b[$w]
gives:
memory estimate: 3.81 MiB
allocs estimate: 2
--------------
minimum time: 1.728 ms (0.00% GC)
median time: 1.961 ms (0.00% GC)
mean time: 2.561 ms (12.68% GC)
maximum time: 11.362 ms (70.52% GC)
--------------
samples: 1944
evals/sample: 1
and a for loop assignment (without changing w):
@benchmark for i in $w
$a[i] = $b[i]
end
gives:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 880.235 ÎĽs (0.00% GC)
median time: 918.504 ÎĽs (0.00% GC)
mean time: 930.052 ÎĽs (0.00% GC)
maximum time: 1.790 ms (0.00% GC)
--------------
samples: 5313
evals/sample: 1
which is more than twice as fast and doesn’t allocate any memory.
Is this linked to the creation of a view of a when assigning using a broadcast?
Is there a way to make the assignment written as concisely as using dots and as fast as a for loop?
Thank you for the suggestion, but the non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b) since the resulting vector has to be computed before the assignment.
In that case you probably want something like this:
julia> N=10000;
julia> a = ones(N);
julia> b = zeros(N);
julia> w = collect(1:N)[rand(N).> 0.5];
julia> function f!(a,b,w) # mutating
for i in w
@inbounds a[i] = b[i]
end
end
f! (generic function with 2 methods)
julia> @btime f!($a,$b,$w)
2.901 ÎĽs (0 allocations: 0 bytes)
julia> function f(b,w) # non-mutating
a = similar(b,size(w))
j = 0
for i in w
j += 1
@inbounds a[j] = b[i]
end
return a
end
f (generic function with 1 method)
julia> @btime f($b,$w)
3.029 ÎĽs (2 allocations: 39.02 KiB)
I guess it is possible to a broadcasted operation be lowered to something like that, but making that work in general is possibly very hard. Using views probably don’t help much either, since the view has to be discontinuous in memory, thus it has to be an array of pointers of the same size.
Another problem here is that, the broadcasted version uses two indexsets, while non-broadcasted version uses only one:
@benchmark $a[$w] .= $b[$w]
which lowers roughly to:
#`w1` is the w in $a[$w] , `w2` is the w in $b[$w]
for i in 1:length(w1)
a[w1[i]] = b[w2[i]]
end
So you have two w here (two indexsets w1,w2). To get value from b and set the value to a, we need to perform two loads from these two w to get the indices. One load is redundant since w1 == w2, however compiler doesn’t know it since when you pass the w1, w2 into the broadcasted function this information is lost. So compiler has to assume that w1 is generally not equal to w2. In contrast, the non-broadcasted only performs one load.
Thanks! But this solution is still subject to the problem I mentioned in my reply:
The non-broadcasted version and setindex! become “problematic” as soon as the assignment involves an operation over vectors like a = 2 * b or a = b + c (or a = 2 .* b ) since the resulting vector has to be computed before the assignment.
Because the arguments are evaluated before being passed to the functions, this will create a new array (the result of the operations, like 2 .* b). This doesn’t happen when using a for loop, where the result of an operation is calculated and then gets assigned element-wise, avoiding allocations.
Sorry if I am not understanding you well, but do you mean writing a specific function for each different assignment involving different operations, or writing something general like this (or something else?)?
function set_test!(func, a, w, b...)
func_arg = zeros(length(b))
for i in w
for f_a=1:length(func_arg)
func_arg[f_a] = b[f_a][i]
end
a[i] = func(func_arg...)
end
end
I tried set_test! and it is actually is quite slow.
I mean that if you want to compute a = 2*b, you just do it explicitly in the loop:
julia> function f(b,w) # non-mutating
a = similar(b,size(w))
j = 0
for i in w
j += 1
@inbounds a[j] = 2*b[i]
end
return a
end
f (generic function with 1 method)
And, if you want this function to be somewhat more generic, pass the function to be evaluated for each b:
julia> function f(func,b,w) # non-mutating
a = similar(b,size(w))
j = 0
for i in w
j += 1
@inbounds a[j] = func(b[i])
end
return a
end
f (generic function with 2 methods)
julia> g(x) = 2*x + 1
g (generic function with 1 method)
julia> f(g,b,w)
4987-element Vector{Float64}:
1.0
1.0
...
This can be progressively generalized to receive more than one vector, etc. But one thing is to obtain a fast implementation of what you want to compute specifically, other thing is to implement something that is very generic. These are two different goals, which are sometimes compatible, sometimes less so (the more generic, the harder is to keep the code performant - and the good thing about Julia is that you can just solve your problem very efficiently, without having to rely on some generic approach someone implemented in a library that should fit all needs).
Thank you. My initial idea was to find something as fast as a for loop implementation and as concise as the dot notation, so I actually was hoping for a generic approach that would have the best of both worlds (so that I don’t have to write a different function every time the number of arguments on the rhs of the assignment changes). I thought that this question was so simple (because the function would boil down to a for loop) that I was simply missing an obvious function or macro, but I am learning that it’s not case
It’s slower because you need an additional @inbounds in the function:
function set_awfb!(func::F, a, w, b::Vararg{T, N}) where {F, T, N}
for i in w
@inbounds a[i] = func(map(x -> (@inbounds x[i]), b)...)
end
end
And this proves what I have said before, the main cause of slowness comes from the redundant load. After removing the load (and bound checking), you restore the full performance.
I think you should just write a for-loop instead of designing these broken abstractions. It’s not quite easy for a normal user to do this. To do this generally, you either need to perform static analysis to decide whether two arrays use the same indexset (which is expensive and extremely hard), or you perform a runtime test, e.g., you perform a comparison of w1 and w2 which is just a cheap arithmetic, but then you need to generate optimistic codes depending on the comparison result at runtime, which is infeasible currently in Julia (we can only depend on types, not general values).
You can also use macros, but it’s just hack and not general. I don’t recommend you to do so, writing a loop is much more clear than using a macro.