Normal vs broadcasted slice assignment

This must have been discussed a dozen times but I couldn’t find a thread about this precise issue:

using BenchmarkTools

f1(v, x) = v[1:length(x)] = x
f2(v, x) = v[1:length(x)] .= x

julia> @btime f1($(rand(1000)), $(rand(100)));
  10.268 ns (0 allocations: 0 bytes)

julia> @btime f2($(rand(1000)), $(rand(100)));
  25.415 ns (0 allocations: 0 bytes)

Is this expected? Does it have to do with unaliasing? I wonder what exactly is causing the slowdown, since there are 0 allocations in both cases.

No, I think this is just the difference between a highly specialized memcpy that hits the Vector’s memory directly and a hand-written for loop that works with all abstract arrays.
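To make the two code paths concrete, here is a rough sketch of the contrast. It is not what Base literally executes: `copy_fast!` and `copy_generic!` are made-up names, and the only assumption is that the five-argument `copyto!` on plain `Vector`s of a bits type ends up as a bulk memory move.

```julia
# Hypothetical helper names, for illustration only.

# Contiguous block copy between two Vectors. For element types like
# Float64 this is expected to become a single bulk memory move.
copy_fast!(v::Vector, x::Vector) = copyto!(v, 1, x, 1, length(x))

# Generic element-by-element loop that works for any AbstractVector,
# including views, at the cost of not being a plain memcpy.
function copy_generic!(v::AbstractVector, x::AbstractVector)
    for (i, j) in zip(eachindex(v), eachindex(x))
        @inbounds v[i] = x[j]
    end
    return v
end

v1 = zeros(1000); v2 = zeros(1000); x = rand(100)
copy_fast!(v1, x)
copy_generic!(view(v2, 1:length(x)), x)
```

Both calls leave the same data in the first 100 elements of the destination; only the route taken differs.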

Both of these should turn into memcpy, so IMO this is unexpected.

So the difference is that f2 turns into a copyto!(view(v, 1:100), x) while f1 turns into a setindex!. So the problem is just that we don’t have an optimized method for copying an Array into a view of another Array.
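One way to check which syntax maps to which call is to write out what each form lowers to. The sketch below assumes a recent Julia 1.x; `Base.dotview`, `Base.broadcasted`, and `Base.materialize!` are the internal functions that show up in the output of `Meta.@lower`, and they are not a stable public API.

```julia
x = rand(100)

# `v[1:length(x)] = x` is syntax for an indexed assignment:
v = zeros(1000)
setindex!(v, x, 1:length(x))

# `w[1:length(x)] .= x` is syntax for an in-place broadcast into a view.
# dotview creates the view, and materialize! writes the (identity)
# broadcast of x into it.
w = zeros(1000)
Base.materialize!(Base.dotview(w, 1:length(x)), Base.broadcasted(identity, x))
```

Running `Meta.@lower v[1:length(x)] .= x` in the REPL shows the same call chain without having to spell it out by hand.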

Another difference is that f2 returns a view of v, while f1 returns x. Changing both functions to return nothing improves the performance of f2 somewhat, although not enough to make up the difference.
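The return values can be checked directly. In this sketch `f1_nothing` and `f2_nothing` are hypothetical names for the variants that discard the result.

```julia
f1(v, x) = v[1:length(x)] = x
f2(v, x) = v[1:length(x)] .= x

v = zeros(1000); x = rand(100)
r1 = f1(v, x)   # an assignment expression evaluates to its right-hand side
r2 = f2(v, x)   # a broadcast assignment evaluates to its destination, a view of v

# Variants that return nothing, so neither has to construct a return value.
f1_nothing(v, x) = (v[1:length(x)] = x; nothing)
f2_nothing(v, x) = (v[1:length(x)] .= x; nothing)
```

These can be timed the same way as the originals, e.g. `@btime f2_nothing($v, $x)` with BenchmarkTools loaded.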

In another thread, Chris Elrod suggested that the differences arise from non-temporal stores, and that LoopVectorization provides a Julia equivalent.