Is it worth trying to speed up setindex! for arrays of numbers where one array is overwritten with another?

Arrays of numbers (often with isbits element types) are fairly common in numerical work, e.g. Array{Float64}.

I’m thinking of an operation such as

julia> a = ones(1000, 1000);

julia> b = copy(a);

julia> @btime $a[:,:] = $b;
  1.877 ms (0 allocations: 0 bytes)

julia> @btime copyto!($a, $b);
  985.943 μs (0 allocations: 0 bytes)

# Bounds checking isn't what slows things down either. Perhaps it's vectorization?
julia> f(a, b) = @inbounds a[:,:] = b;

julia> @btime f($a, $b);
  1.870 ms (0 allocations: 0 bytes)

# broadcasted assignment into a view is faster, but still much slower than copyto!
julia> @btime $a[:,:] .= $b;
  1.572 ms (0 allocations: 0 bytes)

# direct in-place broadcasting is as fast as copyto!
julia> @btime $a .= $b;
  983.905 μs (0 allocations: 0 bytes)

The setindex! call in the first case produces the same result as the broadcasted versions, yet broadcasting is faster, and the direct `a .= b` is roughly twice as fast (about 0.98 ms versus 1.88 ms). The opposite also happens, e.g.

julia> @btime $a[1:size($a,1), 1:size($a,2)] = $b;
  1.873 ms (0 allocations: 0 bytes)

julia> @btime $a[1:size($a,1), 1:size($a,2)] .= $b;
  3.982 ms (0 allocations: 0 bytes)

The difference is that the view here is a SlowSubArray (range indices, so it falls back to Cartesian indexing), whereas in the former case it was a FastSubArray (colon indices, so it supports linear indexing).
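One way to check which kind of view each indexing expression produces is to query `IndexStyle`. This is only a minimal sketch on a smaller array than the benchmarks above, and the variable names are just for illustration:

```julia
a = ones(10, 10)

# The view that `a[:,:] .= b` writes into
v_colon = view(a, :, :)

# The view that `a[1:size(a,1), 1:size(a,2)] .= b` writes into
v_range = view(a, 1:size(a, 1), 1:size(a, 2))

# IndexLinear: the view can be traversed with a single flat index.
# IndexCartesian: every access goes through one index per dimension.
println(IndexStyle(v_colon))
println(IndexStyle(v_range))
```

As far as I can tell, the colon view reports IndexLinear() and the range view reports IndexCartesian(), even though both cover the whole parent array.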

Does it make sense to identify cases where a potentially slow operation can be replaced by a faster one, and to use the faster implementation instead? I’m not sure how generic this would be, but it might improve performance in common applications. Otherwise users have to remember which operation is faster in each scenario to get optimal performance.
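To make the idea concrete, here is a rough sketch of the kind of dispatch I have in mind. `assign!` is a hypothetical helper, not anything in Base, and it only handles plain `Array`s with colon or unit-range indices:

```julia
# Hypothetical helper (not part of Base): write `src` into `dest[inds...]`,
# switching to copyto! when the indices span all of `dest`.
function assign!(dest::Array, src::Array, inds...)
    # True only if there is one index per dimension and each one is either
    # a Colon or a unit range equal to the full axis of that dimension.
    covers_all = length(inds) == ndims(dest) &&
        all(1:ndims(dest)) do d
            i = inds[d]
            i isa Colon || (i isa AbstractUnitRange{Int} && i == axes(dest, d))
        end
    if covers_all && axes(dest) == axes(src)
        copyto!(dest, src)    # bulk copy of the whole array
    else
        dest[inds...] = src   # generic setindex! fallback
    end
    return dest
end

a = zeros(4, 3)
b = rand(4, 3)
assign!(a, b, :, :)        # takes the copyto! path
assign!(a, b, 1:4, 1:3)    # also takes the copyto! path
```

The check on the indices runs at runtime here. Whether something like this could be done generically, and cheaply enough inside setindex! itself, is exactly what I'm unsure about.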

Ideally all these operations would behave identically, but I’m not sure if it’s easy to get there.

I had posted an issue about this, which seemingly didn’t get much attention, so I thought I’d discuss it here.