Hello there,

I’m trying to figure out how to do matrix operations for 2D slices of a 3D array whose shape is (3,3,N) where N is a large number (e.g. N=10^6). I’d like to write a vectorized code of C=A+B’ using loop fusion as explained in https://julialang.org/blog/2017/01/moredots for the first 1024 slices of 2D arrays (:, :, 1:1024). However, I feel like I’m missing something as I always get unexpected allocations compared to the manual for-loops. Below I test 3 different versions that all do the same calculation:

```
const N = 1024
const NX = 3
function transpose_3darray_2dslice_manualloop!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}
for n in 1:N
for i in 1:NX
for j in 1:NX
v3[j,i,n] = v1[j,i,n] + v2[i,j,n]
end
end
end
end
function transpose_3darray_2dslice_views!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}
for n in 1:N
@views v3[:,:,n] = v1[:,:,n] .+ transpose(v2[:,:,n])
end
end
function transpose_3darray_2dslice_viewsdot!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}
for n in 1:N
@views v3[:,:,n] .= v1[:,:,n] .+ transpose(v2[:,:,n])
end
end
Random.seed!(1234);
v1 = rand(NX,NX,N*1000);
v2 = rand(NX,NX,N*1000);
v3 = similar(v1);
println("###### 3D array 2D slice transpose:")
println("### manual for-loop")
@localbtime transpose_3darray_2dslice_manualloop!(v3,v1,v2)
v = copy(v3)
println("### views, assign")
@localbtime transpose_3darray_2dslice_views!(v3,v1,v2)
if (v3!=v) error("results are not consistent!") end
println("### views, dot assign")
@localbtime transpose_3darray_2dslice_viewsdot!(v3,v1,v2)
if (v3!=v) error("results are not consistent!") end
```

And the corresponding benchmark (using https://github.com/rdeits/LocalScopeBenchmarks.jl) looks like the following:

```
###### 3D array 2D slice transpose:
### manual for-loop
14.683 μs (0 allocations: 0 bytes)
### views, assign
84.908 μs (3072 allocations: 288.00 KiB)
### views, dot assign
49.312 μs (3072 allocations: 192.00 KiB)
```

My question is: why do I still allocate so much memory even using `@views`

? I would have thought that with `@views`

and the in-place assignment `.=`

I could save lots of unnecessary allocations… Is it due to the SubArray objects (cf. Why is a manual in-place addition so much faster than += and .+= on range-indexed arrays?)? Is this the optimal version of a vectorized code (3x slower seems a bit frustrating…)? or is there another way to write a vectorized code that has a speed comparable to (say within a factor of 2x) that of a manual for-loop?

In my research, I routinely encounter similar data structures (3D array representing an array of matrices) and it would be very convenient to be able to do matrix operations (e.g. transpose, multiplication, elementwise operations, etc.) with vectorized codes that run as fast as manual loops! Many thanks!