Efficient vectorized matrix operations for 2D matrix slices of a 3D array?

Hello there,

I’m trying to figure out how to do matrix operations on 2D slices of a 3D array of shape (3,3,N), where N is large (e.g. N=10^6). I’d like to write vectorized code for C = A + B’ using loop fusion, as explained in https://julialang.org/blog/2017/01/moredots, applied to the first 1024 2D slices ((:, :, 1:1024)). However, I feel like I’m missing something, as I always get unexpected allocations compared to manual for-loops. Below I test three different versions that all do the same calculation:

using Random, LocalScopeBenchmarks

const N = 1024
const NX = 3

function transpose_3darray_2dslice_manualloop!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}    
    for n in 1:N
        for i in 1:NX
            for j in 1:NX
                v3[j,i,n] = v1[j,i,n] + v2[i,j,n]
            end
        end
    end
end        
    
function transpose_3darray_2dslice_views!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}
    for n in 1:N
        @views v3[:,:,n] = v1[:,:,n] .+ transpose(v2[:,:,n])
    end
end

function transpose_3darray_2dslice_viewsdot!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}   
    for n in 1:N
        @views v3[:,:,n] .= v1[:,:,n] .+ transpose(v2[:,:,n])
    end
end

Random.seed!(1234); 
v1 = rand(NX,NX,N*1000); 
v2 = rand(NX,NX,N*1000); 
v3 = similar(v1);

println("###### 3D array 2D slice transpose:")
println("### manual for-loop")
@localbtime transpose_3darray_2dslice_manualloop!(v3,v1,v2)
v = copy(v3)

println("### views, assign")
@localbtime transpose_3darray_2dslice_views!(v3,v1,v2)
if (v3!=v) error("results are not consistent!") end

println("### views, dot assign")
@localbtime transpose_3darray_2dslice_viewsdot!(v3,v1,v2)
if (v3!=v) error("results are not consistent!") end

And the corresponding benchmark (using https://github.com/rdeits/LocalScopeBenchmarks.jl) looks like the following:

###### 3D array 2D slice transpose:
### manual for-loop
  14.683 μs (0 allocations: 0 bytes)
### views, assign
  84.908 μs (3072 allocations: 288.00 KiB)
### views, dot assign
  49.312 μs (3072 allocations: 192.00 KiB)

My question is: why do I still allocate so much memory even with @views? I would have thought that @views plus the in-place assignment .= would avoid most of the unnecessary allocations… Is it due to the SubArray objects (cf. Why is a manual in-place addition so much faster than += and .+= on range-indexed arrays?)? Is this already the optimal vectorized version (3x slower than the manual loop is a bit frustrating…)? Or is there another way to write vectorized code whose speed is comparable to (say, within a factor of 2x of) the manual for-loop?

In my research I routinely encounter similar data structures (a 3D array representing an array of matrices), and it would be very convenient to be able to do matrix operations (e.g. transpose, multiplication, elementwise operations) with vectorized code that runs as fast as manual loops! Many thanks!

Since you’re working with such small slices, the cost of creating all those views (and doing the associated bookkeeping on transposes) is hard to amortize. Adding @inbounds speeds up the expanded loop even further, but if you can use StaticArrays, I think they give you the syntax you’re looking for:

using LinearAlgebra, StaticArrays, BenchmarkTools, Random
const N, NX = 1024, 3

function f1!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}    
    for n in 1:N, i in 1:NX, j in 1:NX
        v3[j,i,n] = v1[j,i,n] + v2[i,j,n]
    end
end     

function f2!(v3::Array{T,3}, v1::Array{T,3}, v2::Array{T,3}) where {T}    
   @inbounds for n in 1:N, i in 1:NX, j in 1:NX
       v3[j,i,n] = v1[j,i,n] + v2[i,j,n]
   end
end

function f3!(v3::S, v1::S, v2::S) where {S<:Vector{<:StaticArray}}
    @inbounds for n in 1:N
        v3[n] = v1[n] + transpose(v2[n])
    end
end

Random.seed!(1234); 
v1 = rand(NX,NX,N*1000); 
v2 = rand(NX,NX,N*1000); 
v3 = similar(v1);

s1 = [@SMatrix(rand(3,3)) for v in 1:1000N];
s2 = [@SMatrix(rand(3,3)) for v in 1:1000N];
s3 = similar(s2);

@btime f1!($v3, $v1, $v2)   # 17.172 μs (0 allocations: 0 bytes)
@btime f2!($v3, $v1, $v2)   # 5.536 μs (0 allocations: 0 bytes)
@btime f3!($s3, $s1, $s2)   # 3.903 μs (0 allocations: 0 bytes)

Thanks for the instant reply and the cool demonstration of StaticArrays! I get similar results with your code. However, I’m a bit confused about its applicability. I thought StaticArrays were stack-allocated and therefore subject to a limited stack-size quota. Typing ulimit -a on my MacBook suggests that my stack space is 8 MB (consistent with the fact that I get a segfault if I declare double a[1100000] in a simple C program). In your code, you allocate 1024000*9 Float64 for s1 alone! Why does this still work and not segfault? A related question: since s1 is an array of StaticArrays, and arrays are allocated on the heap, where exactly is s1 located? Can I safely use it up to the memory limit (say, ~16 GB on my laptop)?

Good question.

StaticArrays can be stored on the stack, but since they’re isbits they can also be stored inline within an array. This means that the array [@SVector[1,2], @SVector[3,4]] is stored in memory simply as the four Ints packed together. This is in contrast to [[1,2], [3,4]], which stores two pointers to two separately heap-allocated arrays.
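This is easy to check from the REPL; a quick sketch (any recent Julia plus StaticArrays should do):

```julia
using StaticArrays

a = [@SVector[1, 2], @SVector[3, 4]]   # Vector{SVector{2,Int64}}
isbitstype(eltype(a))                  # true: elements live inline in a's buffer
sizeof(a)                              # 32: four Int64s packed together, no pointers

b = [[1, 2], [3, 4]]                   # Vector{Vector{Int64}}
isbitstype(eltype(b))                  # false: b stores pointers to two heap arrays
```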


Thanks for the explanation. So in this case the entire s1 is located on the heap, which has no 8 MB limit, right? The remaining issue is that I usually obtain s1 from HDF5 files where the data is stored as a 3D array (e.g. s1 = h5read("data_3darray.h5", "group")). Therefore I need to convert this 3x3x1000000 3D array into a vector of StaticArrays as suggested above (e.g. m1 = [MMatrix{3,3}(v1[:,:,n]) for n in 1:1000N]), and this overhead is a daunting 0.310317 seconds (2.05 M allocations: 242.188 MiB)… Is there a faster way to do this conversion?

This can be done without copying
http://juliaarrays.github.io/StaticArrays.jl/latest/pages/api.html#Arrays-of-static-arrays-1

Worked like a charm! For reference, the syntax is s1 = reinterpret(SMatrix{3,3,Float64,9}, reshape(v1,9,1000000)). It doesn’t seem to be compatible with the mutable static matrix though (it fails if I replace SMatrix with MMatrix in the above line), which is a pity… but I guess for an array of matrices that needs to be updated several times, this initialization overhead should be relatively small.
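To illustrate why the reinterpret route is attractive, here is a minimal sketch (with small sizes instead of 10^6; note that on recent Julia versions reinterpreting a 9×M matrix to SMatrix elements yields a 1×M array, so vec is used to flatten it):

```julia
using StaticArrays

v1 = rand(3, 3, 4)   # stand-in for the HDF5-loaded 3D array
s1 = vec(reinterpret(SMatrix{3,3,Float64,9}, reshape(v1, 9, :)))

s1[1] == v1[:, :, 1]   # true: same values...
v1[1, 1, 2] = -1.0
s1[2][1, 1] == -1.0    # ...and the same memory: no copy was made
```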

I asked a question a while ago, which has some similar themes to your question. You might find some of the answers I got to be useful.


Thanks for the reference. It seems that my best option is to initialize a large array of MMatrix m = zeros(MMatrix{3,3,Float64,9},N) and then assign each element with a for-loop.
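One caveat with that plan: zeros(MMatrix{3,3,Float64,9}, N) is implemented via fill!, and since MMatrix is mutable, every entry of the result would alias the same object. A comprehension sidesteps this; a minimal sketch with a small N for illustration:

```julia
using StaticArrays

v1 = rand(3, 3, 8)   # stand-in for the HDF5 data
m  = [zero(MMatrix{3,3,Float64,9}) for _ in axes(v1, 3)]   # distinct objects

for n in axes(v1, 3)
    @views m[n] .= v1[:, :, n]   # copy each slice into its preallocated MMatrix
end

m[3] == v1[:, :, 3]   # true
m[1] === m[2]         # false: unlike zeros(...), no aliasing between entries
```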

I would have opened a new thread for this. Could you ensure that your code is runnable? Which package is h5write from, for example?

As generic advice, have you checked the functions for type stability (e.g. @code_warntype read1())? I kind of expect h5read to be type-unstable.
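If h5read is indeed type-unstable, a type assertion (or a function barrier) at the call site usually recovers full speed. A hedged sketch, where the path and dataset name are placeholders and to_static/load_static are made-up names:

```julia
using HDF5, StaticArrays   # HDF5.jl provides h5read

# Kernel with a concretely typed argument: the compiler can specialize it fully.
function to_static(data::Array{Float64,3})
    vec(reinterpret(SMatrix{3,3,Float64,9}, reshape(data, 9, :)))
end

function load_static(path)
    # The ::Array{Float64,3} assertion pins down h5read's otherwise
    # uninferrable return type before any hot loop runs.
    data = h5read(path, "group")::Array{Float64,3}
    to_static(data)
end
```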

Just opened a new topic! Thanks!

Note that these will (afaik) not be stored inline.
I’d use the SMatrices. Note that while an SMatrix is immutable, so is a Float64, yet you can still change the contents of an Array holding either of them.