Vector of matrices vs. multidimensional arrays

I have to deal with 3 dimensional structures, I was hesitating between vectors of vectors of vectors, vectors of matrices or tridimensional arrays. For performance comparison, I wrote this small test:

julia> L = M = N = Int(5e2);

julia> vec_vec_vec = Array{Array{Vector}}(L);

julia> for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;

julia> vec_mat = Array{Matrix}(L);

julia> fill!(vec_mat, rand(M, N));

julia> arr = rand(L, M, N);

julia> tic(); for i = 1:L for j = 1:M for k = 1:N vec_vec_vec[i][j][k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 28.747643437 seconds
28.747643437

julia> tic(); for i = 1:L for j = 1:M for k = 1:N vec_mat[i][j,k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 25.837854348 seconds
25.837854348

julia> tic(); for i = 1:L for j = 1:M for k = 1:N arr[i,j,k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 100.412620007 seconds
100.412620007

Can someone explain to me why it is so much more expensive to use a 3d array rather than a vector of matrices, and why the vec_vec_vec and vec_mat solutions are similar? I understand that this example just looks at loops over the whole structures, and even in my case I need to perform matrix-related operations so vec_vec_vec is not very suitable. The ideal solution for me would have been to use the 3d array, but this wouldn’t be reasonable with such performances…

Many thanks

2 Likes

One reason your arr version is slow is that you’re accessing the indices in exactly the least cache-friendly way possible (this may be different than what you’re used to because Julia/Fortran/MATLAB use a column-major convention while C/NumPy use row major). Try reversing the order of your loops so that the first index changes in the innermost loop. See also https://docs.julialang.org/en/stable/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1

Edit: this is relevant, but not nearly as important as the reminder not to use non-constant global variables (see below).

3 Likes

Before anything else, read the Performance Tips–you shouldn’t work with non-constant variables in global scope.

using BenchmarkTools

# declare global-scope variables as constants, otherwise the compiler can't
# stably infer the variable's type
const L = 500 # integer literals in Julia don't require conversion
const v3 = [[rand(L) for i = 1:L] for j = 1:L]
const vec_mat = Array{Matrix}(L);
const arr = rand(L, L, L);

fill!(vec_mat, rand(L, L));

@btime v3 .+= 1.0
@btime vec_mat .+= 1.0
@btime arr .+= 1.0
  5.767 s (250500 allocations: 993.80 MiB)
  5.411 s (1000 allocations: 953.71 MiB)
  86.659 ms (0 allocations: 0 bytes)
3 Likes
function test()
L = M = N = Int(5e2);

vec_vec_vec = Array{Array{Vector}}(L);

for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;

vec_mat = Array{Matrix}(L);

fill!(vec_mat, rand(M, N));

arr = rand(L, M, N);

@time  for i = 1:L for j = 1:M for k = 1:N vec_vec_vec[i][j][k] += 1; end; end; end; 

@time  for i = 1:L for j = 1:M for k = 1:N vec_mat[i][j,k] += 1; end; end; end; 

@time  for k = 1:N for j = 1:M for i = 1:L arr[i,j,k] += 1; end; end; end; 

end
test()

results in

julia> include("test/playground.jl")
┌ Warning: `Array{T}(m::Int) where T` is deprecated, use `Array{T}(uninitialized, m)` instead.
│   caller = test() at playground.jl:4
â”” @ Main playground.jl:4
┌ Warning: `Array{T}(m::Int) where T` is deprecated, use `Array{T}(uninitialized, m)` instead.
│   caller = test() at playground.jl:8
â”” @ Main playground.jl:8
 13.718289 seconds (250.00 M allocations: 3.725 GiB, 3.44% gc time)
 11.875066 seconds (250.00 M allocations: 3.725 GiB, 0.58% gc time)
  0.193893 seconds

How’s that?

2 Likes

And another reason is that you are timing your function in the global scope, so every single access to arr, vec_mat, or vec_vec_vec or the sizes L, M, or N is essentially performing a global lookup. These global accesses make it harder to generate efficient code, and the difference can be significant. Performing global lookup inside your innermost loop is essentially the worst possible case.

For example, with no changes at all, here’s your arr version just wrapped in a function:

         L = size(arr, 1)
         M = size(arr, 2)
         N = size(arr, 3)
         for i = 1:L for j = 1:M for k = 1:N arr[i,j,k] += 1; end; end; end
       end
f (generic function with 1 method)

julia> @time f(arr)
  2.066944 seconds (3.50 k allocations: 175.565 KiB)

julia> @time f(arr)
  2.058089 seconds (4 allocations: 160 bytes)

Note that the second call is faster because the first call included Julia’s compilation overhead, but both versions are 50 times faster than your original code.

We can do even better by accessing the elements in the memory order:

julia> function f2(arr)
         L = size(arr, 1)
         M = size(arr, 2)
         N = size(arr, 3)
         for k = 1:N for j = 1:M for i=1:L arr[i,j,k] += 1; end; end; end
       end
f2 (generic function with 1 method)

julia> @time f2(arr)
  0.161044 seconds (3.50 k allocations: 175.644 KiB)

julia> @time f2(arr)
  0.144406 seconds (4 allocations: 160 bytes)

that’s almost 1000 times faster than the original version.

Remember to do these things when profiling is hard, which is why we have BenchmarkTools.jl. Here’s a much more representative benchmark:

julia> @benchmark for k = 1:$N for j = 1:$M for i = 1:$L; $arr[i,j,k] += 1; end; end; end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     115.313 ms (0.00% GC)
  median time:      115.966 ms (0.00% GC)
  mean time:        116.025 ms (0.00% GC)
  maximum time:     118.212 ms (0.00% GC)
  --------------
  samples:          44
  evals/sample:     1

Note that this is about the same time we measured from the hand-written f2 function, but BenchmarkTools actually performs multiple trials to get a statistically meaningful estimate of the function’s performance.

Try using BenchmarkTools to measure the speed of each of your versions. You should find that they’re all quite fast in Julia :smile:

3 Likes

Thank you very much to the three of you, I understood many things. So I should be fine with using 3d arrays, but then if I have to perform matrix operations on such an array, I should ensure that I slice it according to the last two dimensions, i.e. operating on arr[i,:,:] should be much more efficient than operating on arr[:,:,k]…? (And the same goes for arr[i,j,:] and arr[:,j,k].)

I will time the different versions in a proper manner anyway.

Close (but exactly backwards :wink:). In a normal Julia array, the values arr[1, j, k] and arr[2, j, k] are stored next to each other in memory, so it’s most efficient to slice along the first dimension, since all those elements live next to each other. Likewise, slicing along the first N dimensions will be more efficient than the last N dimensions.

By the way, if you find it hard to remember which order to use when looping (I do), then I have good news: for looping over all the elements of an array, there’s an easier, better way. Just do:

for i in eachindex(arr)
  arr[i] += 1
end

eachindex() is a Julia function that will produce the “right” kind and order of indices for a particular array-like object of any number of dimensions. For example, for a normal Array it will produce indices in memory order, but it will also do the most efficient thing for a sparse array, some kind of weird row-major array, or lots of other array-like types. For example:

julia> @benchmark for i in eachindex($arr); $arr[i] += 1; end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     120.798 ms (0.00% GC)
  median time:      121.303 ms (0.00% GC)
  mean time:        121.723 ms (0.00% GC)
  maximum time:     128.112 ms (0.00% GC)
  --------------
  samples:          42
  evals/sample:     1

just as fast as the correctly-ordered nested loops, but with no need to write them out.

5 Likes

Haha, thanks for your tolerance, I obviously did not think long enough before typing. Great additional information. Thank you for everything.

2 Likes

Sorry to dig up an old thread, but since this was referenced in another post I figured it was worth correcting the record.

The entire difference you’re observing between the vector-of-matrix approach and the 3D array approach here is due to your vector having a non-concrete element type. Your vec_mat is of type Array{Matrix}, but Matrix is not a concrete type. You need Array{Matrix{Float64}}. Likewise for your vector-of-vectors-of-vectors approach, you need an Array{Vector{Vector{Float64}}} not Array{Vector{Vector}}.

Switching to concretely typed structures makes the 3D array and vector-of-matrix approaches perform almost exactly the same:

julia> function f(x::Array{T, 3}) where T
         x .+= 1
       end
f (generic function with 2 methods)

julia> function f(x::Array{Matrix{T}}) where T
         for matrix in x
           matrix .+= 1
         end
       end
f (generic function with 2 methods)

julia> x_array = zeros(N, N, N);

julia> x_vec_mat = [zeros(N, N) for i in 1:N];

julia> using BenchmarkTools

julia> @btime f($x_array);
  609.838 ÎĽs (0 allocations: 0 bytes)

julia> @btime f($x_vec_mat);
  505.012 ÎĽs (0 allocations: 0 bytes)
4 Likes

True:

function test()
	L = M = N = Int(5e2);
	vec_vec_vec = Array{Vector{Vector{Float64}}}(undef, L);
	for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;
	vec_mat = Array{Matrix{Float64}}(undef, L);
	fill!(vec_mat, rand(M, N));
	arr = rand(L, M, N);
	@time  for i = 1:L for j = 1:M for k = 1:N vec_vec_vec[i][j][k] += 1; end; end; end; 
	@time  for i = 1:L for j = 1:M for k = 1:N vec_mat[i][j,k] += 1; end; end; end; 
	@time  for k = 1:N for j = 1:M for i = 1:L arr[i,j,k] += 1; end; end; end; 
end
test()

gives

julia> include("playground.jl");                                                        
  0.203685 seconds                                                                      
  0.224881 seconds                                                                      
  0.147060 seconds    

with

julia> versioninfo()                                                                    
Julia Version 1.2.0-DEV.158                                                             
Commit 7772486a69 (2019-01-12 19:21 UTC)                                                
Platform Info:                                                                          
  OS: Windows (x86_64-w64-mingw32)                                                      
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz                                         
  WORD_SIZE: 64                                                                         
  LIBM: libopenlibm                                                                     
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)      
2 Likes

Sorry to add just one more small thing to this, but instead of
for i=1:L for j=1:M for k=1:N … end; end; end
you can write
for i=1:L, j=1:M, k=1:N … end
which is shorter and easier to read.

2 Likes

You can rearrange the indexing to optimal memory order for the vector-of-matrices version to speed it up. Also, @inbounds helps a lot here:

function test()
	L = M = N = Int(5e2);
	vec_vec_vec = Array{Vector{Vector{Float64}}}(undef, L);
	for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;
	vec_mat = Array{Matrix{Float64}}(undef, L);
	fill!(vec_mat, rand(M, N));
	arr = rand(L, M, N);
	@time @inbounds for i = 1:L, j = 1:M, k = 1:N vec_vec_vec[i][j][k] += 1 end
	@time @inbounds for i = 1:L, k = 1:N, j = 1:M vec_mat[i][j,k] += 1 end
	@time @inbounds for k = 1:N, j = 1:M, i = 1:L arr[i,j,k] += 1 end
end
test()

Gives me:

  0.124387 seconds
  0.059382 seconds
  0.097464 seconds

There’s quite a difference still in performance between the three versions IMO (the fastest being twice as fast as the slowest), I haven’t looked into what’s causing that. I’d guess that either some versions are being vectorized (SIMD) more efficiently, or it has to do with memory accesses and caching.


Edit: BenchmarkTools gives similar timings:

julia> @btime @inbounds for i = 1:$L, j = 1:$M, k = 1:$N; $vec_vec_vec[i][j][k] += 1; end
  111.676 ms (0 allocations: 0 bytes)

julia> @btime @inbounds for i = 1:$L, k = 1:$N, j = 1:$M; $vec_mat[i][j,k] += 1; end
  58.472 ms (0 allocations: 0 bytes)

julia> @btime @inbounds for k = 1:$N, j = 1:$M, i = 1:$L; $arr[i,j,k] += 1; end
  76.587 ms (0 allocations: 0 bytes)
3 Likes

Will give you the same matrix in all the slots for vec_mat. Probably explains the timings. There is no reason why the last one should not be the fastest.

4 Likes

Oh, good point, I missed that! Indeed, fixing that, the last one is now fastest:

  113.182 ms (0 allocations: 0 bytes)
  94.457 ms (0 allocations: 0 bytes)
  79.743 ms (0 allocations: 0 bytes)