Vector of matrices vs. multidimensional arrays


#1

I have to deal with 3-dimensional structures, and I was hesitating between vectors of vectors of vectors, vectors of matrices, and 3-dimensional arrays. For a performance comparison, I wrote this small test:

julia> L = M = N = Int(5e2);

julia> vec_vec_vec = Array{Array{Vector}}(L);

julia> for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;

julia> vec_mat = Array{Matrix}(L);

julia> fill!(vec_mat, rand(M, N));

julia> arr = rand(L, M, N);

julia> tic(); for i = 1:L for j = 1:M for k = 1:N vec_vec_vec[i][j][k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 28.747643437 seconds
28.747643437

julia> tic(); for i = 1:L for j = 1:M for k = 1:N vec_mat[i][j,k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 25.837854348 seconds
25.837854348

julia> tic(); for i = 1:L for j = 1:M for k = 1:N arr[i,j,k] += 1; end; end; end; t = toc(); println(t)
elapsed time: 100.412620007 seconds
100.412620007

Can someone explain to me why it is so much more expensive to use a 3-d array rather than a vector of matrices, and why the vec_vec_vec and vec_mat solutions perform similarly? I understand that this example only loops over the whole structures; in my case I also need to perform matrix-related operations, so vec_vec_vec is not very suitable. The ideal solution for me would have been the 3-d array, but that wouldn’t be reasonable with such performance…

Many thanks


#2

One reason your arr version is slow is that you’re accessing the indices in exactly the least cache-friendly way possible (this may be different from what you’re used to, because Julia, Fortran, and MATLAB use a column-major convention while C and NumPy are row-major). Try reversing the order of your loops so that the first index changes in the innermost loop. See also https://docs.julialang.org/en/stable/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1
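To make the column-major point concrete, here is a small sketch (written in current Julia syntax rather than the 0.6-era syntax used elsewhere in this thread): filling a 2×2×2 array from the sequence `1:8` shows that elements adjacent in memory differ in the *first* index.

```julia
# Column-major layout demo. In Julia, the first index varies fastest
# in memory, so consecutive memory elements differ in `i`, not `k`.
A = reshape(collect(1:8), 2, 2, 2)

A[1, 1, 1]  # 1
A[2, 1, 1]  # 2 -- the next element in memory changes the FIRST index
A[1, 2, 1]  # 3 -- then the second index
A[1, 1, 2]  # 5 -- the last index changes slowest

# Equivalently, the strides (in elements) grow from left to right:
strides(A)  # (1, 2, 4)
```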

Edit: this is relevant, but not nearly as important as the reminder not to use non-constant global variables (see below).


#3

Before anything else, read the Performance Tips: you shouldn’t work with non-constant variables in global scope.

using BenchmarkTools

# declare global-scope variables as constants, otherwise the compiler can't
# stably infer the variable's type
const L = 500 # integer literals in Julia don't require conversion
const v3 = [[rand(L) for i = 1:L] for j = 1:L]
const vec_mat = Array{Matrix}(L);
const arr = rand(L, L, L);

fill!(vec_mat, rand(L, L));

@btime v3 .+= 1.0
@btime vec_mat .+= 1.0
@btime arr .+= 1.0
  5.767 s (250500 allocations: 993.80 MiB)
  5.411 s (1000 allocations: 953.71 MiB)
  86.659 ms (0 allocations: 0 bytes)
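Part of the allocation difference above comes from element types, not only from global scope: `Array{Matrix}` is an *abstractly*-typed container, so the compiler cannot generate tight code for its contents. A sketch of the distinction, in current syntax (the variable names are just illustrative):

```julia
# Abstract element type: `Matrix` without parameters is `Matrix{T} where T`,
# so every element access returns a value of unknown concrete type.
abstract_elems = Matrix[zeros(2, 2), ones(2, 2)]

# Concrete element type: inferred as Vector{Matrix{Float64}}, so the
# compiler knows exactly what every element is.
concrete_elems = [zeros(2, 2) for _ in 1:2]

isconcretetype(eltype(abstract_elems))  # false
isconcretetype(eltype(concrete_elems))  # true
```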

#4

function test()
L = M = N = Int(5e2);

vec_vec_vec = Array{Array{Vector}}(L);

for i = 1:L vec_vec_vec[i] = [zeros(N) for j = 1:M] end;

vec_mat = Array{Matrix}(L);

fill!(vec_mat, rand(M, N));

arr = rand(L, M, N);

@time  for i = 1:L for j = 1:M for k = 1:N vec_vec_vec[i][j][k] += 1; end; end; end; 

@time  for i = 1:L for j = 1:M for k = 1:N vec_mat[i][j,k] += 1; end; end; end; 

@time  for k = 1:N for j = 1:M for i = 1:L arr[i,j,k] += 1; end; end; end; 

end
test()

results in

julia> include("test/playground.jl")
┌ Warning: `Array{T}(m::Int) where T` is deprecated, use `Array{T}(uninitialized, m)` instead.
│   caller = test() at playground.jl:4
└ @ Main playground.jl:4
┌ Warning: `Array{T}(m::Int) where T` is deprecated, use `Array{T}(uninitialized, m)` instead.
│   caller = test() at playground.jl:8
└ @ Main playground.jl:8
 13.718289 seconds (250.00 M allocations: 3.725 GiB, 3.44% gc time)
 11.875066 seconds (250.00 M allocations: 3.725 GiB, 0.58% gc time)
  0.193893 seconds

How’s that?


#5

And another reason is that you are timing your code in the global scope, so every single access to arr, vec_mat, or vec_vec_vec, or to the sizes L, M, and N, performs a global lookup. These global accesses make it harder to generate efficient code, and the difference can be significant; a global lookup inside your innermost loop is essentially the worst possible case.

For example, with no changes at all, here’s your arr version just wrapped in a function:

julia> function f(arr)
         L = size(arr, 1)
         M = size(arr, 2)
         N = size(arr, 3)
         for i = 1:L for j = 1:M for k = 1:N arr[i,j,k] += 1; end; end; end
       end
f (generic function with 1 method)

julia> @time f(arr)
  2.066944 seconds (3.50 k allocations: 175.565 KiB)

julia> @time f(arr)
  2.058089 seconds (4 allocations: 160 bytes)

Note that the second call is faster because the first call included Julia’s compilation overhead, but both calls are about 50 times faster than your original code.

We can do even better by accessing the elements in the memory order:

julia> function f2(arr)
         L = size(arr, 1)
         M = size(arr, 2)
         N = size(arr, 3)
         for k = 1:N for j = 1:M for i=1:L arr[i,j,k] += 1; end; end; end
       end
f2 (generic function with 1 method)

julia> @time f2(arr)
  0.161044 seconds (3.50 k allocations: 175.644 KiB)

julia> @time f2(arr)
  0.144406 seconds (4 allocations: 160 bytes)

That’s almost 1000 times faster than the original version.

Remembering to do all of these things when benchmarking is hard, which is why we have BenchmarkTools.jl. Here’s a much more representative benchmark:

julia> @benchmark for k = 1:$N for j = 1:$M for i = 1:$L; $arr[i,j,k] += 1; end; end; end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     115.313 ms (0.00% GC)
  median time:      115.966 ms (0.00% GC)
  mean time:        116.025 ms (0.00% GC)
  maximum time:     118.212 ms (0.00% GC)
  --------------
  samples:          44
  evals/sample:     1

Note that this is about the same time we measured from the hand-written f2 function, but BenchmarkTools actually performs multiple trials to get a statistically meaningful estimate of the function’s performance.

Try using BenchmarkTools to measure the speed of each of your versions. You should find that they’re all quite fast in Julia :smile:
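A minimal pattern for doing that (assuming BenchmarkTools is installed; `bump!` is just an illustrative name): interpolate variables with `$` so the benchmark itself does not pay the global-lookup cost.

```julia
using BenchmarkTools

arr = rand(100, 100, 100)  # smaller than the original 500^3, just for speed

# In-place increment, iterating in memory order.
bump!(a) = (for i in eachindex(a); a[i] += 1; end; a)

# `$arr` splices the value into the benchmarked expression, so `arr` is
# not looked up as a non-constant global on every evaluation.
@btime bump!($arr);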


#6

Thank you very much to the three of you; I’ve learned a lot. So I should be fine using 3-d arrays, but if I then have to perform matrix operations on such an array, I should make sure to slice it along the last two dimensions, i.e. operating on arr[i,:,:] should be much more efficient than operating on arr[:,:,k]…? (And the same goes for arr[i,j,:] and arr[:,j,k].)

I will time the different versions in a proper manner anyway.


#7

Close (but exactly backwards :wink:). In a normal Julia array, the values arr[1, j, k] and arr[2, j, k] are stored next to each other in memory, so it’s most efficient to slice along the first dimension, since all those elements live next to each other. Likewise, slicing along the first N dimensions will be more efficient than the last N dimensions.
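A quick way to see this in current Julia (a sketch; `@view` avoids the copy a plain slice would make): a first-dimension slice is contiguous in memory, while a slice that fixes only the first index is strided.

```julia
A = rand(4, 5, 6)  # strides in elements: (1, 4, 20)

first_dim = @view A[:, 2, 3]  # contiguous: consecutive memory, stride 1
last_dims = @view A[1, :, :]  # strided: elements 4 apart, then 20 apart

strides(first_dim)  # (1,)
strides(last_dims)  # (4, 20)
```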

By the way, if you find it hard to remember which order to use when looping (I do), then I have good news: for looping over all the elements of an array, there’s an easier, better way. Just do:

for i in eachindex(arr)
  arr[i] += 1
end

eachindex() is a Julia function that will produce the “right” kind and order of indices for a particular array-like object of any number of dimensions. For a normal Array it will produce indices in memory order, but it will also do the most efficient thing for a sparse array, some kind of weird row-major array, or lots of other array-like types. For example:

julia> @benchmark for i in eachindex($arr); $arr[i] += 1; end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     120.798 ms (0.00% GC)
  median time:      121.303 ms (0.00% GC)
  mean time:        121.723 ms (0.00% GC)
  maximum time:     128.112 ms (0.00% GC)
  --------------
  samples:          42
  evals/sample:     1

just as fast as the correctly-ordered nested loops, but with no need to write them out.
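One more hedged sketch of why `eachindex` is worth the habit: for a plain `Array` it yields cheap linear indices, while for a non-contiguous view it automatically switches to `CartesianIndex` iteration, and the same generic loop body works for both (`sum_all` is an illustrative name, not a Base function).

```julia
A = rand(3, 4)
v = @view A[1:2, :]  # a non-contiguous view of A

eachindex(A)  # Base.OneTo(12): fast linear indices for a plain Array
eachindex(v)  # CartesianIndices((2, 4)): per-dimension indices for the view

# The same loop body handles both index types efficiently:
sum_all(a) = (s = 0.0; for i in eachindex(a); s += a[i]; end; s)
sum_all(A) ≈ sum(A)  # true
sum_all(v) ≈ sum(v)  # true
```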


#8

Haha, thanks for your tolerance, I obviously did not think long enough before typing. Great additional information. Thank you for everything.