3D Matrix vs Array of 2D Matrices using Comprehensions

I am trying to populate a large, dense 3D matrix (3000 x 18 x 1000). I tried iterating over all positions of the 3D matrix (testing various row/column iteration orders) - see Attempt 1. I also tried array comprehensions on an array of matrices - see Attempt 2. The timing for both is about the same (see below). Does anyone have suggestions for how to further speed this up? The examples below have the first dimension at 50 instead of 3000, but I need to scale up to 3000 (see the pointNumber dimension in the code).

Attempt 1 - iterate over all positions of 3D array with nested for loop

Summary
using LinearAlgebra
using BenchmarkTools

function greensMatrix3D1(matchCoordinates,axisCoordinates,zArray)
     
    greensMatrix = zeros(Float64,length(matchCoordinates),length(axisCoordinates),length(zArray))
    
    for v = 1:length(matchCoordinates),q = 1:length(axisCoordinates),k=1:length(zArray)
        rvec = matchCoordinates[v] - [axisCoordinates[q][1],axisCoordinates[q][2],zArray[k]]
        rivec = rvec + [0, 2*axisCoordinates[q][2],0]
        r = norm(rvec)
        ri = norm(rivec)
        greensMatrix[v,q,k] = (1/r - 1/ri) 
    end
    
    return greensMatrix   
end

pointNumber = 50 #target 3000
matchCoordinates = [[10,i*0.02,0] for i in 1:pointNumber]
axisCoordinates = [[i-1,10,0] for i in 1:18]
zArray = range(-100,100,length=1024)

@btime output1 = greensMatrix3D1(matchCoordinates,axisCoordinates,zArray)

Attempt 2 - use comprehensions and an array of matrices

Summary
using LinearAlgebra

function oneGreens(xyz1,xyz2)
    rvec = xyz1 - xyz2
    rivec = rvec + [0, 2*xyz2[2],0]
    r = norm(rvec)
    ri = norm(rivec)
    greensMatrix = (1/r - 1/ri)
    return greensMatrix
end

function greensMatrix3D(matchCoordinates,axisCoordinates,zArray)
     
    greensMatrixArray = [zeros(Float64,length(matchCoordinates),length(axisCoordinates)) for _ in 1:length(zArray)]
    for i = 1:length(zArray)
        axisTempCoordinates = axisCoordinates
        setindex!.(axisTempCoordinates,zArray[i],3)
        greensMatrixArray[i] = [oneGreens(j,k) for j in matchCoordinates, k in axisTempCoordinates]
    end
    
    return greensMatrixArray   
end

pointNumber = 50 #target 3000
matchCoordinates = [[10.0,i*0.02,0.0] for i in 1:pointNumber]
axisCoordinates = [[i-1.0,10.0,0.0] for i in 1:18]
zArray = range(-100.0,100.0,length=1024)

@btime output2 = greensMatrix3D(matchCoordinates,axisCoordinates,zArray)

The timing results for the above are:
Attempt 1:>>> 1.826 s (8294402 allocations: 513.28 MiB)
Attempt 2:>>> 1.673 s (7375873 allocations: 422.41 MiB)

When you see that many allocations from your benchmark, that should be the first clue that something is sub-optimal. In particular, your function allocates brand new arrays in several places in your inner loop:

rvec = matchCoordinates[v] - [axisCoordinates[q][1],axisCoordinates[q][2],zArray[k]]

This allocates two new arrays, one holding the elements on the right-hand-side of the - and then another one for the result of the subtraction.

rivec = rvec + [0, 2*axisCoordinates[q][2],0]

likewise, this line allocates two more arrays. Since this is happening inside your loop, you end up seeing a huge number of allocation events when filling your whole matrix.

Since your matches and axes consist of a large number of small arrays, this is a textbook use case for https://github.com/JuliaArrays/StaticArrays.jl

For example, by storing the matchCoordinates and axisCoordinates as static arrays (SVectors, to be specific) and constructing more static arrays inside the loop instead of plain arrays, we can reduce the allocations from 8294402 to just 2:

using StaticArrays


function greensMatrix3D3(matchCoordinates,axisCoordinates,zArray)
    
   greensMatrix = zeros(Float64,length(matchCoordinates),length(axisCoordinates),length(zArray))
   
   for v = 1:length(matchCoordinates),q = 1:length(axisCoordinates),k=1:length(zArray)
       rvec = matchCoordinates[v] - SVector(axisCoordinates[q][1], axisCoordinates[q][2], zArray[k])
       rivec = rvec + SVector(0, 2*axisCoordinates[q][2], 0)
       r = norm(rvec)
       ri = norm(rivec)
       greensMatrix[v,q,k] = (1/r - 1/ri) 
   end
   
   return greensMatrix   
end

pointNumber = 50 #target 3000
matchCoordinates = [SVector(10, i * 0.02, 0) for i in 1:pointNumber]
axisCoordinates = [SVector(i-1, 10, 0) for i in 1:18]
zArray = range(-100,100,length=1024)
@btime greensMatrix3D3($matchCoordinates, $axisCoordinates, $zArray);

Output:

  7.016 ms (2 allocations: 7.03 MiB)

That’s already 250 times faster, but there’s even more we can do.

Adding @inbounds helps a little bit:

@inbounds for v = 1:length(matchCoordinates),q = 1:length(axisCoordinates),k=1:length(zArray)
  ...
end

giving a new time of 6.872 ms (2 allocations: 7.03 MiB).

Much more helpful is fixing the order of the nested loops. It turns out that your loop order was exactly backwards (see https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-column-major ), but it’s easy to fix:

function greensMatrix3D5(matchCoordinates,axisCoordinates,zArray)
     
    greensMatrix = zeros(Float64,length(matchCoordinates),length(axisCoordinates),length(zArray))
    
    @inbounds for k=1:length(zArray), q = 1:length(axisCoordinates), v = 1:length(matchCoordinates)
        rvec = matchCoordinates[v] - SVector(axisCoordinates[q][1], axisCoordinates[q][2], zArray[k])
        rivec = rvec + SVector(0, 2*axisCoordinates[q][2], 0)
        r = norm(rvec)
        ri = norm(rivec)
        greensMatrix[v,q,k] = (1/r - 1/ri) 
    end
    
    return greensMatrix   
end

pointNumber = 50 #target 3000
matchCoordinates = [SVector(10, i * 0.02, 0) for i in 1:pointNumber]
axisCoordinates = [SVector(i-1, 10, 0) for i in 1:18]
zArray = range(-100,100,length=1024)
@btime greensMatrix3D5($matchCoordinates, $axisCoordinates, $zArray);

That gives a final result of:

  3.301 ms (2 allocations: 7.03 MiB)

or about 500 times faster than the original function. There’s probably even more that can be done, but this is a pretty good return on a small investment of effort.

9 Likes

By the way, with the StaticArrays change and the fixed loop order, computing 3000 points takes just 276ms:

pointNumber = 3000 #target 3000
matchCoordinates = [SVector(10, i * 0.02, 0) for i in 1:pointNumber]
axisCoordinates = [SVector(i-1, 10, 0) for i in 1:18]
zArray = range(-100,100,length=1024)
@btime greensMatrix3D5($matchCoordinates, $axisCoordinates, $zArray);

output:

  276.100 ms (2 allocations: 421.88 MiB)
7 Likes

Excellent. Thank you!

I’m trying to understand this column major concept a bit better. If the rows and columns are of unequal size and we have flexibility in how those are assigned, is it better to assign the larger dimension to rows?

No, not necessarily. The key is to ensure that sequential access to the data mostly happens along the first dimension. For example, summation along each column of a matrix is faster than summation along each row, so if your algorithm requires you to sum each 1-D slice, then you should arrange your matrix so that that summation happens along columns instead of rows. In other cases, it may be easier to change your algorithm (as I did above) to iterate over the data in a different order without any changes to how that data is actually arranged. The term to look up to learn more about this is “cache locality”.

4 Likes