We all know how important it is to keep data that we process “local”. That should help the CPU memory pipeline to supply the data quickly.
One of the standard operations in the finite element method is the “gather”: collect some entries from a large array into a much smaller one. This is typically done with coordinates. From the entire mesh, collect just the coordinates of the nodes of the element.
I keep the “mesh” data here in a Matrix
, and similarly the “element” data (buffer). However, the dimensions may be swapped. The question is how to arrange the dimensions, and how to access them in the gather operation: the loops may be arranged in one of two ways.
Below is a code that tries to answer that. But before you run it, try to guess which is the best way of arranging the mesh and element data. When you calculate the answer, it may surprise you.
Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time ?[mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time ?[mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time ?[mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time ?[mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time ?[mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time ?[mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time ?[mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time ?[mus]
I do realize that the answer may depend on the size of the mesh and the number of elements per node. I picked a reasonable default. It might be interesting to find out how the measurements change when the above numbers change.
Code:
module mgather1
using Test
using Random
function test()
N = 1_000_000 # Number of nodes in the mesh
nen = 10 # Number of nodes per element
nloops = 2*N
indexes = randperm(N)[1:nen]
buffnen3 = rand(nen, 3)
buff3nen = rand(3, nen)
dataN3 = rand(N, 3)
data3N = Matrix(transpose(dataN3))
t1 = @elapsed for l in 1:nloops
for i in 1:nen
ii = indexes[i]
for j in 1:3
buffnen3[i, j] = dataN3[ii, j]
end
end
end
t2 = @elapsed for l in 1:nloops
for j in 1:3
for i in 1:nen
ii = indexes[i]
buffnen3[i, j] = dataN3[ii, j]
end
end
end
t3 = @elapsed for l in 1:nloops
for i in 1:nen
ii = indexes[i]
for j in 1:3
buff3nen[j, i] = dataN3[ii, j]
end
end
end
t4 = @elapsed for l in 1:nloops
for j in 1:3
for i in 1:nen
ii = indexes[i]
buff3nen[j, i] = dataN3[ii, j]
end
end
end
t5 = @elapsed for l in 1:nloops
for i in 1:nen
ii = indexes[i]
for j in 1:3
buffnen3[i, j] = data3N[j, ii]
end
end
end
t6 = @elapsed for l in 1:nloops
for j in 1:3
for i in 1:nen
ii = indexes[i]
buffnen3[i, j] = data3N[j, ii]
end
end
end
t7 = @elapsed for l in 1:nloops
for i in 1:nen
ii = indexes[i]
for j in 1:3
buff3nen[j, i] = data3N[j, ii]
end
end
end
t8 = @elapsed for l in 1:nloops
for j in 1:3
for i in 1:nen
ii = indexes[i]
buff3nen[j, i] = data3N[j, ii]
end
end
end
[t1, t2, t3, t4, t5, t6, t7, t8] ./ nloops .* 1e6 # In microseconds
end
end
using Main.mgather1
ts = [0.0 for i in 1:8]
ntries = 100
for i in 1:ntries
ts .+= mgather1.test()
end
ts ./= ntries
ts = Float32.(ts)
println("Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time $(ts[1]) [mus]")
println("Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time $(ts[2]) [mus]")
println("Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time $(ts[3]) [mus]")
println("Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time $(ts[4]) [mus]")
println("Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time $(ts[5]) [mus]")
println("Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time $(ts[6]) [mus]")
println("Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time $(ts[7]) [mus]")
println("Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time $(ts[8]) [mus]")
By the way: if you do run the timing on your system, it will be interesting to post your results here. The optimal way might depend on the number of memory channels, various caches and so on. My particular timing was
Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time 0.47031897 [mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time 0.23725024 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time 0.4499188 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time 0.25969163 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time 0.3611833 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time 0.23566832 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time 0.24359263 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time 0.26704985 [mus]
on
julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
JULIA_EDITOR = "C:\Users\PK\AppData\Local\atom\app-1.40.1\atom.exe" -a
JULIA_NUM_THREADS = 2
PS: Surprisingly, 1.3 seems considerably faster than 1.2 ( same machine, obviously):
Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time 0.07714446 [mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time 0.047614664 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time 0.08304247 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time 0.045349218 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time 0.0655346 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time 0.046741426 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time 0.05032052 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time 0.048665706 [mus]