I hope you like brainteasers

We all know how important it is to keep data that we process “local”. That should help the CPU memory pipeline to supply the data quickly.

One of the standard operations in the finite element method is the “gather”: collect some entries from a large array into a much smaller one. This is typically done with coordinates. From the entire mesh, collect just the coordinates of the nodes of the element.

I keep the “mesh” data here in a Matrix, and similarly the “element” data (buffer). However, the dimensions may be swapped. The question is how to arrange the dimensions, and how to access them in the gather operation: the loops may be arranged in one of two ways.

Below is a code that tries to answer that. But before you run it, try to guess which is the best way of arranging the mesh and element data. When you calculate the answer, it may surprise you.

Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time ?[mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time ?[mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time ?[mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time ?[mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time ?[mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time ?[mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time ?[mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time ?[mus]

I do realize that the answer may depend on the size of the mesh and the number of elements per node. I picked a reasonable default. It might be interesting to find out how the measurements change when the above numbers change.

Code:

module mgather1
using Test
using Random
function test()
  N = 1_000_000 # Number of nodes in the mesh
  nen = 10 # Number of nodes per element
  nloops = 2*N
  indexes = randperm(N)[1:nen]
  buffnen3 = rand(nen, 3)
  buff3nen = rand(3, nen)
  dataN3 = rand(N, 3)
  data3N = Matrix(transpose(dataN3))

  t1 = @elapsed for l in 1:nloops
    for i in 1:nen
      ii = indexes[i]
      for j in 1:3
        buffnen3[i, j] = dataN3[ii, j]
      end
    end
  end

  t2 = @elapsed for l in 1:nloops
    for j in 1:3
      for i in 1:nen
        ii = indexes[i]
        buffnen3[i, j] = dataN3[ii, j]
      end
    end
  end

  t3 = @elapsed for l in 1:nloops
    for i in 1:nen
      ii = indexes[i]
      for j in 1:3
        buff3nen[j, i] = dataN3[ii, j]
      end
    end
  end

  t4 = @elapsed for l in 1:nloops
    for j in 1:3
      for i in 1:nen
        ii = indexes[i]
        buff3nen[j, i] = dataN3[ii, j]
      end
    end
  end


  t5 = @elapsed for l in 1:nloops
    for i in 1:nen
      ii = indexes[i]
      for j in 1:3
        buffnen3[i, j] = data3N[j, ii]
      end
    end
  end

  t6 = @elapsed for l in 1:nloops
    for j in 1:3
      for i in 1:nen
        ii = indexes[i]
        buffnen3[i, j] = data3N[j, ii]
      end
    end
  end

  t7 = @elapsed for l in 1:nloops
    for i in 1:nen
      ii = indexes[i]
      for j in 1:3
        buff3nen[j, i] = data3N[j, ii]
      end
    end
  end

  t8 = @elapsed for l in 1:nloops
    for j in 1:3
      for i in 1:nen
        ii = indexes[i]
        buff3nen[j, i] = data3N[j, ii]
      end
    end
  end

  [t1, t2, t3, t4, t5, t6, t7, t8] ./ nloops .* 1e6 # In microseconds
end
end
using Main.mgather1
ts = [0.0 for i in 1:8]
ntries = 100
for i in 1:ntries
  ts .+= mgather1.test()
end
ts ./= ntries
ts = Float32.(ts)

println("Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time $(ts[1]) [mus]")
println("Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time $(ts[2]) [mus]")
println("Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time $(ts[3]) [mus]")
println("Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time $(ts[4]) [mus]")
println("Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time $(ts[5]) [mus]")
println("Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time $(ts[6]) [mus]")
println("Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time $(ts[7]) [mus]")
println("Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time $(ts[8]) [mus]")

By the way: if you do run the timing on your system, it will be interesting to post your results here. The optimal way might depend on the number of memory channels, various caches and so on. My particular timing was

Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time 0.47031897 [mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time 0.23725024 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time 0.4499188 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time 0.25969163 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time 0.3611833 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time 0.23566832 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time 0.24359263 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time 0.26704985 [mus]

on

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_EDITOR = "C:\Users\PK\AppData\Local\atom\app-1.40.1\atom.exe"  -a
  JULIA_NUM_THREADS = 2

PS: Surprisingly, 1.3 seems considerably faster than 1.2 ( same machine, obviously):

Mesh data N x 3, Element buffer nen x 3, Loop i, j: Time 0.07714446 [mus]
Mesh data N x 3, Element buffer nen x 3, Loop j, i: Time 0.047614664 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop i, j: Time 0.08304247 [mus]
Mesh data N x 3, Element buffer 3 x nen, Loop j, i: Time 0.045349218 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop i, j: Time 0.0655346 [mus]
Mesh data 3 x N, Element buffer nen x 3, Loop j, i: Time 0.046741426 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop i, j: Time 0.05032052 [mus]
Mesh data 3 x N, Element buffer 3 x nen, Loop j, i: Time 0.048665706 [mus]