Hi, I am testing parallel computations based on multithreading/multiprocessing. With iterative solvers for PDEs in mind, I am interested in large loops. An interesting benchmark in this respect is the “Schönauer vector triad”; see e.g. the benchmarking site of Georg Hager.
(Here is the generating code)
Performing this test in the scalar case, I see a striking performance difference between shared and normal arrays. For large arrays the GFlop/s rates converge; in that regime the performance of normal arrays is also limited by memory access, since the data no longer fit into the L3 cache. This leads me to conclude that SharedArrays bypass the cache completely, which IMHO would be understandable, as there needs to be some way to keep the data coherent. I also suspect that this is essentially due to the design of POSIX shared memory at the OS level and that Julia cannot do much about it. I googled for more evidence on this, but didn’t find any reliable source.
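For reference, here is a sketch of how a GFlop/s rate for the triad can be computed (an assumption on my part: I count 2 floating-point operations per iteration, one multiply and one add, and use `@belapsed` from BenchmarkTools to get the minimum runtime in seconds):

```julia
using BenchmarkTools

# Schönauer vector triad kernel: d[i] = a[i] + b[i]*c[i]
function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

# 2 flops per iteration (one multiply, one add), so rate = 2N / t.
function gflops(N)
    a, b, c, d = rand(N), rand(N), rand(N), rand(N)
    t = @belapsed vtriad($N, $a, $b, $c, $d)
    return 2N / t / 1e9
end

gflops(1000)
```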
Am I missing something?
Please see also the MWE:
```julia
using SharedArrays
using BenchmarkTools

# Schönauer vector triad kernel: d[i] = a[i] + b[i]*c[i]
function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

function runtest(N)
    # plain Arrays
    a = rand(N)
    b = rand(N)
    c = rand(N)
    d = rand(N)
    @btime vtriad($N, $a, $b, $c, $d)

    # SharedArrays initialized from the same data
    sa = SharedArray(a)
    sb = SharedArray(b)
    sc = SharedArray(c)
    sd = SharedArray(d)
    @btime vtriad($N, $sa, $sb, $sc, $sd)
end

runtest(1000)
```
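One check I could imagine (a sketch, not something I have verified systematically): if the gap came from the `SharedArray` indexing wrapper rather than from cache behavior, then benchmarking the plain `Array` backing a `SharedArray`, obtained via `sdata` from the SharedArrays stdlib, should recover the normal-array performance on the very same memory:

```julia
using SharedArrays
using BenchmarkTools

function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

N = 1000
sa = SharedArray(rand(N))
sb = SharedArray(rand(N))
sc = SharedArray(rand(N))
sd = SharedArray(rand(N))

# Benchmark the SharedArrays directly ...
@btime vtriad($N, $sa, $sb, $sc, $sd)
# ... and the plain Arrays backing them, via sdata:
@btime vtriad($N, $(sdata(sa)), $(sdata(sb)), $(sdata(sc)), $(sdata(sd)))
```

If the second timing matches the normal-array numbers, the difference would be attributable to the array type rather than to the shared memory itself.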