Are SharedArrays bypassing cache?

Hi, I am testing parallel computations based on multithreading and multiprocessing. Since my target is iterative solvers for PDEs, I am mainly interested in large loops. An interesting benchmark in this respect is the “Schönauer vector triad”; see e.g. the benchmarking site of Georg Hager.

(Here is the generating code)

Running this test in the scalar case, I see a striking performance difference between shared and normal arrays. For large arrays the GFlop/s rates converge: at that point the normal arrays no longer fit into the L3 cache, so their performance is limited by memory access as well. This leads me to conclude that SharedArrays bypass the cache completely, which IMHO would be understandable, as there needs to be some way to keep data coherent. I also suspect that this is essentially due to the design of POSIX shared memory at the OS level and that Julia cannot do much about it. I googled for more evidence on this but didn’t find any reasonable source.
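For reference, here is a minimal sketch of how a GFlop/s figure for the triad can be derived from a timing. It assumes 2 floating-point operations per loop iteration (one multiply, one add) and a working set of four Float64 vectors. The helper names `triad_gflops` and `working_set_bytes` are made up for this illustration, and the timing uses plain `@elapsed` from Base rather than BenchmarkTools, so the numbers are only rough.

```julia
# Schönauer triad kernel: one multiply and one add per element.
function vtriad!(d, a, b, c)
    @inbounds for i in eachindex(d)
        d[i] = a[i] + b[i] * c[i]
    end
    return d
end

# Hypothetical helper: bytes touched by the four Float64 vectors of length N.
working_set_bytes(N) = 4 * N * sizeof(Float64)

# Hypothetical helper: repeat the kernel `reps` times per timed sample,
# keep the fastest of `nruns` samples, and convert to GFlop/s.
function triad_gflops(N; reps = 1000, nruns = 5)
    a, b, c, d = rand(N), rand(N), rand(N), zeros(N)
    vtriad!(d, a, b, c)  # warm-up call so that compilation is not timed
    t = minimum(1:nruns) do _
        @elapsed for _ in 1:reps
            vtriad!(d, a, b, c)
        end
    end
    return 2 * N * reps / t / 1e9  # 2 flops per element and repetition
end
```

Comparing `working_set_bytes(N)` with the cache sizes of the machine shows at which N the vectors stop fitting into a given cache level, which is where a drop in `triad_gflops(N)` would be expected for normal arrays.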

Am I missing something?

Please see also the MWE:

using SharedArrays
using BenchmarkTools

function vtriad(N,a,b,c,d)
    @inbounds @fastmath  for i=1:N
        d[i]=a[i]+b[i]*c[i]
    end
end

function runtest(N)
    a = rand(N)
    b = rand(N)
    c = rand(N)
    d = rand(N)
    
    @btime vtriad($N,$a,$b,$c,$d)
    
    sa = SharedArray(a)
    sb = SharedArray(b)
    sc = SharedArray(c)
    sd = SharedArray(d)

    @btime vtriad($N,$sa,$sb,$sc,$sd)
end

runtest(1000)

Update: surprisingly, @avx makes a difference here, thanks to @tbeason for the hint:

This would mean that optimization for shared arrays on the Julia side matters, and the OS is not at play here…

Still need to check if the computed results are correct, though (after some admin homework :frowning: )

Here is the updated MWE:

using SharedArrays
using BenchmarkTools
using LoopVectorization

function vtriad(N,a,b,c,d)
    @avx  for i=1:N
        d[i]=a[i]+b[i]*c[i]
    end
end

function runtest(N)
    a = rand(N)
    b = rand(N)
    c = rand(N)
    d = rand(N)
    
    @btime vtriad($N,$a,$b,$c,$d)
    
    sa = SharedArray(a)
    sb = SharedArray(b)
    sc = SharedArray(c)
    sd = SharedArray(d)

    @btime vtriad($N,$sa,$sb,$sc,$sd)
end

runtest(1000)

This is very surprising. Perhaps @Elrod could explain how @avx is working here, and say whether there is a possible data race.

I think the problem is here.

Shouldn’t the getindex and setindex! definitions be @propagate_inbounds?

The @code_native shows it isn’t vectorized, and that there are bounds checks in the hot loop.

@avx doesn’t do bounds checks, which is why it was as fast as normal.
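To illustrate the mechanism, here is a minimal sketch with two made-up wrapper types (`PlainWrapper` and `PropagatingWrapper` are hypothetical names, and this is not the SharedArrays source). Both forward indexing to an inner `Vector`. By default, `@inbounds` at the call site only reaches one level of inlining deep, so the inner array access of the plain wrapper keeps its bounds check. `Base.@propagate_inbounds` passes the caller's inbounds context on to the inner access.

```julia
# Hypothetical wrapper that forwards indexing WITHOUT propagating @inbounds.
struct PlainWrapper{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::PlainWrapper) = size(w.data)
Base.IndexStyle(::Type{<:PlainWrapper}) = IndexLinear()
# The inner w.data[i] is a second level of calls, so its bounds check
# is expected to stay even if the caller uses @inbounds.
Base.getindex(w::PlainWrapper, i::Int) = w.data[i]
Base.setindex!(w::PlainWrapper, v, i::Int) = (w.data[i] = v)

# Hypothetical wrapper that DOES propagate the caller's @inbounds.
struct PropagatingWrapper{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::PropagatingWrapper) = size(w.data)
Base.IndexStyle(::Type{<:PropagatingWrapper}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(w::PropagatingWrapper, i::Int) = w.data[i]
Base.@propagate_inbounds Base.setindex!(w::PropagatingWrapper, v, i::Int) = (w.data[i] = v)

# Same triad kernel as in the MWE, with an explicit length check up front
# so that the @inbounds loop is safe.
function vtriad!(d, a, b, c)
    length(a) == length(b) == length(c) == length(d) ||
        throw(DimensionMismatch("all vectors must have the same length"))
    @inbounds for i in eachindex(d)
        d[i] = a[i] + b[i] * c[i]
    end
    return d
end

a, b, c = rand(1000), rand(1000), rand(1000)
d_plain = vtriad!(PlainWrapper(zeros(1000)), PlainWrapper(a), PlainWrapper(b), PlainWrapper(c))
d_prop  = vtriad!(PropagatingWrapper(zeros(1000)), PropagatingWrapper(a),
                  PropagatingWrapper(b), PropagatingWrapper(c))
```

Both calls compute the same result. The difference should only show up in the generated code, so comparing `@code_native` (or timing with `@btime`) for the two wrapper types is the way to check whether the bounds checks are gone and the loop vectorizes on a given machine and Julia version.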


:+1: