Hi, I am testing parallel computations based on multithreading/multiprocessing. With iterative solvers for PDEs in mind, I am interested in large loops. An interesting benchmark in this respect is the “Schönauer vector triad”; see e.g. the benchmarking site of Georg Hager.
(Here is the generating code)
Performing this test in the scalar case, I see a striking performance difference between shared and normal arrays. For large arrays the GFlop/s rates converge; in that regime the performance of normal arrays is also limited by memory access, since the data no longer fit into the L3 cache. This leads me to conclude that SharedArrays bypass the cache completely, which IMHO would be understandable, as there needs to be some way to keep the data coherent. I also suspect that this is essentially due to the design of POSIX shared memory at the OS level and that Julia cannot do much about it. I googled for more evidence on this, but didn’t find any reliable source.
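For reference, here is a sketch of how a GFlop/s rate for the triad can be computed (an assumption on my part: I count 2 floating-point operations per iteration, one multiply and one add, and use `@belapsed` from BenchmarkTools to get the minimum runtime in seconds):

```julia
using BenchmarkTools

# Schönauer vector triad kernel: d[i] = a[i] + b[i]*c[i]
function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

# 2 flops per iteration (one multiply, one add), so rate = 2N / t.
function gflops(N)
    a, b, c, d = rand(N), rand(N), rand(N), rand(N)
    t = @belapsed vtriad($N, $a, $b, $c, $d)
    return 2N / t / 1e9
end

gflops(1000)
```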
Am I missing something?
Please see also the MWE:
```julia
using SharedArrays
using BenchmarkTools

# Schönauer vector triad kernel: d[i] = a[i] + b[i]*c[i]
function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

function runtest(N)
    # plain Arrays
    a = rand(N)
    b = rand(N)
    c = rand(N)
    d = rand(N)
    @btime vtriad($N, $a, $b, $c, $d)

    # SharedArrays initialized from the same data
    sa = SharedArray(a)
    sb = SharedArray(b)
    sc = SharedArray(c)
    sd = SharedArray(d)
    @btime vtriad($N, $sa, $sb, $sc, $sd)
end

runtest(1000)
```
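One check I could imagine (a sketch, not something I have verified systematically): if the gap came from the `SharedArray` indexing wrapper rather than from cache behavior, then benchmarking the plain `Array` backing a `SharedArray`, obtained via `sdata` from the SharedArrays stdlib, should recover the normal-array performance on the very same memory:

```julia
using SharedArrays
using BenchmarkTools

function vtriad(N, a, b, c, d)
    @inbounds @fastmath for i = 1:N
        d[i] = a[i] + b[i] * c[i]
    end
end

N = 1000
sa = SharedArray(rand(N))
sb = SharedArray(rand(N))
sc = SharedArray(rand(N))
sd = SharedArray(rand(N))

# Benchmark the SharedArrays directly ...
@btime vtriad($N, $sa, $sb, $sc, $sd)
# ... and the plain Arrays backing them, via sdata:
@btime vtriad($N, $(sdata(sa)), $(sdata(sb)), $(sdata(sc)), $(sdata(sd)))
```

If the second timing matches the normal-array numbers, the difference would be attributable to the array type rather than to the shared memory itself.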