How to use threads in a reduction with LoopVectorization?

I have the following function. How do I make it parallel with LoopVectorization? In C++ I would add a reduction clause and the associated variable to my OpenMP #pragma, but in Julia I do not know the best-performing way to do so.

function divergence_kernel(
    u, v, w,
    dxi, dyi, dzi,
    is, ie, js, je, ks, ke)

    divmax = 0.0  # use a float initializer so the accumulator is type-stable
    @inbounds for k in ks:ke
        for j in js:je
            for i in is:ie
                div = @fd (u, v, w) gradx(u) + grady(v) + gradz(w)
                divmax = max(divmax, div)
            end
        end
    end

    return divmax
end

Could you add code to copy/paste and run?
LoopVectorization should figure out that divmax is a reduction on its own.
You can use @turbo thread=true or @tturbo for threads.
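
To illustrate, here is a minimal sketch (not from the original post) of a threaded max-reduction, assuming the arrays have matching sizes; @tturbo recognizes that m is a reduction without any extra annotation:

```julia
using LoopVectorization

# minimal sketch: LoopVectorization detects the max-reduction on m automatically
function maxsum(u, v, w)
    m = zero(eltype(u))
    @tturbo for i in eachindex(u)  # equivalently: @turbo thread=true for i in ...
        m = max(m, u[i] + v[i] + w[i])
    end
    return m
end
```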


This would be a minimal working example, and indeed, LoopVectorization already handles it perfectly. I was just conditioned by OpenMP never to put parallel statements in front of a reduction loop without a reduction clause. So I guess that closes my question.

## Packages
using BenchmarkTools
using LoopVectorization


## Reference kernel
function divergence_ref(
    u, v, w,
    is, ie, js, je, ks, ke)

    divmax = 0.0  # float initializer keeps the accumulator type-stable
    @inbounds for k in ks:ke
        for j in js:je
            for i in is:ie
                div = u[i, j, k] + v[i, j, k] + w[i, j, k]
                divmax = max(abs(div), divmax)
            end
        end
    end

    return divmax
end


## Fast kernel
function divergence(
    u, v, w,
    is, ie, js, je, ks, ke)

    divmax = 0.0  # float initializer keeps the accumulator type-stable
    @tturbo for k in ks:ke
        for j in js:je
            for i in is:ie
                div = u[i, j, k] + v[i, j, k] + w[i, j, k]
                divmax = max(abs(div), divmax)
            end
        end
    end

    return divmax
end


## User input
itot = 384; jtot = 384; ktot = 384;
igc = 1; jgc = 1; kgc = 1;

icells = itot + 2igc; jcells = jtot + 2jgc; kcells = ktot + 2kgc
is = igc+1; ie = igc+itot; js = jgc+1; je = jgc+jtot; ks = kgc+1; ke = kgc+ktot

u = rand(icells, jcells, kcells)
v = rand(icells, jcells, kcells)
w = rand(icells, jcells, kcells)


## Run reference kernel
div_ref = divergence_ref(
    u, v, w,
    is, ie, js, je, ks, ke)

@btime dummy = divergence_ref(
    $u, $v, $w,
    $is, $ie, $js, $je, $ks, $ke)


## Run fast kernel
div = divergence(
    u, v, w,
    is, ie, js, je, ks, ke)

@btime divergence(
    $u, $v, $w,
    $is, $ie, $js, $je, $ks, $ke)


## Compare results
println(div_ref, " ", div, " ", isapprox(div, div_ref))

If you’re looking for reductions with near-optimal iteration orders, you can also check out Tullio.jl. It works well for large arrays, and combined with LoopVectorization.jl (just load both at the same time) you should get a good boost.
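
For reference, the max-abs reduction above could be sketched in Tullio like this (my own sketch, using the parenthesized reduction-operator syntax from Tullio's documentation; array size chosen arbitrarily):

```julia
using Tullio, LoopVectorization  # loading both lets Tullio generate @turbo loops

u = rand(32, 32, 32)

# (max) selects the reduction operator; divmax becomes the max of abs(u) over all indices
@tullio (max) divmax := abs(u[i, j, k])
```

This should give the same result as maximum(abs, u), with Tullio choosing the loop order and threading.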
