Computational performance 3D finite difference stencil for different vectorization methods and precision

Chiil · January 11, 2022, 9:34am

I have a code that generates with a macro a finite difference stencil. I have disabled the macro in order to make a minimal working example, so please forgive me for this absurdly long line, it is not there in my normal code. I have three questions concerning the performance of two functions, one with LoopVectorization and one without.

The performance difference is a factor 3. Is there a way to get the _ref function faster without having to use LoopVectorization, because the compilation time is very long (~250 s with LoopVectorization, ~ 1s without).
Is there a way to use precompilation here to avoid this 250 s?
Why is the performance gain of the _ref function so minimal if I move to Float32 while the function with @tturbo gains nearly a factor 2?

## Packages and settings.
using BenchmarkTools
using LoopVectorization

const float_type = Float64
const half = convert(float_type, 1/2)


## CPU dynamics kernel.
function dynamics_w_kernel!(
    wt, u, v, w, s,
    visc, alpha,
    dxi, dyi, dzi, dzhi,
    is, ie, js, je, ks, ke)

    @tturbo for k in ks:ke
        for j in js:je
            for i in is:ie
                # @fd (wt, u, v, w, s) wt += (
                #     - gradx(interpz(u) * interpx(w)) + visc * (gradx(gradx(w)))
                #     - grady(interpz(v) * interpy(w)) + visc * (grady(grady(w)))
                #     - gradz(interpz(w) * interpz(w)) + visc * (gradz(gradz(w)))
                #     + alpha*interpz(s) )
                wt[i, j, k] += ((((-((-1 * ((half * u[i, j, k - 1] + half * u[i, j, k]) * (half * w[i - 1, j, k] + half * w[i, j, k])) + 1 * ((half * u[i + 1, j, k - 1] + half * u[i + 1, j, k]) * (half * w[i, j, k] + half * w[i + 1, j, k]))) * dxi) + visc * ((-1 * ((-1 * w[i - 1, j, k] + 1 * w[i, j, k]) * dxi) + 1 * ((-1 * w[i, j, k] + 1 * w[i + 1, j, k]) * dxi)) * dxi)) - (-1 * ((half * v[i, j, k - 1] + half * v[i, j, k]) * (half * w[i, j - 1, k] + half * w[i, j, k])) + 1 * ((half * v[i, j + 1, k - 1] + half * v[i, j + 1, k]) * (half * w[i, j, k] + half * w[i, j + 1, k]))) * dyi) + visc * ((-1 * ((-1 * w[i, j - 1, k] + 1 * w[i, j, k]) * dyi) + 1 * ((-1 * w[i, j, k] + 1 * w[i, j + 1, k]) * dyi)) * dyi)) - (-1 * ((half * w[i, j, k - 1] + half * w[i, j, k]) * (half * w[i, j, k - 1] + half * w[i, j, k])) + 1 * ((half * w[i, j, k] + half * w[i, j, k + 1]) * (half * w[i, j, k] + half * w[i, j, k + 1]))) * dzhi[k]) + visc * ((-1 * ((-1 * w[i, j, k - 1] + 1 * w[i, j, k]) * dzi[k - 1]) + 1 * ((-1 * w[i, j, k] + 1 * w[i, j, k + 1]) * dzi[k])) * dzhi[k]) + alpha * (half * s[i, j, k - 1] + half * s[i, j, k])
            end
        end
    end
end


function dynamics_w_kernel_ref!(
    wt, u, v, w, s,
    visc, alpha,
    dxi, dyi, dzi, dzhi,
    is, ie, js, je, ks, ke)

    Threads.@threads for k in ks:ke
        for j in js:je
            @simd for i in is:ie
                @inbounds wt[i, j, k] += ((((-((-1 * ((half * u[i, j, k - 1] + half * u[i, j, k]) * (half * w[i - 1, j, k] + half * w[i, j, k])) + 1 * ((half * u[i + 1, j, k - 1] + half * u[i + 1, j, k]) * (half * w[i, j, k] + half * w[i + 1, j, k]))) * dxi) + visc * ((-1 * ((-1 * w[i - 1, j, k] + 1 * w[i, j, k]) * dxi) + 1 * ((-1 * w[i, j, k] + 1 * w[i + 1, j, k]) * dxi)) * dxi)) - (-1 * ((half * v[i, j, k - 1] + half * v[i, j, k]) * (half * w[i, j - 1, k] + half * w[i, j, k])) + 1 * ((half * v[i, j + 1, k - 1] + half * v[i, j + 1, k]) * (half * w[i, j, k] + half * w[i, j + 1, k]))) * dyi) + visc * ((-1 * ((-1 * w[i, j - 1, k] + 1 * w[i, j, k]) * dyi) + 1 * ((-1 * w[i, j, k] + 1 * w[i, j + 1, k]) * dyi)) * dyi)) - (-1 * ((half * w[i, j, k - 1] + half * w[i, j, k]) * (half * w[i, j, k - 1] + half * w[i, j, k])) + 1 * ((half * w[i, j, k] + half * w[i, j, k + 1]) * (half * w[i, j, k] + half * w[i, j, k + 1]))) * dzhi[k]) + visc * ((-1 * ((-1 * w[i, j, k - 1] + 1 * w[i, j, k]) * dzi[k - 1]) + 1 * ((-1 * w[i, j, k] + 1 * w[i, j, k + 1]) * dzi[k])) * dzhi[k]) + alpha * (half * s[i, j, k - 1] + half * s[i, j, k])
            end
        end
    end
end


## User input
itot = 384; jtot = 384; ktot = 384;
igc = 1; jgc = 1; kgc = 1;

icells = itot + 2igc; jcells = jtot + 2jgc; kcells = ktot + 2kgc
is = igc+1; ie = igc+itot; js = jgc+1; je = jgc+jtot; ks = kgc+1; ke = kgc+ktot

wt = rand(float_type, icells, jcells, kcells)
u = rand(float_type, icells, jcells, kcells)
v = rand(float_type, icells, jcells, kcells)
w = rand(float_type, icells, jcells, kcells)
s = rand(float_type, icells, jcells, kcells)
dxi = rand(float_type)
dyi = rand(float_type)
dzi = rand(float_type, kcells)
dzhi = rand(float_type, kcells)
visc = convert(float_type, 1)
alpha = convert(float_type, 9.81/300)


## Benchmark
for i in 1:10
    @time dynamics_w_kernel!(
        wt, u, v, w, s,
        visc, alpha,
        dxi, dyi, dzi, dzhi,
        is, ie, js, je, ks, ke)
end

for i in 1:10
    @time dynamics_w_kernel_ref!(
        wt, u, v, w, s,
        visc, alpha,
        dxi, dyi, dzi, dzhi,
        is, ie, js, je, ks, ke)
end

Chiil · January 13, 2022, 2:30pm

Any thoughts? I could use some good advice here.

Topic		Replies	Views
LoopVectorization triggers segfault or deadlock for complex finite difference stencil Performance loopvectorization	1	457	July 27, 2021
Loop vs vectorization Performance	4	1233	October 10, 2023
Can't understand what LoopVectorization is doing General Usage	7	752	September 1, 2021
Optimizing nested for loops and memory allocation in finite differences routine Performance	3	268	September 19, 2022
Understanding Julia performance in simple finite difference code New to Julia	38	4389	June 7, 2018

Computational performance 3D finite difference stencil for different vectorization methods and precision

Related topics