Question on Julia Blog Post regarding Performance

I was searching around for Julia comparisons to C++ and found this write-up from 2019.

I follow most of it, but one point I was curious about is near the end, where they implement @simd, @fastmath, and @inbounds.

I would have expected to see @inbounds decorating the hot inner loop (I think that is the for Pj in particles line, unless I missed it). Instead, they only decorate it with @simd and then wrap the actual loop calculations in a begin block decorated with @fastmath @inbounds.

It had never occurred to me to wrap such a large chunk of code in a begin block (I only find myself using them in Pluto notebooks), so it might not be unusual at all, just new to me.

Why do that instead of putting the macros on the for loop itself? In fact, why not write @simd @fastmath @inbounds for Pj in particles? Is there a difference, and if so, what is it?
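
To make sure I'm asking about the right thing, here is a stripped-down sketch of the two placements I mean, using a made-up toy function rather than the blog's kernel (Variant B is my guess at the usual spelling; I believe @simd has to be the innermost macro so that it sees the for loop directly):

# Variant A: macro placement as in the blog post
function accum_a!(out, xs)
    @simd for i in eachindex(xs)
        @fastmath @inbounds begin
            out[1] += xs[i] * xs[i]
        end
    end
    return out
end

# Variant B: all annotations on the loop line itself
function accum_b!(out, xs)
    @fastmath @inbounds @simd for i in eachindex(xs)
        out[1] += xs[i] * xs[i]
    end
    return out
end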

The entire code snippet in question, copy/pasted from the above link, is shown below for reference.

"""
    Implementation of @simd, @fastmath, and @inbounds
"""
function P2P_FINAL(particles, g_dgdr, U, J)

    for (i, Pi) in enumerate(particles)
        @simd for Pj in particles

            @fastmath @inbounds begin
                dX1 = Pi.X1 - Pj.X1
                dX2 = Pi.X2 - Pj.X2
                dX3 = Pi.X3 - Pj.X3
                r = sqrt(dX1*dX1 + dX2*dX2 + dX3*dX3)

                if r != 0

                    # g_σ and ∂gσ∂r
                    gsgm, dgsgmdr = g_dgdr(r / Pj.sigma)

                    # K × Γp
                    crss1 = -const4 / r^3 * ( dX2*Pj.Gamma3 - dX3*Pj.Gamma2 )
                    crss2 = -const4 / r^3 * ( dX3*Pj.Gamma1 - dX1*Pj.Gamma3 )
                    crss3 = -const4 / r^3 * ( dX1*Pj.Gamma2 - dX2*Pj.Gamma1 )

                    # U = ∑g_σ(x-xp) * K(x-xp) × Γp
                    U[1, i] += gsgm * crss1
                    U[2, i] += gsgm * crss2
                    U[3, i] += gsgm * crss3

                    # ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp ]
                    # ∂u∂xj(x) += ∑p[(Δxj∂gσ∂r/(σr) − 3Δxjgσ/r^2) K(Δx)×Γp
                    aux = dgsgmdr/(Pj.sigma*r) - 3*gsgm/r^2
                    # j=1
                    J[1, 1, i] += aux * crss1 * dX1
                    J[2, 1, i] += aux * crss2 * dX1
                    J[3, 1, i] += aux * crss3 * dX1
                    # j=2
                    J[1, 2, i] += aux * crss1 * dX2
                    J[2, 2, i] += aux * crss2 * dX2
                    J[3, 2, i] += aux * crss3 * dX2
                    # j=3
                    J[1, 3, i] += aux * crss1 * dX3
                    J[2, 3, i] += aux * crss2 * dX3
                    J[3, 3, i] += aux * crss3 * dX3

                    # Adds the Kronecker delta term
                    aux = -const4 * gsgm / r^3
                    # j=1
                    J[2, 1, i] -= aux * Pj.Gamma3
                    J[3, 1, i] += aux * Pj.Gamma2
                    # j=2
                    J[1, 2, i] += aux * Pj.Gamma3
                    J[3, 2, i] -= aux * Pj.Gamma1
                    # j=3
                    J[1, 3, i] -= aux * Pj.Gamma2
                    J[2, 3, i] += aux * Pj.Gamma1

                end
            end

        end
    end

end

No, it wouldn’t make a difference. Also, I doubt the @simd annotation is doing anything here, given the access pattern in J and the branching on r.
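
One way to convince yourself either way is to inspect the generated code and look for vector instructions. Something along these lines (a rough sketch on a toy loop, not the blog's function):

using InteractiveUtils  # provides @code_llvm

# Toy reduction where @simd usually does enable vectorization
function toy_sum(xs)
    s = 0.0
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    return s
end

@code_llvm debuginfo=:none toy_sum(rand(1000))
# Look for vector types such as <4 x double> in the printed IR. If you do the
# same for the real loop body, with its branch on r and the scattered writes
# into J, and no vector instructions show up, the @simd annotation isn't
# buying anything there.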