Question on Julia Blog Post regarding Performance

I was searching around for Julia comparisons to C++ and found this write-up from 2019.

I follow most of it, but one point I was curious about is near the end, where they implement @simd, @fastmath, and @inbounds.

I would have expected to see @inbounds decorating the hot inner loop (I think that is the for Pj in particles line, unless I missed it). Instead, they only decorate it with @simd and then wrap the actual loop calculations in a begin block decorated with @fastmath @inbounds.

It had never occurred to me to wrap such a large chunk of code in a begin block (I only find myself using them in Pluto notebooks), so it might not be unusual at all, just new to me.

Why do that instead of putting the macros on the for loop itself? In fact, why not write @simd @fastmath @inbounds for Pj in particles? Is there a difference, and if so, what is it?
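
To make sure I'm asking about the right thing, here is a stripped-down sketch of the two placements I mean, using a made-up toy function rather than the blog's kernel (Variant B is my guess at the usual spelling; I believe @simd has to be the innermost macro so that it sees the for loop directly):

# Variant A: macro placement as in the blog post
function accum_a!(out, xs)
    @simd for i in eachindex(xs)
        @fastmath @inbounds begin
            out[1] += xs[i] * xs[i]
        end
    end
    return out
end

# Variant B: all annotations on the loop line itself
function accum_b!(out, xs)
    @fastmath @inbounds @simd for i in eachindex(xs)
        out[1] += xs[i] * xs[i]
    end
    return out
end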

The entire code snippet in question, copy/pasted from the above link, is shown below for reference.

"""
    Implementation of @simd, @fastmath, and @inbounds
"""
function P2P_FINAL(particles, g_dgdr, U, J)

    for (i, Pi) in enumerate(particles)
        @simd for Pj in particles

            @fastmath @inbounds begin
                dX1 = Pi.X1 - Pj.X1
                dX2 = Pi.X2 - Pj.X2
                dX3 = Pi.X3 - Pj.X3
                r = sqrt(dX1*dX1 + dX2*dX2 + dX3*dX3)

                if r != 0

                    # g_σ and ∂gσ∂r
                    gsgm, dgsgmdr = g_dgdr(r / Pj.sigma)

                    # K × Γp
                    crss1 = -const4 / r^3 * ( dX2*Pj.Gamma3 - dX3*Pj.Gamma2 )
                    crss2 = -const4 / r^3 * ( dX3*Pj.Gamma1 - dX1*Pj.Gamma3 )
                    crss3 = -const4 / r^3 * ( dX1*Pj.Gamma2 - dX2*Pj.Gamma1 )

                    # U = ∑g_σ(x-xp) * K(x-xp) × Γp
                    U[1, i] += gsgm * crss1
                    U[2, i] += gsgm * crss2
                    U[3, i] += gsgm * crss3

                    # ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp ]
                    # ∂u∂xj(x) += ∑p[(Δxj∂gσ∂r/(σr) − 3Δxjgσ/r^2) K(Δx)×Γp
                    aux = dgsgmdr/(Pj.sigma*r) - 3*gsgm/r^2
                    # j=1
                    J[1, 1, i] += aux * crss1 * dX1
                    J[2, 1, i] += aux * crss2 * dX1
                    J[3, 1, i] += aux * crss3 * dX1
                    # j=2
                    J[1, 2, i] += aux * crss1 * dX2
                    J[2, 2, i] += aux * crss2 * dX2
                    J[3, 2, i] += aux * crss3 * dX2
                    # j=3
                    J[1, 3, i] += aux * crss1 * dX3
                    J[2, 3, i] += aux * crss2 * dX3
                    J[3, 3, i] += aux * crss3 * dX3

                    # Adds the Kronecker delta term
                    aux = -const4 * gsgm / r^3
                    # j=1
                    J[2, 1, i] -= aux * Pj.Gamma3
                    J[3, 1, i] += aux * Pj.Gamma2
                    # j=2
                    J[1, 2, i] += aux * Pj.Gamma3
                    J[3, 2, i] -= aux * Pj.Gamma1
                    # j=3
                    J[1, 3, i] -= aux * Pj.Gamma2
                    J[2, 3, i] += aux * Pj.Gamma1

                end
            end

        end
    end

end

No, it wouldn’t make a difference. Also, I doubt the @simd annotation is doing anything here, given the access pattern in J and the branching on r.
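
One way to convince yourself either way is to inspect the generated code and look for vector instructions. Something along these lines (a rough sketch on a toy loop, not the blog's function):

using InteractiveUtils  # provides @code_llvm

# Toy reduction where @simd usually does enable vectorization
function toy_sum(xs)
    s = 0.0
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    return s
end

@code_llvm debuginfo=:none toy_sum(rand(1000))
# Look for vector types such as <4 x double> in the printed IR. If you do the
# same for the real loop body, with its branch on r and the scattered writes
# into J, and no vector instructions show up, the @simd annotation isn't
# buying anything there.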