Vectorizing an awkward computation: any Julia magic?

I’m computing a matrix of values (specifically, a Wigner D matrix) via recurrence for a given input value R. The nature of this recurrence makes the memory-access patterns pretty ugly, and there’s really no chance for vectorization when computing one such matrix. However, I often have to compute that matrix for dozens to hundreds of values of R, so I’m thinking that I can make it an array, where the first dimension indexes the different R values. Then, through the magic of Julia, it’s vectorized. (Please let me know if I’m wrong, or if there are tricks I need to look out for.)

Now, I already have this working for a single R value, though it is a little ugly, and requires some extra workspace, so I made special structs that are more than just the matrix. So I’m wondering if Julia has some fanciness that would make it easy to just use this code, and get vectorization. For example, StructArrays.jl is pretty fancy but doesn’t look like it would do the job, since inside it would just be a vector of matrices, which would have the wrong order for vectorization, right?
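To illustrate the layout concern (a minimal sketch, with made-up sizes): a vector of matrices keeps each matrix contiguous, but scatters the values for a fixed (m′, m) across separate allocations, whereas a 3D array with the R index first keeps them adjacent in memory.

```julia
ℓ = 2          # made-up degree, just for illustration
N = 4          # number of R values
# Vector of matrices (roughly what StructArrays would wrap): each matrix
# is contiguous, but D_vec[1][m′, m] and D_vec[2][m′, m] live in
# separate allocations, so there is nothing to SIMD across R values.
D_vec = [zeros(2ℓ + 1, 2ℓ + 1) for _ in 1:N]
# 3D array with the R index first: D3[:, m′, m] is a contiguous,
# SIMD-friendly slice across all R values (Julia is column-major).
D3 = zeros(N, 2ℓ + 1, 2ℓ + 1)
@assert strides(D3) == (1, N, N * (2ℓ + 1))
```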

For context, here’s what a late part of this recurrence would look like — compute some coefficients based on the indices, then combine values in this weird stencil pattern:

for m′ in range_m′
    a = √((ℓ-m′)*(ℓ+m′+1))
    b = √((ℓ-m′+1)*(ℓ+m′))
    for m in range_m
        c = √((ℓ-m+1)*(ℓ+m))
        d = √((ℓ-m)*(ℓ+m+1))
        Dˡ[m′+1, m] = (
            b * Dˡ[m′-1, m]
            - c * Dˡ[m′, m-1]
            + d * Dˡ[m′, m+1]
        ) / a
    end
end

Could you clarify what you mean by vectorization?

I assume you want to streamline your code so that it has fewer cache misses and can perform things like SIMD more easily? In that case, providing a small MWE that can actually be run will probably be the best way to get a useful reply (some folks here really like the challenge of speeding up code :slight_smile: ).

But sometimes, vectorization can mean different things (e.g. broadcasting f.(x)), so I’m not sure that’s really what you are after.
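For concreteness, a minimal sketch of the two meanings:

```julia
# Meaning 1 — broadcasting: elementwise application of a function.
x = [0.0, π / 2, π]
y = sin.(x)
@assert y[1] == 0.0

# Meaning 2 — SIMD: the compiler emits vector instructions for a loop.
function sumsq(v)
    s = 0.0
    @inbounds @simd for i in eachindex(v)
        s += v[i]^2
    end
    return s
end
@assert sumsq([1.0, 2.0, 3.0]) == 14.0
```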


I asked ChatGPT for an MWE of your case — it may be wrong, but it gives:

using LinearAlgebra, BenchmarkTools, LoopVectorization

@fastmath function wigner_D_single!(D, l, R, i)
    n = 2l + 1
    @inbounds for m in 1:n
        D[m, l+1, i] = cos(R)^(m-1)
    end
    for mp in 3:n
        x = mp - l - 1
        a = sqrt((l - x) * (l + x + 1))
        b = sqrt((l - x + 1) * (l + x))
        @inbounds for m in 2:(n-1)
            mm = m - l - 1
            c = sqrt((l - mm + 1) * (l + mm))
            d = sqrt((l - mm) * (l + mm + 1))
            # the small epsilon guards against division by zero when a == 0
            D[m, mp, i] = (b * D[m, mp-2, i] - c * D[m-1, mp, i] + d * D[m+1, mp, i]) / (a + 1e-12)
        end
    end
    return D
end
function wigner_D_all!(Dall, l::Int, Rs::AbstractVector{Float64})
    n = 2l + 1
    N = length(Rs)
    @inbounds @simd for i in 1:N
        wigner_D_single!(Dall, l, Rs[i], i)
    end
    return Dall
end

# --- Test ---
l = 200
Rs = collect(range(0, pi; length=200))
n = 2l + 1
N = length(Rs)
Dall = zeros(n, n, N);
@btime wigner_D_all!($Dall, $l, $Rs);

which gives 271.400 ms (0 allocations: 0 bytes). Note that it’s also trivial to parallelize if you want to. However, don’t expect any SIMD on the mp and m loops, since each iteration depends on the one before. It was still able to do some, though:

        vmovq   xmm6, r15
        vmovq   xmm7, rbx
        vpunpcklqdq     xmm6, xmm7, xmm6        # xmm6 = xmm7[0],xmm6[0]
        vpaddq  xmm6, xmm6, xmm5
        vpcmpeqd        xmm7, xmm7, xmm7
        vxorpd  xmm8, xmm8, xmm8
        vgatherqpd      xmm8, xmmword ptr [r10 + 8*xmm6], xmm7
        vunpcklpd       xmm0, xmm0, xmm1        # xmm0 = xmm0[0],xmm1[0]
        vsqrtpd xmm0, xmm0
        vmulpd  xmm0, xmm8, xmm0
        vmovsd  xmm1, qword ptr [r11 + 8*r15 - 8] # xmm1 = mem[0],zero
        vfmsub132sd     xmm1, xmm0, xmm3        # xmm1 = (xmm1 * xmm3) - xmm0
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        vdivsd  xmm0, xmm0, xmm4
        vmovsd  qword ptr [rdi + 8*r15 - 8], xmm0
        lea     rbx, [rax + r15]
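On the “trivial to parallelize” point: each R value’s matrix is independent, so the outer loop threads directly. A hedged sketch (the `work!` argument is a hypothetical stand-in for `wigner_D_single!`; start Julia with `julia -t auto` to get multiple threads):

```julia
# Fill each slice Dall[:, :, i] independently, one task per chunk of i's.
function fill_all!(Dall, Rs, work!)
    Threads.@threads for i in eachindex(Rs)
        work!(view(Dall, :, :, i), Rs[i])
    end
    return Dall
end

# Usage with a trivial stand-in kernel:
Dall = zeros(3, 3, 4)
fill_all!(Dall, [0.1, 0.2, 0.3, 0.4], (D, R) -> (D .= R))
@assert all(Dall[:, :, 2] .== 0.2)
```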

Try @turbo on the nested loops to use LoopVectorization.jl.

Isn’t @turbo supposed to be used for loops with no dependency between iterations, like @simd? Also, it doesn’t make it better.

Good points, all around. The m′ loop can’t be reordered, and the m loop in my example could be, but it’s actually triangular — its limits depend on m′. So I guess I can’t use @turbo there. The solution seems to be to just rewrite my code to add another dimension as the first axis in D, include that as a new inner loop, and just apply @turbo to that.

I’ve come up with a more complete example that shows @turbo gets me a 1.5-2.5x speedup:

using Chairmarks
using LoopVectorization

function recurrence!(D, N, ℓ)
    @inbounds for m′ in ℓ:2ℓ
        a = √abs((ℓ-m′)*(ℓ+m′+1))
        b = √abs((ℓ-m′+1)*(ℓ+m′))
        for m in m′:2ℓ
            c = √abs((ℓ-m+1)*(ℓ+m))
            d = √abs((ℓ-m)*(ℓ+m+1))
            @turbo for i in 1:N
                 D[i, m′, m] = (
                    b * D[i, m′-1, m]
                    - c * D[i, m′, m-1]
                    + d * D[i, m′, m+1]
                ) / a
            end
        end
    end
    return D
end

ℓ = 8
N = (ℓ+1)^2
D = randn(Float64, N, 2ℓ+1, 2ℓ+1)
@b recurrence!(D, N, ℓ)

That’s certainly enough reason to do that manual recoding.


Don’t forget to switch the last two dims of D if you can, for better memory locality. Also, writing it this way means no parallelism afterwards, so make sure that’s ok for you.
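A minimal sketch of that dimension swap (in practice you would allocate D in the swapped layout from the start, rather than permute after the fact; the sizes are just those from the example above):

```julia
ℓ, N = 8, 81                       # sizes from the example above
D = randn(N, 2ℓ + 1, 2ℓ + 1)       # (i, m′, m) layout
# In the inner m loop, D[i, m′, m-1] and D[i, m′, m+1] are N*(2ℓ+1)
# elements apart in memory (Julia is column-major). With the last two
# dims swapped, the m-neighbours are only N elements apart:
Dswapped = permutedims(D, (1, 3, 2))  # (i, m, m′) layout
@assert Dswapped[3, 5, 2] == D[3, 2, 5]
@assert strides(Dswapped) == (1, N, N * (2ℓ + 1))
```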