Vectorizing an awkward computation: any Julia magic?

I’m computing a matrix of values (specifically, a Wigner D matrix) via recurrence for a given input value R. The nature of this recurrence makes the memory-access patterns pretty ugly, and there’s really no chance for vectorization when computing one such matrix. However, I often have to compute that matrix for dozens to hundreds of values of R, so I’m thinking that I can make it an array, where the first dimension indexes the different R values. Then, through the magic of Julia, it’s vectorized. (Please let me know if I’m wrong, or if there are tricks I need to look out for.)

Now, I already have this working for a single R value, though it is a little ugly, and requires some extra workspace, so I made special structs that are more than just the matrix. So I’m wondering if Julia has some fanciness that would make it easy to just use this code, and get vectorization. For example, StructArrays.jl is pretty fancy but doesn’t look like it would do the job, since inside it would just be a vector of matrices, which would have the wrong order for vectorization, right?
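To illustrate the layout concern (a minimal sketch, with made-up sizes): a vector of matrices keeps each matrix contiguous, but scatters the values for a fixed (m′, m) across separate allocations, whereas a 3D array with the R index first keeps them adjacent in memory.

```julia
ℓ = 2          # made-up degree, just for illustration
N = 4          # number of R values
# Vector of matrices (roughly what StructArrays would wrap): each matrix
# is contiguous, but D_vec[1][m′, m] and D_vec[2][m′, m] live in
# separate allocations, so there is nothing to SIMD across R values.
D_vec = [zeros(2ℓ + 1, 2ℓ + 1) for _ in 1:N]
# 3D array with the R index first: D3[:, m′, m] is a contiguous,
# SIMD-friendly slice across all R values (Julia is column-major).
D3 = zeros(N, 2ℓ + 1, 2ℓ + 1)
@assert strides(D3) == (1, N, N * (2ℓ + 1))
```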

For context, here’s what a late part of this recurrence would look like — compute some coefficients based on the indices, then combine values in this weird stencil pattern:

for m′ in range_m′
    a = √((ℓ-m′)*(ℓ+m′+1))
    b = √((ℓ-m′+1)*(ℓ+m′))
    for m in range_m
        c = √((ℓ-m+1)*(ℓ+m))
        d = √((ℓ-m)*(ℓ+m+1))
        Dˡ[m′+1, m] = (
            b * Dˡ[m′-1, m]
            - c * Dˡ[m′, m-1]
            + d * Dˡ[m′, m+1]
        ) / a
    end
end

Could you clarify what you mean by vectorization?

I assume you want to streamline your code so that it has fewer cache misses and can perform things like SIMD more easily? In that case, providing a small MWE that can actually be run will probably be the best way to get a useful reply (some folks here really like the challenge of speeding up code :slight_smile: ).

But sometimes, vectorization can mean different things (e.g. broadcasting f.(x)), so I’m not sure that’s really what you are after.
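For concreteness, a minimal sketch of the two meanings:

```julia
# Meaning 1 — broadcasting: elementwise application of a function.
x = [0.0, π / 2, π]
y = sin.(x)
@assert y[1] == 0.0

# Meaning 2 — SIMD: the compiler emits vector instructions for a loop.
function sumsq(v)
    s = 0.0
    @inbounds @simd for i in eachindex(v)
        s += v[i]^2
    end
    return s
end
@assert sumsq([1.0, 2.0, 3.0]) == 14.0
```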


I asked ChatGPT for an MWE of your case — it may be wrong, but it gives:

using LinearAlgebra, BenchmarkTools, LoopVectorization

@fastmath function wigner_D_single!(D, l, R, i)
    n = 2l + 1
    @inbounds for m in 1:n
        D[m, l+1, i] = cos(R)^(m-1)
    end
    for mp in 3:n
        x = mp - l - 1
        a = sqrt((l - x) * (l + x + 1))
        b = sqrt((l - x + 1) * (l + x))
        @inbounds for m in 2:(n-1)
            mm = m - l - 1
            c = sqrt((l - mm + 1) * (l + mm))
            d = sqrt((l - mm) * (l + mm + 1))
            # the small epsilon guards against division by zero when a == 0
            D[m, mp, i] = (b * D[m, mp-2, i] - c * D[m-1, mp, i] + d * D[m+1, mp, i]) / (a + 1e-12)
        end
    end
    return D
end
function wigner_D_all!(Dall, l::Int, Rs::AbstractVector{Float64})
    n = 2l + 1
    N = length(Rs)
    @inbounds @simd for i in 1:N
        wigner_D_single!(Dall, l, Rs[i], i)
    end
    return Dall
end

# --- Test ---
l = 200
Rs = collect(range(0, pi; length=200))
n = 2l + 1
N = length(Rs)
Dall = zeros(n, n, N);
@btime wigner_D_all!($Dall, $l, $Rs);

which gives 271.400 ms (0 allocations: 0 bytes). Note that it’s also trivial to parallelize if you want to. However, don’t expect any SIMD on the mp and m loops, since each iteration depends on the one before. It was still able to do some, though:

        vmovq   xmm6, r15
        vmovq   xmm7, rbx
        vpunpcklqdq     xmm6, xmm7, xmm6        # xmm6 = xmm7[0],xmm6[0]
        vpaddq  xmm6, xmm6, xmm5
        vpcmpeqd        xmm7, xmm7, xmm7
        vxorpd  xmm8, xmm8, xmm8
        vgatherqpd      xmm8, xmmword ptr [r10 + 8*xmm6], xmm7
        vunpcklpd       xmm0, xmm0, xmm1        # xmm0 = xmm0[0],xmm1[0]
        vsqrtpd xmm0, xmm0
        vmulpd  xmm0, xmm8, xmm0
        vmovsd  xmm1, qword ptr [r11 + 8*r15 - 8] # xmm1 = mem[0],zero
        vfmsub132sd     xmm1, xmm0, xmm3        # xmm1 = (xmm1 * xmm3) - xmm0
        vpermilpd       xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        vdivsd  xmm0, xmm0, xmm4
        vmovsd  qword ptr [rdi + 8*r15 - 8], xmm0
        lea     rbx, [rax + r15]
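On the “trivial to parallelize” point: each R value’s matrix is independent, so the outer loop threads directly. A hedged sketch (the `work!` argument is a hypothetical stand-in for `wigner_D_single!`; start Julia with `julia -t auto` to get multiple threads):

```julia
# Fill each slice Dall[:, :, i] independently, one task per chunk of i's.
function fill_all!(Dall, Rs, work!)
    Threads.@threads for i in eachindex(Rs)
        work!(view(Dall, :, :, i), Rs[i])
    end
    return Dall
end

# Usage with a trivial stand-in kernel:
Dall = zeros(3, 3, 4)
fill_all!(Dall, [0.1, 0.2, 0.3, 0.4], (D, R) -> (D .= R))
@assert all(Dall[:, :, 2] .== 0.2)
```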

Try @turbo on the nested loops to use LoopVectorization.jl.

Isn’t @turbo supposed to be used for loops with no dependency between iterations, like @simd? Also, it doesn’t make it better.

Good points, all around. The m′ loop can’t be reordered, and the m loop in my example could be, but it’s actually triangular — its limits depend on m′. So I guess I can’t use @turbo there. The solution seems to be to just rewrite my code to add another dimension as the first axis in D, include that as a new inner loop, and just apply @turbo to that.

I’ve come up with a more complete example that shows @turbo gets me a 1.5-2.5x speedup:

using Chairmarks
using LoopVectorization

function recurrence!(D, N, ℓ)
    @inbounds for m′ in ℓ:2ℓ
        a = √abs((ℓ-m′)*(ℓ+m′+1))
        b = √abs((ℓ-m′+1)*(ℓ+m′))
        for m in m′:2ℓ
            c = √abs((ℓ-m+1)*(ℓ+m))
            d = √abs((ℓ-m)*(ℓ+m+1))
            @turbo for i in 1:N
                 D[i, m′, m] = (
                    b * D[i, m′-1, m]
                    - c * D[i, m′, m-1]
                    + d * D[i, m′, m+1]
                ) / a
            end
        end
    end
    return D
end

ℓ = 8
N = (ℓ+1)^2
D = randn(Float64, N, 2ℓ+1, 2ℓ+1)
@b recurrence!(D, N, ℓ)

That’s certainly enough reason to do that manual recoding.


Don’t forget to switch the last two dims of D if you can, for better memory locality. Also, writing it this way means no parallelism afterwards, so make sure that’s ok for you.
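A minimal sketch of that dimension swap (in practice you would allocate D in the swapped layout from the start, rather than permute after the fact; the sizes are just those from the example above):

```julia
ℓ, N = 8, 81                       # sizes from the example above
D = randn(N, 2ℓ + 1, 2ℓ + 1)       # (i, m′, m) layout
# In the inner m loop, D[i, m′, m-1] and D[i, m′, m+1] are N*(2ℓ+1)
# elements apart in memory (Julia is column-major). With the last two
# dims swapped, the m-neighbours are only N elements apart:
Dswapped = permutedims(D, (1, 3, 2))  # (i, m, m′) layout
@assert Dswapped[3, 5, 2] == D[3, 2, 5]
@assert strides(Dswapped) == (1, N, N * (2ℓ + 1))
```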