Code optimization using SIMD or LoopVectorization

Dear Julia experts,

I have implemented the probability density function of the Pearson-IV distribution. This implementation is based on the R package “PearsonDS”. Here is the code:

import SpecialFunctions: loggamma
using SIMD

const M_LN_SQRT_PI = 0.5log(pi)

@inline C_logPearsonIVnorm0(m, nu) =
    (-M_LN_SQRT_PI - loggamma(m) - loggamma(m - 0.5) + 2real(loggamma(m + 0.5nu*im)))
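# With t = (x - location) / scale, the log-density evaluated below is
#   log f(x) = -nu*atan(t) - m*log(1 + t^2) + C_logPearsonIVnorm0(m, nu) - log(scale)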

function dpearsonIV_simd(
    x::Vector{Float64}; 
    params=[],
    log_p=false
    )::Vector{Float64}
    n = length(x)
    (m, nu, location, scale) = params
    @assert (scale > 0) && (m > 0.5)
    @assert n % 4 == 0  # the SIMD loop below processes exactly 4 elements per iteration
    inv_scale = 1/scale
    ret = similar(x)
    t = Vec{4,Float64}((0,0,0,0))
    s = Vec{4,Float64}((0,0,0,0))
    k = C_logPearsonIVnorm0(m, nu) - log(scale)
    @inbounds for i = 1:4:n
        t = vload(Vec{4,Float64}, x, i)
        @fastmath t = inv_scale * (t - location)
        @fastmath s = Vec{4,Float64}((atan(t[1]), atan(t[2]), atan(t[3]), atan(t[4])))
        # @fastmath s = (-nu) * s
        # @fastmath t = t*t+1
        # @fastmath t = log(t)
        @fastmath t = (-nu) * s + (-m) * log(t*t+1) + k
        if !log_p
            @fastmath t = exp(t)
        end
        vstore(t, ret, i)
    end
    return ret
end

I have benchmarked this code: the function takes about 31 microseconds for a 1000-element input array x, while the essential computations, namely the multiplications and atan, take approximately 8 microseconds.
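For reference, the benchmark was along these lines, using BenchmarkTools (the parameter values here are only illustrative, not the ones from my fit):

using BenchmarkTools
x = randn(1000)                                    # 1000-element test input
@btime dpearsonIV_simd($x; params = (2.0, 1.0, 0.0, 1.0))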

I would like suggestions for improving this code further, since it will be called frequently inside some very complicated fitting algorithms. Thanks!

The main suggestion I have is that you are likely approaching this at too low a level. This looks like a good place for LoopVectorization. With that, the function would simplify to

using LoopVectorization
function dpearsonIV_simd(
    x::Vector{Float64}; 
    params,
    log_p=false)
    n = length(x)
    (m, nu, location, scale) = params
    @assert ((scale > 0) && (m > 0.5))
    inv_scale = 1/scale
    k = C_logPearsonIVnorm0(m, nu) - log(scale)
    ret = @turbo @. inv_scale * (x - location)
    @turbo @. ret = (-nu) * atan(ret) + (-m) * log(ret*ret + 1) + k
    if !log_p
        @turbo @. ret = exp(ret)
    end
    return ret
end
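(A nice side effect of writing it with @turbo broadcasts instead of hand-rolled SIMD.jl: LoopVectorization generates the remainder iterations for you, so length(x) no longer has to be a multiple of 4.)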

Oh this is great! dpearsonIV_lv is your version with LoopVectorization.jl. It is indeed helpful, cutting the execution time by more than half. I have monitored the CPU usage and confirmed that the main job is done on one core.

Thank you @Oscar_Smith !

Oh yeah, this is probably slow enough that you might want to multithread it by using @tturbo instead of @turbo

That would be even nicer. I have checked the LoopVectorization.jl source code, but I don’t have a sharp enough eye to pinpoint where @turbo, or rather turbo_macro(), does the real work that improves performance in my case. But I wish to learn that. Can you give me some hints?

It’s just

function dpearsonIV_simd(
    x::Vector{Float64}; 
    params,
    log_p=false)
    n = length(x)
    (m, nu, location, scale) = params
    @assert ((scale > 0) && (m > 0.5))
    inv_scale = 1/scale
    k = C_logPearsonIVnorm0(m, nu) - log(scale)
    ret = @tturbo @. inv_scale * (x - location)
    @tturbo @. ret = (-nu) * atan(ret) + (-m) * log(ret*ret + 1) + k
    if !log_p
        @tturbo @. ret = exp(ret)
    end
    return ret
end

There’s not really a specific “the place”.
But, compared to SIMD.jl, it uses SLEEFPirates.jl for faster log and atan than the standard library versions.

Using VectorizationBase.jl instead of SIMD.jl might give you similar performance, but LoopVectorization.jl will try and do more clever things (that may or may not pay off).
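If you want a rough feel for the SLEEFPirates difference in isolation, you can time the scalar versions side by side (a sketch; the Ref trick blocks constant propagation, and the numbers will vary by machine):

using SLEEFPirates, BenchmarkTools
@btime atan($(Ref(0.5))[])                     # Base atan
@btime SLEEFPirates.atan_fast($(Ref(0.5))[])   # SLEEF version, trades a little accuracy for speed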

@Oscar, might it be worth fusing the loops when using @tturbo? I haven’t benchmarked or checked the assembly, but the tradeoff is probably register spills vs threading overhead.
Register spill cost will obviously be different for 16 vs 32 architectural/named registers.
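To make “fusing” concrete, a sketch of the single-pass version (untested; it reuses C_logPearsonIVnorm0 from above and omits the log_p branch for brevity):

using LoopVectorization

function dpearsonIV_fused(x::Vector{Float64}; params)
    (m, nu, location, scale) = params
    @assert (scale > 0) && (m > 0.5)
    inv_scale = 1 / scale
    k = C_logPearsonIVnorm0(m, nu) - log(scale)
    ret = similar(x)
    @tturbo for i in eachindex(x)
        t = inv_scale * (x[i] - location)              # standardized coordinate
        ret[i] = (-nu) * atan(t) - m * log(t*t + 1) + k
    end
    return ret
end

This runs one threaded loop instead of two or three broadcast passes over the array.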


A simple replacement of atan by SLEEFPirates.atan_fast gives me an improvement of ~1.5x:


(To get the above benchmark result, I just replaced atan with atan_fast in my original implementation of dpearsonIV_simd(); all other settings remained the same.)
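In code, the change was just the atan line inside the loop, something like this (everything else in dpearsonIV_simd() as above; atan_fast is applied to the scalar lanes):

using SLEEFPirates: atan_fast
@fastmath s = Vec{4,Float64}((atan_fast(t[1]), atan_fast(t[2]), atan_fast(t[3]), atan_fast(t[4])))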