Vectorization of multivariate normal PDF

I would like to compute the PDF of a multivariate normal distribution of given parameters for a number of input points. (eg MvNorm for several vectors at once) This is for a particle filter, many such computations going on, so I would like to squeeze all performance I can from my machine.

The essential part of the calculation is f(d, x)=d.C * exp(quad(d.A, x - d.M)). x might go from 2 to 12 dimensions. I’m hoping we can get a good speedup by using SIMD and Float32, there’s not only the linear arithmetics part, but maybe exp itself could be vectorized.

My questions:

  • Is there anything I should try before writing a custom SIMD function using SIMD.jl? It seems I haven’t gotten any vectorization otherwise.
  • Specifically regarding the exponential, can it be vectorized in principle? I tried using exp with SIMD.Vec and I don’t know what this result means. Is it merely calling exp for each vector element serially?
julia> using SIMD

julia> vv = SIMD.Vec(randn(Float32,8)...)
<8 x Float32>[-1.2161875, -1.6569077, -1.0562038, -2.9222453, 0.28052512, 0.99889153, 0.88955605, 0.116273805]

julia> @code_llvm exp(vv)
;  @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp`
define void @julia_exp_564([1 x <8 x float>]* noalias nocapture noundef nonnull sret([1 x <8 x float>]) align 32 dereferenceable(32) %0, [1 x <8 x float>]* nocapture noundef nonnull readonly align 16 dereferenceable(32) %1) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %2 = getelementptr inbounds [1 x <8 x float>], [1 x <8 x float>]* %1, i64 0, i64 0
; └
;  @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp` @ /home/user/.julia/packages/SIMD/xlBiA/src/LLVM_intrinsics.jl:127
; ┌ @ /home/user/.julia/packages/SIMD/xlBiA/src/LLVM_intrinsics.jl:131 within `macro expansion`
   %3 = load <8 x float>, <8 x float>* %2, align 16
   %4 = call <8 x float> @llvm.exp.v8f32(<8 x float> %3)
; └
;  @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp`
  %5 = getelementptr inbounds [1 x <8 x float>], [1 x <8 x float>]* %0, i64 0, i64 0
  store <8 x float> %4, <8 x float>* %5, align 32
  ret void
}

Have you tried representing vectors using StaticArrays.jl?