I would like to compute the PDF of a multivariate normal distribution of given parameters for a number of input points. (eg MvNorm for several vectors at once) This is for a particle filter, many such computations going on, so I would like to squeeze all performance I can from my machine.
The essential part of the calculation is f(d, x)=d.C * exp(quad(d.A, x - d.M))
. x
might go from 2 to 12 dimensions. I’m hoping we can get a good speedup by using SIMD and Float32, there’s not only the linear arithmetics part, but maybe exp
itself could be vectorized.
My questions:
- Is there anything I should try before writing a custom SIMD function using SIMD.jl? It seems I haven’t gotten any vectorization otherwise.
- Specifically regarding the exponential, can it be vectorized in principle? I tried using
exp
withSIMD.Vec
and I don’t know what this result means. Is it merely callingexp
for each vector element serially?
julia> using SIMD
julia> vv = SIMD.Vec(randn(Float32,8)...)
<8 x Float32>[-1.2161875, -1.6569077, -1.0562038, -2.9222453, 0.28052512, 0.99889153, 0.88955605, 0.116273805]
julia> @code_llvm exp(vv)
; @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp`
define void @julia_exp_564([1 x <8 x float>]* noalias nocapture noundef nonnull sret([1 x <8 x float>]) align 32 dereferenceable(32) %0, [1 x <8 x float>]* nocapture noundef nonnull readonly align 16 dereferenceable(32) %1) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
%2 = getelementptr inbounds [1 x <8 x float>], [1 x <8 x float>]* %1, i64 0, i64 0
; └
; @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp` @ /home/user/.julia/packages/SIMD/xlBiA/src/LLVM_intrinsics.jl:127
; ┌ @ /home/user/.julia/packages/SIMD/xlBiA/src/LLVM_intrinsics.jl:131 within `macro expansion`
%3 = load <8 x float>, <8 x float>* %2, align 16
%4 = call <8 x float> @llvm.exp.v8f32(<8 x float> %3)
; └
; @ /home/user/.julia/packages/SIMD/xlBiA/src/simdvec.jl:156 within `exp`
%5 = getelementptr inbounds [1 x <8 x float>], [1 x <8 x float>]* %0, i64 0, i64 0
store <8 x float> %4, <8 x float>* %5, align 32
ret void
}