I often encounter problems like the following where I want to vectorize a short numerical algorithm but keep the code as high level as possible so I can change the precision, filter length, or other parameters. This function returns a function that takes the dot product with a vector where each element is the evaluation of a polynomial. The loop and polynomial is fully unrolled so there are no unknown lengths when the final function is compiled.
#https://stackoverflow.com/questions/28077057/julia-evalpoly-macro-with-varargs
function get_farrow_dot(fcoeff::Array{Float64,2})
fc = fcoeff[:,end:-1:1]
flen = size(fc,1)
ex = :0
for i = 1:flen
ex = :(muladd(s[$flen-$i+1],@evalpoly(α,$(fc[i,:]...)),$ex))
end
eval(:(function (α::Float64,s::Vector{Float64}) @fastmath @inbounds return $ex end))
end
I couldn’t get Julia/LLVM to vectorize. There are a few potential issues. If the loop is too small, it won’t be vectorized. Also, the polynomial coefficients are expanded into a vararg tuple of constants; so, I thought maybe LLVM wasn’t seeing the simple matrix structure. However, I came up with the following simplified function which does not vectorize either.
function test(α::Float64,s::Vector{Float64},fc::Array{Float64,2})
tot=0.0
@simd for i=1:200
@fastmath @inbounds tot+=muladd(s[i],@evalpoly(α,fc[i,1],fc[i,2],fc[i,3],fc[i,4],fc[i,5],fc[i,6],fc[i,7],fc[i,8]),tot)
end
tot
end
A straightforward vectorization without worrying about instruction latency or cache demonstrates this can be vectorized. In fact, GCC does an excellent job without any optimizations from me.
julia> t=randn(48);
julia> fdot=get_farrow_dot(farrow_coeff);
julia> @btime fdot(-.11,$t)
166.033 ns (1 allocation: 16 bytes)
1.3992278257126003
julia> @btime fdot_c(-.11,$t,$(farrow_coeff[end:-1:1,:]))
51.351 ns (0 allocations: 0 bytes)
1.3992278257125998
julia> @btime fdot_copt(-.11,$t,$(farrow_coeff[end:-1:1,:]))
26.808 ns (0 allocations: 0 bytes)
1.3992278257125998
At this point, I’m stuck. I’m not sure if there is a better way to express this so that it yields something like my manually optimized version, or if it’s a limitation of Julia or LLVM. It would be a big deal if I could get this to work. It takes a lot of time even to vectorize something simple like this.