Unroll and Vectorize EvalPoly

goldfita · May 21, 2017, 3:09pm

I often encounter problems like the following where I want to vectorize a short numerical algorithm but keep the code as high level as possible so I can change the precision, filter length, or other parameters. This function returns a function that takes the dot product with a vector where each element is the evaluation of a polynomial. The loop and polynomial is fully unrolled so there are no unknown lengths when the final function is compiled.

#https://stackoverflow.com/questions/28077057/julia-evalpoly-macro-with-varargs
function get_farrow_dot(fcoeff::Array{Float64,2})
    fc = fcoeff[:,end:-1:1]
    flen = size(fc,1)
    ex = :0
    for i = 1:flen
        ex = :(muladd(s[$flen-$i+1],@evalpoly(α,$(fc[i,:]...)),$ex))
    end
    eval(:(function (α::Float64,s::Vector{Float64}) @fastmath @inbounds return $ex end))
end

I couldn’t get Julia/LLVM to vectorize. There are a few potential issues. If the loop is too small, it won’t be vectorized. Also, the polynomial coefficients are expanded into a vararg tuple of constants; so, I thought maybe LLVM wasn’t seeing the simple matrix structure. However, I came up with the following simplified function which does not vectorize either.

function test(α::Float64,s::Vector{Float64},fc::Array{Float64,2})
    tot=0.0
   @simd for i=1:200
         @fastmath @inbounds  tot+=muladd(s[i],@evalpoly(α,fc[i,1],fc[i,2],fc[i,3],fc[i,4],fc[i,5],fc[i,6],fc[i,7],fc[i,8]),tot)
    end
    tot
end

A straightforward vectorization without worrying about instruction latency or cache demonstrates this can be vectorized. In fact, GCC does an excellent job without any optimizations from me.

julia> t=randn(48);
julia> fdot=get_farrow_dot(farrow_coeff);
julia> @btime fdot(-.11,$t)
  166.033 ns (1 allocation: 16 bytes)
1.3992278257126003
julia> @btime fdot_c(-.11,$t,$(farrow_coeff[end:-1:1,:]))
  51.351 ns (0 allocations: 0 bytes)
1.3992278257125998
julia> @btime fdot_copt(-.11,$t,$(farrow_coeff[end:-1:1,:]))
  26.808 ns (0 allocations: 0 bytes)
1.3992278257125998

At this point, I’m stuck. I’m not sure if there is a better way to express this so that it yields something like my manually optimized version, or if it’s a limitation of Julia or LLVM. It would be a big deal if I could get this to work. It takes a lot of time even to vectorize something simple like this.

yuyichao · May 21, 2017, 3:45pm

LLVM vectorization doesn’t seem to handle muladd well. Replace it with * and + and it’ll vectorize. (Also note that in the julia version you are doubling tot everytime)

goldfita · May 21, 2017, 6:44pm

The code is nearly identical with muladd replaced by a*b+c in test. I believe it’s using the bottom 64 bits of the XMM registers.

        leaq    dlatps64_(%r10,%rcx,8), %rdx
        vmovsd  (%r10,%rcx,8), %xmm1    # xmm1 = mem[0],zero
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
        addq    %rax, %rdx
        vfmadd213sd     (%rax,%rdx), %xmm0, %xmm1
Source line: 4
        vmulsd  (%r9,%rcx,8), %xmm1, %xmm1
        vmovq   %r8, %xmm2
        vaddsd  %xmm2, %xmm1, %xmm1
        vmovq   %xmm1, %r8

yuyichao · May 21, 2017, 7:19pm

Because you also need to fix

i.e. you are doing tot += muladd(..., ..., tot). This appears to affect the vectorization decision.

goldfita · May 21, 2017, 7:23pm

I had removed the extra +. It still didn’t vectorize. I think the vector is also backwards. I hadn’t meant for test to be functionally correct – just a simplified version to experiment on.

yuyichao · May 21, 2017, 8:24pm

The following works for me at least…

julia> function test(α::Float64,s::Vector{Float64},fc::Array{Float64,2})
           tot=0.0
           @simd for i=1:200
                @fastmath @inbounds tot += s[i] * @evalpoly(α,fc[i,1],fc[i,2],fc[i,3],fc[i,4],fc[i,5],fc[i,6],fc[i,7],fc[i,8])
           end
           tot
       end
test (generic function with 1 method)

julia> @code_native test(1.0, Float64[], Matrix{Float64}(0, 0))
        .text
Filename: REPL[1]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    24(%rsi), %rax
        movq    (%rdi), %rcx
Source line: 71
        vbroadcastsd    %xmm0, %ymm0
        imulq   $56, %rax, %rdx
        addq    (%rsi), %rdx
        shlq    $3, %rax
        negq    %rax
        vxorpd  %ymm1, %ymm1, %ymm1
        xorl    %esi, %esi
        vxorpd  %ymm2, %ymm2, %ymm2
        nopl    (%rax,%rax)
Source line: 129
L48:
        leaq    (%rdx,%rsi,8), %rdi
        vmovupd (%rdx,%rsi,8), %ymm3
        vmovupd 32(%rdx,%rsi,8), %ymm4
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
        leaq    (%rdi,%rax), %rdi
        vfmadd213pd     (%rax,%rdi), %ymm0, %ymm3
        vfmadd213pd     32(%rax,%rdi), %ymm0, %ymm4
Source line: 4
        vmulpd  (%rcx,%rsi,8), %ymm3, %ymm3
        vmulpd  32(%rcx,%rsi,8), %ymm4, %ymm4
        vaddpd  %ymm1, %ymm3, %ymm1
        vaddpd  %ymm2, %ymm4, %ymm2
Source line: 74
        addq    $8, %rsi
        cmpq    $200, %rsi
        jne     L48
Source line: 4
        vaddpd  %ymm1, %ymm2, %ymm0
        vextractf128    $1, %ymm0, %xmm1
        vaddpd  %ymm1, %ymm0, %ymm0
        vhaddpd %ymm0, %ymm0, %ymm0
Source line: 6
        popq    %rbp
        vzeroupper
        retq

goldfita · May 21, 2017, 11:27pm

Maybe my LLVM is too old?

Julia Version 0.6.0-rc2.0
Commit 68e911be53 (2017-05-18 02:31 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

yuyichao · May 22, 2017, 12:38am

This is LLVM 4.0 on Linux.

Julia Version 0.7.0-DEV.176
Commit 50c2a378ba* (2017-05-15 00:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-4.0.0 (ORCJIT, skylake)

cstjean · May 22, 2017, 3:02am

I don’t know if it’ll help, but FYI there’s a package for fixed-length arrays with automatically-unrolled operations.

goldfita · May 22, 2017, 3:43am

Thanks for the tip. This will be helpful for other things I’m working on. I can wait for the LLVM version to be bumped up to 4.0 in a future release. It’s also possible to select the version as part of the build, but I’m not sure if that will lead to an unstable Julia.