Loop unrolling, type param to macro, generated functions

question

#1

yeah, i don’t seem to know exactly what my question is, so below you will find three.

i have a type Mod{B, A} which just encapsulates a UInt64 array.

and a naive function on it:

# number of limbs: ceil(b / 31)
@inline limbs(b) = cld(b, 31)

function mul!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
  @inbounds begin
    # clear the result
    for i in 1:limbs(B)
      r.n[i] = zero(UInt64)
    end
    # cyclic convolution; terms with j > i are scaled by A
    for i in 1:limbs(B), j in 1:limbs(B)
      r.n[i] += a.n[mod(i-j, limbs(B)) + 1] * b.n[j] * (j > i ? UInt64(A) : UInt64(1))
    end
  end
  r
end

the idea here is that every loop bound is known at compile time, so the loops would get unrolled and the mod and the conditional would not even make it into the native code. however, it seems that i exceeded some unroll threshold, and only the inner loop gets unrolled. the mod hurts especially badly. i tried using an optimized mod, which helps, but unrolling would be even better. so question #1: is there a way to force unrolling?

i tried to use @unroll as in Unroll.jl, or more precisely a customized variant of it. however, it appears that macros are expanded before B is known, so the macro is passed just the symbol B, not its value. dead end? question #2: do i give up on macros for this?
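
The symbol-versus-value behaviour described above can be checked with a small sketch. `@argkind`, `from_param` and `from_literal` are hypothetical names made up for this illustration, and the code uses current Julia syntax.

```julia
# Hypothetical macro: reports what it receives at expansion time.
macro argkind(x)
    return x isa Symbol ? "symbol $x" : "literal $x"
end

# B is a type parameter, so the macro is handed only the name B.
from_param(::Val{B}) where {B} = @argkind(B)

# Here the macro is handed the integer itself.
from_literal() = @argkind(31)
```

Calling `from_param(Val(31))` reports a symbol even though B is 31 by the time the method runs, which is why an unrolling macro cannot use B as a loop bound.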

i wound up writing a generated function. it produces super optimized code, and the speed improved significantly. however, i seem to recall that generated functions have some problems, maybe with precompilation, but some cursory googling did not help. so question #3: are there any drawbacks to generated functions?

below is my creation. it is a little bit write-only, but blazing fast.

@generated function mul!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
  li = limbs(B)   # known at generation time, since B is a type parameter
  quote
    @inbounds begin
      $([quote
          t = UInt64(0)
          # terms with j > i, scaled by A once
          $([ :( t += a.n[$(mod(i-j, li)+1)] * b.n[$j] ) for j = i+1:li]...)
          t *= $(UInt64(A))
          # remaining terms (j <= i)
          $([ :( t += a.n[$(mod(i-j, li)+1)] * b.n[$j] ) for j = 1:i]...)
          r.n[$i] = t
        end for i = 1:li ]...)
    end
    r
  end
end
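
As a sanity check, the loop version and the generated version can be compared on a small input. The sketch below is self-contained and uses current Julia syntax. The `Mod` struct definition is an assumption (the post only says it wraps a UInt64 array, with the field name `n` taken from the code above), and `mul_loop!` and `mul_gen!` are hypothetical names chosen so both variants can coexist.

```julia
# Assumed definition of the wrapper type; only the field name `n` comes from the post.
struct Mod{B, A}
    n::Vector{UInt64}
end

limbs(b) = cld(b, 31)

# Loop version, same arithmetic as the naive mul! above.
function mul_loop!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
    L = limbs(B)
    fill!(r.n, zero(UInt64))
    for i in 1:L, j in 1:L
        r.n[i] += a.n[mod(i - j, L) + 1] * b.n[j] * (j > i ? UInt64(A) : UInt64(1))
    end
    return r
end

# Generated version: one straight-line block per output limb.
@generated function mul_gen!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
    L = limbs(B)
    blocks = Expr[]
    for i in 1:L
        # index arithmetic is done here, at generation time
        wrapped = [:(t += a.n[$(mod(i - j, L) + 1)] * b.n[$j]) for j in (i + 1):L]
        direct  = [:(t += a.n[$(mod(i - j, L) + 1)] * b.n[$j]) for j in 1:i]
        push!(blocks, quote
            t = zero(UInt64)
            $(wrapped...)
            t *= $(UInt64(A))
            $(direct...)
            r.n[$i] = t
        end)
    end
    return quote
        $(blocks...)
        r
    end
end

# B = 93 gives three limbs; A = 5 is an arbitrary multiplier for the check.
a = Mod{93, 5}(UInt64[1, 2, 3])
b = Mod{93, 5}(UInt64[4, 5, 6])
r_loop = mul_loop!(Mod{93, 5}(zeros(UInt64, 3)), a, b)
r_gen  = mul_gen!(Mod{93, 5}(zeros(UInt64, 3)), a, b)
```

Both results should hold the same limbs. Note that `limbs` has to be defined before the generated function, because the generator calls it while building the expression.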

#2

ntuple with a Val argument will be completely unrolled at compile time. See Jeff’s trick with the circularshift! function at: https://github.com/stevengj/18S096-iap17/blob/master/lecture3/Types%20and%20Dispatch.ipynb
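
A minimal sketch of the idea, not the notebook's actual code: `rotateleft!` is a hypothetical helper, written in current Julia syntax where the length is passed as `Val(N)`.

```julia
# Hypothetical helper: rotate `v` left by N places, with N carried in the type.
function rotateleft!(v::AbstractVector, ::Val{N}) where {N}
    # With a Val length, the tuple size is part of the type,
    # so the compiler can build the tuple without a runtime loop.
    head = ntuple(i -> v[i], Val(N))
    n = length(v)
    for i in 1:(n - N)
        v[i] = v[i + N]
    end
    for i in 1:N
        v[n - N + i] = head[i]
    end
    return v
end

v = rotateleft!(collect(1:6), Val(2))
```

Here `v` ends up as `[3, 4, 5, 6, 1, 2]`. Whether the surrounding loops are unrolled as well is still up to the compiler.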


#3

it appears to me that the same threshold applies to this case. see

@code_native circularshiftN!(ones(100), Val{50}())

the loop is not unrolled anymore. i think the smaller cases get unrolled more eagerly because the loop body is very simple. there must be some limit on the “total number of things after unrolling”, possibly similar to the inlining logic. however, maybe this limit is applied before the heavy optimization by llvm can take place?


#4

A trick I recently used for loop unrolling with macros is to use nested macros.

macro unrollit( n, body )
...
end

macro dowithunroll(n)
  quote
    ...
    @unrollit($n, <stuff to unroll>)
    ...
  end
end

function nknown()
   @dowithunroll(31)
end

The top level function calls a macro with a constant known at parse-time. Then that macro can call a more generic unrolling macro with that constant. In my case dowithunroll was a function that I converted to a macro and block quoted.

It would be cleaner if I could keep dowithunroll as a function, since I’m only converting it to a macro so that I can pass this parse-time constant down to the generic unrolling macro.
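
A runnable sketch of this nested-macro pattern, in current Julia syntax. The bodies are placeholders invented for illustration (the elided parts of the original are unknown): here `@unrollit` repeats its body with `i` bound to 1 through n, and `@dowithunroll` sums those values.

```julia
# Generic unroller (illustrative body): emits `body` n times with i = 1, 2, ..., n.
macro unrollit(n, body)
    n isa Integer || error("@unrollit needs an integer literal, got $(repr(n))")
    steps = [quote
                 let i = $k
                     $body
                 end
             end for k in 1:n]
    return esc(Expr(:block, steps...))
end

# Outer macro: splices its literal argument into the inner macro call.
macro dowithunroll(n)
    quote
        acc = 0
        @unrollit($n, acc += i)
        acc
    end
end

# The constant is visible at parse time, so it reaches @unrollit as a literal.
function nknown()
    @dowithunroll(4)
end
```

With these placeholder bodies, `nknown()` returns 1 + 2 + 3 + 4 = 10. The `$n` matters: without it the inner macro would receive the symbol `n` instead of the literal.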


#5

I wrote a macro that expands into a generated function: Unrolled.jl (not registered). Your example doesn’t loop over sequences, but over 1:N, so either look at the source and adapt it, or you could implement a CompileTimeUnitRange{A, B}() type and use a helper function. Something like:

mul!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A} = mul!_helper(r, a, b, CompileTimeUnitRange{1, limbs(B)}(), A)

@unroll function mul!_helper(r, a, b, limbs, A)
  @inbounds begin
    @unroll for i in limbs
      r.n[i] = zero(UInt64)
    end
    @unroll for i in limbs
      @unroll for j in limbs
        r.n[i] += a.n[mod(i-j, length(limbs)) + 1] * b.n[j] * (j > i ? UInt64(A) : UInt64(1))
      end
    end
  end
  r
end
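
CompileTimeUnitRange is not defined anywhere in the thread. One possible shape for it, as a sketch under assumptions: a singleton type carrying both bounds as type parameters, plus a hypothetical `unrolled_sum` generated function to show how the bounds can be read back from the type. Whatever extra interface Unrolled.jl might need from such a type is not covered here.

```julia
# Assumed definition: both bounds live in the type, so the value itself is empty.
struct CompileTimeUnitRange{A, B} end

Base.length(::CompileTimeUnitRange{A, B}) where {A, B} = B - A + 1
Base.eltype(::Type{<:CompileTimeUnitRange}) = Int
Base.iterate(::CompileTimeUnitRange{A, B}, state = A) where {A, B} =
    state > B ? nothing : (state, state + 1)

# Hypothetical demo: sum f(k) for k in A:B as one straight-line expression.
@generated function unrolled_sum(f, ::CompileTimeUnitRange{A, B}) where {A, B}
    terms = [:(f($k)) for k in A:B]
    return isempty(terms) ? :(0) : Expr(:call, :+, terms...)
end

rng = CompileTimeUnitRange{1, 4}()
total = unrolled_sum(abs2, rng)   # expands to abs2(1) + abs2(2) + abs2(3) + abs2(4)
```

Because the bounds are type parameters, a generated function (or a macro that expands into one) can see them as plain integers while building the unrolled code.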