yeah, i don’t seem to know exactly what my question is, so below you will find three.
i have a type Mod{B, A}
which just encapsulates a UInt64 array.
and a naive function on it:
@inline limbs(b) = cld(b, 31)
mul!{B, A}(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) = begin
    @inbounds begin
        for i in 1:limbs(B)
            r.n[i] = zero(UInt64)
        end
        for i in 1:limbs(B), j in 1:limbs(B)
            r.n[i] += a.n[mod(i-j, limbs(B)) + 1] * b.n[j] * (j > i ? UInt64(A) : UInt64(1))
        end
    end
    r
end
the idea here is that every loop bound is known at compile time, so the loops would get fully unrolled, and the mod and the conditional would be evaluated at compile time and never make it into the native code. however, it seems that i exceeded some unroll threshold, and only the inner loop gets unrolled. the mod hurts especially badly. i tried using an optimized mod, which helps, but full unrolling would be even better. so question #1: is there a way to force unrolling?
i tried to use @unroll as in Unroll.jl, or more precisely a customized variant of it. however, it appears that macros are expanded before B is known, so the macro receives just the symbol B, not its value. dead end? question #2: do i give up on macros for this?
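to illustrate the macro problem, here is a minimal sketch. the names `@showarg` and `whatis` are made up for this example, and it uses the current `where {B}` syntax rather than the older `f{B}(...)` form used in the rest of this post. the macro only ever sees the symbol `B`, whatever value the type parameter later takes.

```julia
# hypothetical macro: returns a string describing what it was handed.
# the string is computed at macro expansion time and spliced in as a literal.
macro showarg(x)
    return string(typeof(x), ": ", x)
end

# hypothetical function with a type parameter B.
# the macro call is expanded once, when this definition is lowered,
# long before any concrete B exists.
whatis(::Val{B}) where {B} = @showarg(B)

println(whatis(Val(5)))
println(whatis(Val(99)))
```

both calls print the same thing, since the macro never had access to 5 or 99, only to the name.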
i wound up writing a generated function. it produces super optimized code, and the speed improved significantly. however, i seem to recall that generated functions have some problems, maybe with precompiling, but some cursory googling did not help. so question #3: are there any drawbacks to generated functions?
below is my creation. a little bit write-only, but blazing fast.
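@generated mul!{B, A}(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) = begin
    # in here r, a, b are bound to the argument types; the returned quote
    # refers to the arguments by name, so no esc is needed (that is for macros)
    li = limbs(B)
    quote
        @inbounds begin
            $([quote
                t = UInt64(0)
                # terms with j > i are multiplied by A
                $([ :( t += a.n[$(mod(i-j, li)+1)] * b.n[$j] ) for j = i+1:li]...)
                t *= $(UInt64(A))
                # terms with j <= i are added as is
                $([ :( t += a.n[$(mod(i-j, li)+1)] * b.n[$j] ) for j = 1:i]...)
                r.n[$i] = t
            end for i = 1:li ]...)
        end
        r
    end
end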
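since the generated version is hard to read, a quick sanity check against the naive loop seems worthwhile. the sketch below is self-contained and makes several assumptions: the `Mod` struct is a guessed stand-in for the real type, the names `mul_naive!` and `mul_gen!` are only there so both versions can coexist, and it uses the current `where {B, A}` syntax instead of the older form above.

```julia
# hypothetical stand-in for the real type: just wraps a UInt64 array
struct Mod{B, A}
    n::Vector{UInt64}
end

limbs(b) = cld(b, 31)

# naive reference version, same arithmetic as the loop-based mul!
function mul_naive!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
    L = limbs(B)
    for i in 1:L
        r.n[i] = zero(UInt64)
    end
    for i in 1:L, j in 1:L
        r.n[i] += a.n[mod(i - j, L) + 1] * b.n[j] * (j > i ? UInt64(A) : UInt64(1))
    end
    return r
end

# generated version: one straight-line block per limb, indices precomputed
@generated function mul_gen!(r::Mod{B, A}, a::Mod{B, A}, b::Mod{B, A}) where {B, A}
    li = limbs(B)
    rows = [quote
        t = UInt64(0)
        # terms with j > i, scaled by A afterwards
        $([:(t += a.n[$(mod(i - j, li) + 1)] * b.n[$j]) for j in i+1:li]...)
        t *= $(UInt64(A))
        # terms with j <= i, unscaled
        $([:(t += a.n[$(mod(i - j, li) + 1)] * b.n[$j]) for j in 1:i]...)
        r.n[$i] = t
    end for i in 1:li]
    quote
        @inbounds begin
            $(rows...)
        end
        return r
    end
end

# usage: B = 93 gives 3 limbs; A = 19 is an arbitrary test constant
a  = Mod{93, 19}(UInt64[1, 2, 3])
b  = Mod{93, 19}(UInt64[4, 5, 6])
r1 = Mod{93, 19}(zeros(UInt64, 3))
r2 = Mod{93, 19}(zeros(UInt64, 3))
mul_naive!(r1, a, b)
mul_gen!(r2, a, b)
println(r1.n == r2.n)
```

both versions do wrapping UInt64 arithmetic, so factoring out the multiplication by A should not change the result.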