Optimizing remainder with constant modulus implementation

Hi performance people! I am implementing the constant-modulus remainder routines from this paper by Lemire et al. in Julia, translating their C examples.

Below is the original C implementation, followed by my Julia translation.

C code from Lemire et al (2019)

uint32_t d = ...; // your divisor > 0

uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n) {
  uint64_t lowbits = c * n;
  return ((__uint128_t)lowbits * d) >> 64;
}

Julia code

struct Mod
    modulo::UInt32       # d, the divisor
    modulo_inv::UInt64   # c, the precomputed constant from the paper
    function Mod(d::UInt32)
        cmod::UInt64 = div(typemax(UInt64), d) + 1
        return new(d, cmod)
    end
    function Mod(d::Int)
        ud = convert(UInt32, d)
        return Mod(ud)
    end
end

@inline function fastmod(n::UInt32, T::Mod)::UInt32
    lowbits::UInt64 = T.modulo_inv * n
    return (convert(UInt128, lowbits) * T.modulo) >> 64
end

I have refactored it to use a struct to store the precomputation (d and c in the C code).
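
For context, typical usage looks like this (a quick sketch; m is just an example modulus instance):

julia> m = Mod(7);              # precomputes c for d = 7

julia> fastmod(UInt32(100), m)  # 100 mod 7
0x00000002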

Are there any optimizations I can add to the Julia version? Is convert the best translation of the typecast in the C code? I haven't done much micro-optimization, so I am not sure how to interpret or improve the @code_native output.
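
For reference, the generated code can be inspected from the REPL like this (a minimal sketch; @code_native and @code_llvm come from the InteractiveUtils standard library, which the REPL loads automatically):

using InteractiveUtils   # only needed outside the REPL

m = Mod(UInt32(7))                      # example instance
@code_native fastmod(UInt32(100), m)    # prints the generated assembly
@code_llvm fastmod(UInt32(100), m)      # LLVM IR, often easier to read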

The repository for my package is here, but at this point it's basically the above plus tests. It's usually faster than Julia's mod (for positive 32-bit moduli) if you use a given modulus at least twice, though of course that depends on the workload.

Benchmarking code and results

using BenchmarkTools
using StatsBase   # provides sample

function bench_mod(a,b,c)
    for m in b
        for i in eachindex(a)
            c[i] = mod(a[i],m)
        end
    end
end
function bench_fastmod(a,b,c)
    for m in b
        modm = Mod(m)
        for i in eachindex(a)
            c[i] = fastmod(a[i],modm)
        end
    end
end

function bench()
    a = convert(Vector{UInt32}, sample(1:typemax(UInt32), 100_000))
    b = convert(Vector{UInt32}, sample(1:typemax(UInt32), 100))
    c1 = similar(a)
    c2 = similar(a)

    trial_fastmod = @benchmark bench_fastmod($a,$b,$c2)
    trial_mod = @benchmark bench_mod($a,$b,$c1)
    display(trial_fastmod)
    display(trial_mod)
end
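
For completeness, here is a quick sanity check that can be run alongside the benchmark to confirm fastmod agrees with mod (just a sketch using the definitions above; check_fastmod is an illustrative name, not part of the package):

function check_fastmod(trials = 10_000)   # illustrative helper, not in the package
    for _ in 1:trials
        n = rand(UInt32)
        d = rand(UInt32(1):typemax(UInt32))   # divisor must be > 0
        @assert fastmod(n, Mod(d)) == mod(n, d)
    end
    return true
end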

For these sizes of a and b, on my Ryzen CPU, fastmod is roughly 2.6x as fast (the first trial below is bench_fastmod, the second is bench_mod).

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.892 ms (0.00% GC)
  median time:      4.952 ms (0.00% GC)
  mean time:        4.953 ms (0.00% GC)
  maximum time:     5.046 ms (0.00% GC)
  --------------
  samples:          1010
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.851 ms (0.00% GC)
  median time:      12.982 ms (0.00% GC)
  mean time:        12.973 ms (0.00% GC)
  maximum time:     13.201 ms (0.00% GC)
  --------------
  samples:          386
  evals/sample:     1

thanks!

(a) change convert(UInt128, lowbits) to (lowbits % UInt128) for quicker, safe widening conversion.

(b) Your function fastmod has a declared return type of UInt32, but the return statement, without coercion, produces a UInt128 value. Best to be explicit. If you want to truncate, consider

return (((lowbits % UInt128) * T.modulo) >> 64) % UInt32
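
Putting (a) and (b) together, the whole function would read something like this (a sketch of the suggested version, using the same struct fields as your code):

@inline function fastmod(n::UInt32, T::Mod)::UInt32
    lowbits = T.modulo_inv * n                        # UInt64 multiply, wraps mod 2^64
    return (((lowbits % UInt128) * T.modulo) >> 64) % UInt32
end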


Yes, the truncation is intentional; you are right that I should make it explicit.

Thanks for the suggestion. Using % to change types appears to be slightly slower on my machine.

EDIT: Actually, this implementation seems to generate the same assembly that the paper shows (for Clang), so I think this is probably the ceiling. Julia is neat!