Optimizing remainder with constant modulus implementation

Hi performance people! I am implementing the constant-modulus remainder routines from this paper by Lemire et al. in Julia, translating their C examples.

Below is the original C implementation, followed by my Julia translation.

C code from Lemire et al (2019)

uint32_t d = ...; // your divisor > 0

uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n) {
  uint64_t lowbits = c * n;
  return ((__uint128_t)lowbits * d) >> 64;
}

Julia code

struct Mod
    modulo::UInt32       # d, the divisor
    modulo_inv::UInt64   # c, the precomputed constant from the paper
    function Mod(d::UInt32)
        cmod::UInt64 = div(typemax(UInt64), d) + 1
        return new(d, cmod)
    end
    function Mod(d::Int)
        ud = convert(UInt32, d)
        return Mod(ud)
    end
end

@inline function fastmod(n::UInt32, T::Mod)::UInt32
    lowbits::UInt64 = T.modulo_inv * n
    return (convert(UInt128, lowbits) * T.modulo) >> 64
end

I have refactored it to use a struct to store the precomputation (d and c in the C code).
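
For context, typical usage looks like this (a quick sketch; m is just an example modulus instance):

julia> m = Mod(7);              # precomputes c for d = 7

julia> fastmod(UInt32(100), m)  # 100 mod 7
0x00000002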

Are there any optimizations I can add to the Julia version? Is convert the best translation of the typecast in the C code? I haven't done much micro-optimization, so I am not sure how to interpret or improve the @code_native output.
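
For reference, the generated code can be inspected from the REPL like this (a minimal sketch; @code_native and @code_llvm come from the InteractiveUtils standard library, which the REPL loads automatically):

using InteractiveUtils   # only needed outside the REPL

m = Mod(UInt32(7))                      # example instance
@code_native fastmod(UInt32(100), m)    # prints the generated assembly
@code_llvm fastmod(UInt32(100), m)      # LLVM IR, often easier to read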

The repository for my package is here, but at this point it's basically the above plus tests. It's usually faster than Julia's mod (for positive 32-bit moduli) if you use a given modulus at least twice, though of course that depends on the workload.

Benchmarking code and results

using BenchmarkTools
using StatsBase   # provides sample

function bench_mod(a,b,c)
    for m in b
        for i in eachindex(a)
            c[i] = mod(a[i],m)
        end
    end
end
function bench_fastmod(a,b,c)
    for m in b
        modm = Mod(m)
        for i in eachindex(a)
            c[i] = fastmod(a[i],modm)
        end
    end
end

function bench()
    a = convert(Vector{UInt32}, sample(1:typemax(UInt32), 100_000))
    b = convert(Vector{UInt32}, sample(1:typemax(UInt32), 100))
    c1 = similar(a)
    c2 = similar(a)

    trial_fastmod = @benchmark bench_fastmod($a,$b,$c2)
    trial_mod = @benchmark bench_mod($a,$b,$c1)
    display(trial_fastmod)
    display(trial_mod)
end
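
For completeness, here is a quick sanity check that can be run alongside the benchmark to confirm fastmod agrees with mod (just a sketch using the definitions above; check_fastmod is an illustrative name, not part of the package):

function check_fastmod(trials = 10_000)   # illustrative helper, not in the package
    for _ in 1:trials
        n = rand(UInt32)
        d = rand(UInt32(1):typemax(UInt32))   # divisor must be > 0
        @assert fastmod(n, Mod(d)) == mod(n, d)
    end
    return true
end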

For these sizes of a and b, on my Ryzen CPU, fastmod is roughly 2.6x as fast (the first trial below is bench_fastmod, the second is bench_mod).

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.892 ms (0.00% GC)
  median time:      4.952 ms (0.00% GC)
  mean time:        4.953 ms (0.00% GC)
  maximum time:     5.046 ms (0.00% GC)
  --------------
  samples:          1010
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.851 ms (0.00% GC)
  median time:      12.982 ms (0.00% GC)
  mean time:        12.973 ms (0.00% GC)
  maximum time:     13.201 ms (0.00% GC)
  --------------
  samples:          386
  evals/sample:     1

thanks!

(a) change convert(UInt128, lowbits) to (lowbits % UInt128) for quicker, safe widening conversion.

(b) Your function fastmod has a declared return type of UInt32, but the return statement, without coercion, produces a UInt128 value. Best to be explicit. If you want to truncate, consider

return (((lowbits % UInt128) * T.modulo) >> 64) % UInt32
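
Putting (a) and (b) together, the whole function would read something like this (a sketch of the suggested version, using the same struct fields as your code):

@inline function fastmod(n::UInt32, T::Mod)::UInt32
    lowbits = T.modulo_inv * n                        # UInt64 multiply, wraps mod 2^64
    return (((lowbits % UInt128) * T.modulo) >> 64) % UInt32
end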


Yes, the truncation is intentional; you are right that I should make it explicit.

Thanks for the suggestion. Using % to change types appears to be slightly slower on my machine.

EDIT: Actually, this implementation seems to generate the same assembly that the paper shows (for Clang), so I think this is probably the ceiling. Julia is neat!