How to construct a struct with common field values efficiently

I am now developing a code which makes use of a huge number of modulo arithmetic.
For a better readability, I tried to define custom types Modulus and ModNum.

Modulus contains some values which is essential for the modulo arithmetic (for example, the value of the modulus, Q, or some pre-computed values for montgomery reduction, etc.) and ModNum stores the value and modulus. Namely,

mutable struct Modulus
    Q::UInt64
    Qinv::UInt64
    ...
end

struct ModNum
    val::UInt64
    Q::Modulus
end

. And then, I could define overloading operations such as Base.+, Base.- or Base.* for two ModNums with the same field value of Q. However, one downside with this approach is that it copies Q whenever new ModNum is generated, and it slows down the code and also consumes more memory. (Using UInt64 is twice as fast as this approach, but the readability and usability is a bit poor.) Will there be any good way to keep the structure while avoiding memory consumption from copying Q?

Don’t make Modulus mutable. Once it’s an immutable struct, the copies will be (basically) free

Actually I have already tried that, but making Modulus mutable makes the code ~30% faster and the size of ModNum is much smaller.

Wait I’m reading this again. Why are you copying Q? You should be able to use the same Q as the inputs.

Non-mutable should be faster, but what’s hiding inside ... here? If you want performance advice it is necessary to share some runnable code, a minimal working example that demonstrates the issue.

1 Like

I was referring to using the same Q as input when generating ModNum. It works well, but it is not as efficient as not defining ModNum and use Q as input for the binary operations. (i.e. use UInt64 instead of ModNum and define functions like add(x::UInt64, y::UInt64, Q::Modulus))

So, this is my code for Montgomery multiplication.

mutable struct Modulus
    const Q::UInt64
    const Q⁻¹::UInt64
    const R⁻¹::UInt128
    const mask::UInt128
    const logR::Int64

    function Modulus(Q::Integer)
        @assert isodd(Q) "Modulus should be an odd number."

        Q⁻¹ = UInt64(invmod(-Q, 0x00000000000000010000000000000000))
        R⁻¹ = UInt128(invmod(0x00000000000000010000000000000000, Q))
        new(Q, Q⁻¹, R⁻¹, 0x0000000000000000ffffffffffffffff, 64)
    end
end

Base.:mod(x::Integer, Q::Modulus) = mod(x, Q.Q)
Base.:invmod(x::Integer, Q::Modulus) = invmod(UInt64(x), Q.Q)

struct ModNum <: Unsigned
    val::UInt64
    Q::Modulus
end

mform(x::ModNum) = ModNum(UInt64(mod(UInt128(mod(x.val, x.Q)) << x.Q.logR, x.Q)), x.Q)
imform(x::ModNum) = ModNum(UInt64(mod(x.val * x.Q.R⁻¹, x.Q)), x.Q)

add(a::UInt64, b::UInt64, Q::Modulus) = begin
    res = a + b
    if res ≥ Q.Q
        res -= Q.Q
    end
    res
end

Base.:+(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(add(a.val, b.val, a.Q), a.Q)
end

Base.:-(a::ModNum) = ModNum(a.Q.Q - a.val, a.Q)

sub(a::UInt64, b::UInt64, Q::Modulus) = begin
    res = a - b
    if res ≥ Q.Q
        res += Q.Q
    end
    res
end

Base.:-(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(sub(a.val, b.val, a.Q), a.Q)
end

mul(a::UInt64, b::UInt64, Q::Modulus) = begin
    q, q⁻¹ = Q.Q, Q.Q⁻¹

    ab = UInt128(a) * UInt128(b)
    w = UInt64((ab + q * UInt128(UInt64(ab & Q.mask) * q⁻¹)) >> Q.logR)
    w ≥ q ? w - q : w
end

Base.:*(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(mul(a.val, b.val, a.Q), a.Q)
end

The test code is as follows:

Q = Modulus(113)
a, b = ModNum(1, Q), ModNum(3, Q);

@btime begin
    for i = 1 : 1000
         a * b;
    end
end
12.292 μs (1000 allocations: 31.25 KiB)

@btime begin
    for i = 1 : 1000
        mul(0x0000000000000001, 0x0000000000000003, Q);
    end
end
2.181 μs (0 allocations: 0 bytes)

Of course using ModNum will be less efficient than just using UInt64, but I wonder if there is any nice way to keep away from (too many) allocations or make this construction faster.

This is a benchmarking artifact (since a and b are globals)

function f(n, A)
    a = ModNum(1,Q)
    for i in 1:n
        a *= ModNum(i,Q)
    end
    a
end
@btime f(1000, a);

you should see no allocations

EDIT: actually I am seeing the allocations. That’s very odd…

Also, once you get this working, I would very much appreciate if you PR’d it to GitHub - JuliaMath/IntegerMathUtils.jl since Primes.jl could be a good bit faster with a good implementation of Montgomery multiplication.

It gives me 1491 allocations in my laptop (macbook).

@btime f(1000, a);
  98.250 μs (1491 allocations: 38.95 KiB)

Regarding the correctness of the code, I opted out the unnecessary parts. Updated the code above so that it guarantees correctness. You need to call mform(a) to make the integer into the montgomery form, and call imform(a) to make it into the normal form.

Perhaps I am missing something but isn’t it simply that Q still is an untyped global?

2 Likes

You are missing the fact that I’m trying to help someone debug stuff at 2am :slight_smile:

julia> function f(n, Q)
           a = ModNum(1,Q)
           for i in 1:n
               a *= ModNum(i,Q)
           end
           a
       end
f (generic function with 1 method)

julia> @btime f(1000, Q);
  5.228 μs (1 allocation: 32 bytes)
4 Likes

What’s the definition of Modulus now? Because having a mutable struct with const fields makes no sense.

It stores the values you need to perform montgomery reduction. It is complete nonsense, but interestingly it enhances the code performance. My theory is, since mutable struct is stored in the heap, you can access it more faster.

it does. it’s dumb but it’s a way to force the stuict to not be inlined

1 Like

Huh? Is this the compiler failing at optimization?

kind of. if you make the struct immutable, it will inline the fields which leads to having to copy a lot more stuff around.

Probably.

To be precise, I am implementing NTT using ModNum and Modulus above, and for dimension N=1024, the benchmarks for NTT and iNTT is 21.500 μs and 25.125 μs if I make Modulus mutable. If I make Modulus immutable, the timings are 27.458 μs and 27.875 μs, respectively.

My goal is to make this more faster. If I use UInt64 instead of ModNum, the benchmarks are 12.041 μs and 15.791 μs, but the readability and the usuability is not so very good.

it’s your ntt allocating where it shouldn’t? it’s the code small enough to post? also, how fast is it if you just use regular modular arithmetic?

It does not allocate at all, but it is slow. And unfortunately, the code is too long to post. :frowning: I will check if I could make it more compact.