How to construct a struct with common field values efficiently

lukemin · September 20, 2023, 2:44am

I am now developing a code which makes use of a huge number of modulo arithmetic.
For a better readability, I tried to define custom types Modulus and ModNum.

Modulus contains some values which is essential for the modulo arithmetic (for example, the value of the modulus, Q, or some pre-computed values for montgomery reduction, etc.) and ModNum stores the value and modulus. Namely,

mutable struct Modulus
    Q::UInt64
    Qinv::UInt64
    ...
end

struct ModNum
    val::UInt64
    Q::Modulus
end

. And then, I could define overloading operations such as Base.+, Base.- or Base.* for two ModNums with the same field value of Q. However, one downside with this approach is that it copies Q whenever new ModNum is generated, and it slows down the code and also consumes more memory. (Using UInt64 is twice as fast as this approach, but the readability and usability is a bit poor.) Will there be any good way to keep the structure while avoiding memory consumption from copying Q?

Oscar_Smith · September 20, 2023, 3:33am

Don’t make Modulus mutable. Once it’s an immutable struct, the copies will be (basically) free

lukemin · September 20, 2023, 4:28am

Actually I have already tried that, but making Modulus mutable makes the code ~30% faster and the size of ModNum is much smaller.

Oscar_Smith · September 20, 2023, 4:49am

Wait I’m reading this again. Why are you copying Q? You should be able to use the same Q as the inputs.

DNF · September 20, 2023, 5:06am

Non-mutable should be faster, but what’s hiding inside ... here? If you want performance advice it is necessary to share some runnable code, a minimal working example that demonstrates the issue.

lukemin · September 20, 2023, 5:06am

I was referring to using the same Q as input when generating ModNum. It works well, but it is not as efficient as not defining ModNum and use Q as input for the binary operations. (i.e. use UInt64 instead of ModNum and define functions like add(x::UInt64, y::UInt64, Q::Modulus))

lukemin · September 20, 2023, 5:18am

So, this is my code for Montgomery multiplication.

mutable struct Modulus
    const Q::UInt64
    const Q⁻¹::UInt64
    const R⁻¹::UInt128
    const mask::UInt128
    const logR::Int64

    function Modulus(Q::Integer)
        @assert isodd(Q) "Modulus should be an odd number."

        Q⁻¹ = UInt64(invmod(-Q, 0x00000000000000010000000000000000))
        R⁻¹ = UInt128(invmod(0x00000000000000010000000000000000, Q))
        new(Q, Q⁻¹, R⁻¹, 0x0000000000000000ffffffffffffffff, 64)
    end
end

Base.:mod(x::Integer, Q::Modulus) = mod(x, Q.Q)
Base.:invmod(x::Integer, Q::Modulus) = invmod(UInt64(x), Q.Q)

struct ModNum <: Unsigned
    val::UInt64
    Q::Modulus
end

mform(x::ModNum) = ModNum(UInt64(mod(UInt128(mod(x.val, x.Q)) << x.Q.logR, x.Q)), x.Q)
imform(x::ModNum) = ModNum(UInt64(mod(x.val * x.Q.R⁻¹, x.Q)), x.Q)

add(a::UInt64, b::UInt64, Q::Modulus) = begin
    res = a + b
    if res ≥ Q.Q
        res -= Q.Q
    end
    res
end

Base.:+(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(add(a.val, b.val, a.Q), a.Q)
end

Base.:-(a::ModNum) = ModNum(a.Q.Q - a.val, a.Q)

sub(a::UInt64, b::UInt64, Q::Modulus) = begin
    res = a - b
    if res ≥ Q.Q
        res += Q.Q
    end
    res
end

Base.:-(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(sub(a.val, b.val, a.Q), a.Q)
end

mul(a::UInt64, b::UInt64, Q::Modulus) = begin
    q, q⁻¹ = Q.Q, Q.Q⁻¹

    ab = UInt128(a) * UInt128(b)
    w = UInt64((ab + q * UInt128(UInt64(ab & Q.mask) * q⁻¹)) >> Q.logR)
    w ≥ q ? w - q : w
end

Base.:*(a::ModNum, b::ModNum) = begin
    @assert a.Q.Q == b.Q.Q
    ModNum(mul(a.val, b.val, a.Q), a.Q)
end

The test code is as follows:

Q = Modulus(113)
a, b = ModNum(1, Q), ModNum(3, Q);

@btime begin
    for i = 1 : 1000
         a * b;
    end
end
12.292 μs (1000 allocations: 31.25 KiB)

@btime begin
    for i = 1 : 1000
        mul(0x0000000000000001, 0x0000000000000003, Q);
    end
end
2.181 μs (0 allocations: 0 bytes)

Of course using ModNum will be less efficient than just using UInt64, but I wonder if there is any nice way to keep away from (too many) allocations or make this construction faster.

Oscar_Smith · September 20, 2023, 5:34am

This is a benchmarking artifact (since a and b are globals)

function f(n, A)
    a = ModNum(1,Q)
    for i in 1:n
        a *= ModNum(i,Q)
    end
    a
end
@btime f(1000, a);

you should see no allocations

EDIT: actually I am seeing the allocations. That’s very odd…

Oscar_Smith · September 20, 2023, 5:35am

Also, once you get this working, I would very much appreciate if you PR’d it to GitHub - JuliaMath/IntegerMathUtils.jl since Primes.jl could be a good bit faster with a good implementation of Montgomery multiplication.

lukemin · September 20, 2023, 5:43am

It gives me 1491 allocations in my laptop (macbook).

@btime f(1000, a);
  98.250 μs (1491 allocations: 38.95 KiB)

Regarding the correctness of the code, I opted out the unnecessary parts. Updated the code above so that it guarantees correctness. You need to call mform(a) to make the integer into the montgomery form, and call imform(a) to make it into the normal form.

Salmon · September 20, 2023, 5:49am

Perhaps I am missing something but isn’t it simply that Q still is an untyped global?

Oscar_Smith · September 20, 2023, 5:51am

You are missing the fact that I’m trying to help someone debug stuff at 2am

julia> function f(n, Q)
           a = ModNum(1,Q)
           for i in 1:n
               a *= ModNum(i,Q)
           end
           a
       end
f (generic function with 1 method)

julia> @btime f(1000, Q);
  5.228 μs (1 allocation: 32 bytes)

DNF · September 20, 2023, 5:54am

What’s the definition of Modulus now? Because having a mutable struct with const fields makes no sense.

lukemin · September 20, 2023, 5:57am

It stores the values you need to perform montgomery reduction. It is complete nonsense, but interestingly it enhances the code performance. My theory is, since mutable struct is stored in the heap, you can access it more faster.

Oscar_Smith · September 20, 2023, 5:57am

it does. it’s dumb but it’s a way to force the stuict to not be inlined

DNF · September 20, 2023, 6:01am

Huh? Is this the compiler failing at optimization?

Oscar_Smith · September 20, 2023, 6:12am

kind of. if you make the struct immutable, it will inline the fields which leads to having to copy a lot more stuff around.

lukemin · September 20, 2023, 6:17am

Probably.

To be precise, I am implementing NTT using ModNum and Modulus above, and for dimension N=1024, the benchmarks for NTT and iNTT is 21.500 μs and 25.125 μs if I make Modulus mutable. If I make Modulus immutable, the timings are 27.458 μs and 27.875 μs, respectively.

My goal is to make this more faster. If I use UInt64 instead of ModNum, the benchmarks are 12.041 μs and 15.791 μs, but the readability and the usuability is not so very good.

Oscar_Smith · September 20, 2023, 6:24am

it’s your ntt allocating where it shouldn’t? it’s the code small enough to post? also, how fast is it if you just use regular modular arithmetic?

lukemin · September 20, 2023, 6:25am

It does not allocate at all, but it is slow. And unfortunately, the code is too long to post. I will check if I could make it more compact.

Topic		Replies	Views
Mutable vs immutable struct: modifying an array field General Usage struct , mutable-structure	4	1812	February 22, 2023
Structs: mutable versus immutable New to Julia question , struct	22	12659	January 25, 2021
Question Julia best practices for structs of numbers General Usage performance , struct	7	405	January 23, 2024
Performance advice needed Performance fftw	17	1323	January 17, 2020
Mutable ProtoStruct beats Base Mutable Struct in microbenchmark...and then it is much slower in real program Performance question , package , benchmark	9	371	November 9, 2023

How to construct a struct with common field values efficiently

Related topics