Consider the following code:
```julia
using Random
using BenchmarkTools

@inline function testfunc1(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * p
end

@inline function testfunc2(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * P
end

function batched_mul1(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc1(x[i], y[i], p, tp)
    end
end

function batched_mul2(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc2(x[i], y[i], p, tp)
    end
end

rng = MersenneTwister(123)
batch = 1000000
p = UInt64(576460752308273153)
x = rand(rng, UInt64(1):p-1, batch)
y = rand(rng, UInt64(1):p-1, batch)
res1 = similar(x)
res2 = similar(x)
tp = Val(p)

display(@benchmark batched_mul1($res1, $x, $y, $p, $tp))
println()
display(@benchmark batched_mul2($res2, $x, $y, $p, $tp))
println()
```
It consistently shows the second function running 30–40 µs slower than the first (against a total runtime of about 1 ms). In the actual code, which I simplified for this example, the difference is more significant, reaching ~10%.
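One way to dig into a difference like this is to compare the code the compiler emits for the two variants. A self-contained diagnostic sketch (the function bodies are copied from the example above, renamed `f1`/`f2` so it can run on its own):

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

# Same bodies as testfunc1/testfunc2 above: one multiplies by the runtime
# argument p, the other by the type parameter P.
@inline f1(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P = x * y * p
@inline f2(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P = x * y * P

p = UInt64(576460752308273153)

# Dump the LLVM IR for both; any extra instructions in the second dump
# point at where the slowdown comes from.
@code_llvm debuginfo=:none f1(UInt64(3), UInt64(5), p, Val(p))
@code_llvm debuginfo=:none f2(UInt64(3), UInt64(5), p, Val(p))
```

`@code_native` can be substituted for `@code_llvm` to see the final machine code instead.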
The use case is as follows: I have a "modulo integer type", say `Modulo{T, M}`, with a single field `val::T`, where `T` is an integer type and `M` is an integer (of type `T`), and operators for it are defined as

```julia
Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} =
    Modulo{T, M}(mulmod(x.val, y.val, M))
```

The problem described above means that `mulmod` used by itself runs faster than the surrounding operator. Naturally, I would prefer to use the operator, since it's more convenient and I don't have to carry the modulus around.
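For reference, a minimal self-contained sketch of such a type (the `mulmod` here is a naive widening implementation, just a stand-in for whatever the real code uses):

```julia
# Modular integer: the modulus M lives in the type, not in a field.
struct Modulo{T, M}
    val::T
end

# Naive modular multiplication: widen to UInt128 to avoid overflow,
# reduce, then truncate back to UInt64. (Stand-in for the real mulmod.)
mulmod(x::UInt64, y::UInt64, m::UInt64) = (widemul(x, y) % m) % UInt64

Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} =
    Modulo{T, M}(mulmod(x.val, y.val, M))

a = Modulo{UInt64, UInt64(7)}(UInt64(3))
b = Modulo{UInt64, UInt64(7)}(UInt64(5))
(a * b).val  # 3 * 5 mod 7 == 1
```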
Is this a limitation of the Julia type system (I suspect it may have something to do with `M` somehow being dynamically converted to `T`, despite being known at JIT-compilation time)? Or am I doing something wrong?
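One quick way to probe the dynamic-conversion suspicion is to ask what type the method actually sees for the type parameter (a small sketch; `ptype` is a hypothetical helper, not from the code above):

```julia
# Report the type of the value extracted from Val{P}.
ptype(::Val{P}) where P = typeof(P)

p = UInt64(576460752308273153)
ptype(Val(p))  # UInt64: the parameter is stored as a UInt64
```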