# Function parameter works faster than type parameter

Consider the following code:

``````using Random
using BenchmarkTools

@inline function testfunc1(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
x * y * p
end

@inline function testfunc2(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
x * y * P
end

function batched_mul1(res, x, y, p, tp)
@inbounds for i in 1:length(x)
res[i] = testfunc1(x[i], y[i], p, tp)
end
end

function batched_mul2(res, x, y, p, tp)
@inbounds for i in 1:length(x)
res[i] = testfunc2(x[i], y[i], p, tp)
end
end

rng = MersenneTwister(123)

batch = 1000000

p = UInt64(576460752308273153)
x = rand(rng, UInt64(1):p-1, batch)
y = rand(rng, UInt64(1):p-1, batch)

res1 = similar(x)
res2 = similar(x)

tp = Val(p)

display(@benchmark batched_mul1(\$res1, \$x, \$y, \$p, \$tp))
println()

display(@benchmark batched_mul2(\$res2, \$x, \$y, \$p, \$tp))
println()
``````

It consistently shows that the second function is 30-40us slower than the first one (as compared to the 1ms total runtime). In the actual code which I was trying to simplify here the difference is more significant, and reaches ~10%.

The use case is as follows: I have a “modulo integer type”, say, `Modulo{T, M}` with the only field `val :: T`, where `T` is an integer type, `M` is an integer (of type `T`), and operators for it are defined as

``````Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} =
Modulo{T,M}(mulmod(x.val, y.val, M))
``````

The problem described above leads to `mulmod` used by itself working faster than the surrounding operator. Naturally, I would prefer to use the operator, since it’s more convenient, and I don’t have to carry the modulus around.

Is it some limitation of Julia type system (I suspect it may have something to do with `M` somehow being dynamically converted to `T`, despite being known at JIT-compilation time)? Or am I doing something wrong?

I can reproduce the time difference.

But the only difference I see in the `@code_native` loop body is that the instructions are in a slightly different order in both versions.

EDIT:
Specifically, the version not uses the type parameter:

``````vpbroadcastq	%xmm0, %ymm0
vpsrlq	\$32, %ymm0, %ymm1
nopw	%cs:(%rax,%rax)
; │ @ REPL[83]:3 within `batched_mul1'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%r10,%rsi,8), %ymm2
vmovdqu	(%rdx,%rsi,8), %ymm3
; │└
; │┌ @ REPL[81]:2 within `testfunc1'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	\$32, %ymm2, %ymm5
vpmuludq	%ymm1, %ymm2, %ymm4
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm5, %ymm5
vpsllq	\$32, %ymm4, %ymm4
vpsrlq	\$32, %ymm3, %ymm4
vpsrlq	\$32, %ymm2, %ymm5
vpmuludq	%ymm4, %ymm2, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpmuludq	%ymm3, %ymm5, %ymm5
vpsllq	\$32, %ymm4, %ymm4
; │└└
; │┌ @ array.jl:766 within `setindex!'
vmovdqu	%ymm2, (%rdi,%rsi,8)
cmpq	%rsi, %r9
jne	L144
``````

The first two instructions:

2. vpsrlq to shift ymm0 right by 32 bits, storing those results in ymm1
3. The loop body starts at “L144:”
4. The loop loads x and y into ymm3 and ymm4
5. The loop does a bunch of multiplications with ymm0 and ymm1…
6. Does some shifts
7. Does another set of multiplications.

For the slow version of the function, “5” and “7” seem to have swapped places:

``````vpbroadcastq	(%rcx), %ymm0
andq	\$-4, %r8
leaq	1(%r8), %rdi
; │ @ REPL[84]:3 within `batched_mul2'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%rbx,%rax,8), %ymm2
vmovdqu	(%rdx,%rax,8), %ymm3
; │└
; │┌ @ REPL[82]:2 within `testfunc2'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	\$32, %ymm2, %ymm4
vpsrlq	\$32, %ymm3, %ymm5
vpmuludq	%ymm5, %ymm2, %ymm5
vpmuludq	%ymm3, %ymm4, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpsllq	\$32, %ymm4, %ymm4
vpsrlq	\$32, %ymm2, %ymm4
vpmuludq	%ymm1, %ymm2, %ymm3
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm4, %ymm4
vpsllq	\$32, %ymm3, %ymm3