Function parameter works faster than type parameter

Consider the following code:

using Random
using BenchmarkTools


# Uses the modulus passed as a regular function argument (p)
@inline function testfunc1(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * p
end

# Uses the modulus carried in the type parameter (P)
@inline function testfunc2(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * P
end


function batched_mul1(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc1(x[i], y[i], p, tp)
    end
end

function batched_mul2(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc2(x[i], y[i], p, tp)
    end
end


rng = MersenneTwister(123)

batch = 1000000

p = UInt64(576460752308273153)
x = rand(rng, UInt64(1):p-1, batch)
y = rand(rng, UInt64(1):p-1, batch)

res1 = similar(x)
res2 = similar(x)

tp = Val(p)  # lift the modulus into the type domain, so P == p inside the test functions

display(@benchmark batched_mul1($res1, $x, $y, $p, $tp))
println()

display(@benchmark batched_mul2($res2, $x, $y, $p, $tp))
println()

It consistently shows that the second function is 30–40 µs slower than the first (compared to the ~1 ms total runtime). In the actual code, which I tried to simplify here, the difference is more significant and reaches ~10%.

The use case is as follows: I have a “modulo integer type”, say Modulo{T, M}, whose only field is val::T, where T is an integer type and M is an integer (of type T); its operators are defined as

Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} = 
    Modulo{T,M}(mulmod(x.val, y.val, M))

The problem described above means that mulmod used by itself works faster than the surrounding operator. Naturally, I would prefer to use the operator, since it is more convenient and I don’t have to carry the modulus around.
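For concreteness, here is a minimal, self-contained sketch of such a type. The struct layout, the plain-remainder mulmod, and the example values are my assumptions, not the original code:

struct Modulo{T, M}
    val::T
end

# Hypothetical modular multiplication: widen, reduce, truncate back to T.
# The real mulmod presumably uses a faster reduction for this modulus.
mulmod(x::T, y::T, m::T) where {T<:Unsigned} = (widemul(x, y) % m) % T

Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} =
    Modulo{T, M}(mulmod(x.val, y.val, M))

# The modulus lives in the type, so call sites never pass it explicitly:
const P59 = UInt64(576460752308273153)
a = Modulo{UInt64, P59}(UInt64(12345))
b = Modulo{UInt64, P59}(UInt64(67890))
a * b  # == Modulo{UInt64, P59}(mulmod(UInt64(12345), UInt64(67890), P59))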

Is it some limitation of Julia’s type system (I suspect it may have something to do with M somehow being converted to T at run time, despite being known at JIT-compilation time)? Or am I doing something wrong?

I can reproduce the time difference.

But the only difference I see in the @code_native loop body is that the instructions come out in a slightly different order in the two versions.
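For reference, the listings below come from inspecting the two loops with @code_native, along these lines (my reconstruction of the invocation, not copied from the post):

@code_native batched_mul1(res1, x, y, p, tp)
@code_native batched_mul2(res2, x, y, p, tp)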

EDIT:
Specifically, here is the version that does not use the type parameter:

vpbroadcastq	%xmm0, %ymm0
vpsrlq	$32, %ymm0, %ymm1
nopw	%cs:(%rax,%rax)
; │ @ REPL[83]:3 within `batched_mul1'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%r10,%rsi,8), %ymm2
vmovdqu	(%rdx,%rsi,8), %ymm3
; │└
; │┌ @ REPL[81]:2 within `testfunc1'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	$32, %ymm2, %ymm5
vpmuludq	%ymm1, %ymm2, %ymm4
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm5, %ymm5
vpaddq	%ymm5, %ymm4, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
vpsrlq	$32, %ymm3, %ymm4
vpsrlq	$32, %ymm2, %ymm5
vpmuludq	%ymm4, %ymm2, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpmuludq	%ymm3, %ymm5, %ymm5
vpaddq	%ymm5, %ymm4, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
; │└└
; │┌ @ array.jl:766 within `setindex!'
vmovdqu	%ymm2, (%rdi,%rsi,8)
addq	$4, %rsi
cmpq	%rsi, %r9
jne	L144

Walking through the fast version:

  1. broadcast p into ymm0
  2. vpsrlq shifts ymm0 right by 32 bits, storing the result in ymm1
  3. The loop body starts at “L144:”
  4. The loop loads x and y into ymm2 and ymm3
  5. It does the first set of multiplications, the ones involving ymm0 and ymm1 (i.e. p)…
  6. Does some shifts and adds…
  7. Does another set of multiplications (see the sketch below for how each 64-bit multiply is built from 32-bit pieces).
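Steps 5–7 are how LLVM expresses each 64-bit lane multiply with 32-bit vpmuludq multiplies: low×low plus the two cross products shifted left by 32 (the high×high part falls outside the low 64 bits). A scalar Julia sketch of that decomposition, purely illustrative and not from the thread:

# The low 64 bits of a 64x64-bit product built from 32-bit pieces,
# mirroring the vpmuludq / vpsllq / vpaddq pattern in the listings.
function mul64_lo(a::UInt64, b::UInt64)
    a_lo = a & 0xffffffff;  a_hi = a >> 32
    b_lo = b & 0xffffffff;  b_hi = b >> 32
    cross = a_lo * b_hi + a_hi * b_lo    # two of the three multiplies
    a_lo * b_lo + (cross << 32)          # third multiply, then shift and add
end

mul64_lo(0x0123456789abcdef, 0xfedcba9876543210) == 0x0123456789abcdef * 0xfedcba9876543210  # true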

For the slow version of the function, steps 5 and 7 seem to have swapped places:

vpbroadcastq	(%rcx), %ymm0
vpbroadcastq	(%r10), %ymm1
andq	$-4, %r8
leaq	1(%r8), %rdi
; │ @ REPL[84]:3 within `batched_mul2'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%rbx,%rax,8), %ymm2
vmovdqu	(%rdx,%rax,8), %ymm3
; │└
; │┌ @ REPL[82]:2 within `testfunc2'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	$32, %ymm2, %ymm4
vpsrlq	$32, %ymm3, %ymm5
vpmuludq	%ymm5, %ymm2, %ymm5
vpmuludq	%ymm3, %ymm4, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpaddq	%ymm4, %ymm5, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
vpsrlq	$32, %ymm2, %ymm4
vpmuludq	%ymm1, %ymm2, %ymm3
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm4, %ymm4
vpaddq	%ymm4, %ymm3, %ymm3
vpsllq	$32, %ymm3, %ymm3
vpaddq	%ymm3, %ymm2, %ymm2
; │└└
; │┌ @ array.jl:766 within `setindex!'
vmovdqu	%ymm2, (%rsi,%rax,8)
addq	$4, %rax
cmpq	%rax, %r8
jne	L144

My guess, then, is that this version is slower because it has to wait longer before it can start the first set of multiplications. In the fast version, ymm0 and ymm1 are already loaded, so it can proceed with less latency / take better advantage of out-of-order execution / etc.

I’d blame LLVM, not Julia.

Thank you for the analysis; I guess I’ll just have to leave it as is and hope for an improvement in LLVM.