Function parameter works faster than type parameter

Consider the following code:

using Random
using BenchmarkTools


# Uses the modulus passed as a regular function argument (p)
@inline function testfunc1(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * p
end

# Uses the modulus carried in the type parameter (P)
@inline function testfunc2(x::UInt64, y::UInt64, p::UInt64, ::Val{P}) where P
    x * y * P
end


function batched_mul1(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc1(x[i], y[i], p, tp)
    end
end

function batched_mul2(res, x, y, p, tp)
    @inbounds for i in 1:length(x)
        res[i] = testfunc2(x[i], y[i], p, tp)
    end
end


rng = MersenneTwister(123)

batch = 1000000

p = UInt64(576460752308273153)
x = rand(rng, UInt64(1):p-1, batch)
y = rand(rng, UInt64(1):p-1, batch)

res1 = similar(x)
res2 = similar(x)

tp = Val(p)  # lift the modulus into the type domain, so P == p inside the test functions

display(@benchmark batched_mul1($res1, $x, $y, $p, $tp))
println()

display(@benchmark batched_mul2($res2, $x, $y, $p, $tp))
println()

It consistently shows that the second function is 30–40 µs slower than the first (compared to the ~1 ms total runtime). In the actual code, which I tried to simplify here, the difference is more significant and reaches ~10%.

The use case is as follows: I have a “modulo integer type”, say Modulo{T, M}, whose only field is val::T, where T is an integer type and M is an integer (of type T); its operators are defined as

Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} = 
    Modulo{T,M}(mulmod(x.val, y.val, M))

The problem described above means that mulmod used by itself works faster than the surrounding operator. Naturally, I would prefer to use the operator, since it is more convenient and I don’t have to carry the modulus around.
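For concreteness, here is a minimal, self-contained sketch of such a type. The struct layout, the plain-remainder mulmod, and the example values are my assumptions, not the original code:

struct Modulo{T, M}
    val::T
end

# Hypothetical modular multiplication: widen, reduce, truncate back to T.
# The real mulmod presumably uses a faster reduction for this modulus.
mulmod(x::T, y::T, m::T) where {T<:Unsigned} = (widemul(x, y) % m) % T

Base.:*(x::Modulo{T, M}, y::Modulo{T, M}) where {T, M} =
    Modulo{T, M}(mulmod(x.val, y.val, M))

# The modulus lives in the type, so call sites never pass it explicitly:
const P59 = UInt64(576460752308273153)
a = Modulo{UInt64, P59}(UInt64(12345))
b = Modulo{UInt64, P59}(UInt64(67890))
a * b  # == Modulo{UInt64, P59}(mulmod(UInt64(12345), UInt64(67890), P59))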

Is it some limitation of Julia’s type system (I suspect it may have something to do with M somehow being converted to T at run time, despite being known at JIT-compilation time)? Or am I doing something wrong?

I can reproduce the time difference.

But the only difference I see in the @code_native loop body is that the instructions come out in a slightly different order in the two versions.
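For reference, the listings below come from inspecting the two loops with @code_native, along these lines (my reconstruction of the invocation, not copied from the post):

@code_native batched_mul1(res1, x, y, p, tp)
@code_native batched_mul2(res2, x, y, p, tp)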

EDIT:
Specifically, here is the version that does not use the type parameter:

vpbroadcastq	%xmm0, %ymm0
vpsrlq	$32, %ymm0, %ymm1
nopw	%cs:(%rax,%rax)
; │ @ REPL[83]:3 within `batched_mul1'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%r10,%rsi,8), %ymm2
vmovdqu	(%rdx,%rsi,8), %ymm3
; │└
; │┌ @ REPL[81]:2 within `testfunc1'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	$32, %ymm2, %ymm5
vpmuludq	%ymm1, %ymm2, %ymm4
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm5, %ymm5
vpaddq	%ymm5, %ymm4, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
vpsrlq	$32, %ymm3, %ymm4
vpsrlq	$32, %ymm2, %ymm5
vpmuludq	%ymm4, %ymm2, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpmuludq	%ymm3, %ymm5, %ymm5
vpaddq	%ymm5, %ymm4, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
; │└└
; │┌ @ array.jl:766 within `setindex!'
vmovdqu	%ymm2, (%rdi,%rsi,8)
addq	$4, %rsi
cmpq	%rsi, %r9
jne	L144

Walking through the fast version:

  1. broadcast p into ymm0
  2. vpsrlq shifts ymm0 right by 32 bits, storing the result in ymm1
  3. The loop body starts at “L144:”
  4. The loop loads x and y into ymm2 and ymm3
  5. It does the first set of multiplications, the ones involving ymm0 and ymm1 (i.e. p)…
  6. Does some shifts and adds…
  7. Does another set of multiplications (see the sketch below for how each 64-bit multiply is built from 32-bit pieces).
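Steps 5–7 are how LLVM expresses each 64-bit lane multiply with 32-bit vpmuludq multiplies: low×low plus the two cross products shifted left by 32 (the high×high part falls outside the low 64 bits). A scalar Julia sketch of that decomposition, purely illustrative and not from the thread:

# The low 64 bits of a 64x64-bit product built from 32-bit pieces,
# mirroring the vpmuludq / vpsllq / vpaddq pattern in the listings.
function mul64_lo(a::UInt64, b::UInt64)
    a_lo = a & 0xffffffff;  a_hi = a >> 32
    b_lo = b & 0xffffffff;  b_hi = b >> 32
    cross = a_lo * b_hi + a_hi * b_lo    # two of the three multiplies
    a_lo * b_lo + (cross << 32)          # third multiply, then shift and add
end

mul64_lo(0x0123456789abcdef, 0xfedcba9876543210) == 0x0123456789abcdef * 0xfedcba9876543210  # true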

For the slow version of the function, steps 5 and 7 seem to have swapped places:

vpbroadcastq	(%rcx), %ymm0
vpbroadcastq	(%r10), %ymm1
andq	$-4, %r8
leaq	1(%r8), %rdi
; │ @ REPL[84]:3 within `batched_mul2'
; │┌ @ array.jl:728 within `getindex'
L144:
vmovdqu	(%rbx,%rax,8), %ymm2
vmovdqu	(%rdx,%rax,8), %ymm3
; │└
; │┌ @ REPL[82]:2 within `testfunc2'
; ││┌ @ operators.jl:529 within `*' @ int.jl:54
vpsrlq	$32, %ymm2, %ymm4
vpsrlq	$32, %ymm3, %ymm5
vpmuludq	%ymm5, %ymm2, %ymm5
vpmuludq	%ymm3, %ymm4, %ymm4
vpmuludq	%ymm3, %ymm2, %ymm2
vpaddq	%ymm4, %ymm5, %ymm4
vpsllq	$32, %ymm4, %ymm4
vpaddq	%ymm4, %ymm2, %ymm2
vpsrlq	$32, %ymm2, %ymm4
vpmuludq	%ymm1, %ymm2, %ymm3
vpmuludq	%ymm0, %ymm2, %ymm2
vpmuludq	%ymm0, %ymm4, %ymm4
vpaddq	%ymm4, %ymm3, %ymm3
vpsllq	$32, %ymm3, %ymm3
vpaddq	%ymm3, %ymm2, %ymm2
; │└└
; │┌ @ array.jl:766 within `setindex!'
vmovdqu	%ymm2, (%rsi,%rax,8)
addq	$4, %rax
cmpq	%rax, %r8
jne	L144

My guess, then, is that this version is slower because it has to wait longer before it can start the first set of multiplications. In the fast version, ymm0 and ymm1 are already loaded, so it can proceed with less latency / take better advantage of out-of-order execution / etc.

I’d blame LLVM, not Julia.

Thank you for the analysis; I guess I’ll just have to leave it as is and hope for an improvement in LLVM.