How about the following example then, without invokelatest's trickery:
julia> function my_operation_typed(x)
           result = 0
           for i in 1:x
               result += i*x
           end
           return result
       end
my_operation_typed (generic function with 1 method)
julia> function my_operation_untyped(input)
           x = Base.inferencebarrier(input)  # identity at run time, but inferred as Any
           result = 0
           for i in 1:x
               result += i*x
           end
           return result
       end
my_operation_untyped (generic function with 1 method)
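Before even looking at native code, the difference shows up at the inference level. A quick check (I note the inferred results as comments rather than pasting exact output, which varies slightly across Julia versions):

julia> Base.return_types(my_operation_typed, (Int,))    # returns [Int64]: fully inferred

julia> Base.return_types(my_operation_untyped, (Int,))  # returns [Any]: inference gives up at the barrier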
julia> @code_native debuginfo=:none dump_module=false my_operation_typed(4)
        .text
        testq %rdi, %rdi
        jle L47
        movq %rdi, %rax
        sarq $63, %rax
        andnq %rdi, %rax, %rax
        leaq -1(%rax), %rdx
        leaq -2(%rax), %rcx
        mulxq %rcx, %rcx, %rdx
        shldq $63, %rcx, %rdx
        leaq (%rdx,%rax,2), %rax
        decq %rax
        imulq %rdi, %rax
        retq
L47:
        xorl %eax, %eax
        retq
        nopw %cs:(%rax,%rax)
julia> @code_native debuginfo=:none dump_module=false my_operation_untyped(4)
        .text
        pushq %rbp
        movq %rsp, %rbp
        pushq %r15
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %rbx
        andq $-32, %rsp
        subq $160, %rsp
        movq %rdi, %r14
        vxorps %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 56(%rsp)
        vmovaps %ymm0, 32(%rsp)
        movq %fs:0, %rax
        movq -8(%rax), %rcx
        movq $20, 32(%rsp)
        movq (%rcx), %rax
        movq %rax, 40(%rsp)
        leaq 32(%rsp), %rax
        movq %rax, (%rcx)
        movq %rdi, %rax
        sarq $63, %rax
        andnq %rdi, %rax, %rbx
        movq %rcx, 112(%rsp)
        movq 16(%rcx), %rdi
        movabsq $ijl_gc_pool_alloc, %rax
        movl $1440, %esi # imm = 0x5A0
        movl $32, %edx
        vzeroupper
        callq *%rax
        movq %rax, %r13
        movabsq $jl_system_image_data, %rax
        movq %rax, -8(%r13)
        movq $1, (%r13)
        movq %rbx, 8(%r13)
        movq %r13, 72(%rsp)
        movq %r13, 16(%rsp)
        movabsq $ijl_apply_generic, %rax
        movabsq $jl_system_image_data, %rdi
        leaq 16(%rsp), %rsi
        movl $1, %edx
        callq *%rax
        movabsq $140278254899208, %r15 # imm = 0x7F95138D9008
        cmpq %r15, %rax
        je L476
        movq %rax, %r12
        movabsq $140278254932000, %rbx # imm = 0x7F95138E1020
        movq %r13, 128(%rsp)
        movq %r14, 120(%rsp)
        nopw %cs:(%rax,%rax)
L256:
        movq %rbx, 64(%rsp)
        movq %r12, 48(%rsp)
        movq %r12, %rdi
        xorl %esi, %esi
        movabsq $ijl_get_nth_field_checked, %r13
        callq *%r13
        movq %rax, %r14
        movq %rax, 56(%rsp)
        movl $1, %esi
        movq %r12, %rdi
        callq *%r13
        movq %rax, %r12
        movq %rax, 80(%rsp)
        movq 120(%rsp), %rdi
        movabsq $ijl_box_int64, %rax
        callq *%rax
        movq %rax, 48(%rsp)
        movq %r14, 16(%rsp)
        movq %rax, 24(%rsp)
        movabsq $jl_system_image_data, %rdi
        leaq 16(%rsp), %r13
        movq %r13, %rsi
        movl $2, %edx
        movabsq $ijl_apply_generic, %r14
        callq *%r14
        movq %rax, 48(%rsp)
        movq %rbx, 16(%rsp)
        movq %rax, 24(%rsp)
        movabsq $jl_system_image_data, %rdi
        movq %r13, %rsi
        movl $2, %edx
        callq *%r14
        movq %rax, %rbx
        movq %rax, 48(%rsp)
        movq 128(%rsp), %rax
        movq %rax, 16(%rsp)
        movq %r12, 24(%rsp)
        movabsq $jl_system_image_data, %rdi
        movq %r13, %rsi
        movl $2, %edx
        callq *%r14
        movq %rax, %r12
        cmpq %r15, %rax
        jne L256
        jmp L486
L476:
        movabsq $140278254932000, %rbx # imm = 0x7F95138E1020
L486:
        movq 40(%rsp), %rax
        movq 112(%rsp), %rcx
        movq %rax, (%rcx)
        movq %rbx, %rax
        leaq -40(%rbp), %rsp
        popq %rbx
        popq %r12
        popq %r13
        popq %r14
        popq %r15
        popq %rbp
        retq
        nopw %cs:(%rax,%rax)
julia> using BenchmarkTools
julia> @benchmark my_operation_typed($1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.160 ns … 19.701 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.170 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.243 ns ± 0.463 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▂ ▂▅▂ ▃▃ ▁
█████▃▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅██ █
3.16 ns Histogram: log(frequency) by time 3.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark my_operation_untyped($1000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 95.082 μs … 3.939 ms ┊ GC (min … max): 0.00% … 95.81%
Time (median): 98.370 μs ┊ GC (median): 0.00%
Time (mean ± σ): 103.553 μs ± 67.072 μs ┊ GC (mean ± σ): 2.68% ± 4.20%
▂▃▆██▇▆▃▁ ▁▁▂▄▄▄▄▃▂▁▃▃▃▂▁ ▂
▇██████████▇█████████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▅▅▆▆▆▅▃▄▁▅▅▃ █
95.1 μs Histogram: log(frequency) by time 128 μs <
Memory estimate: 93.44 KiB, allocs estimate: 4979.
I believe it's fairly illustrative: the untyped version is littered with calls to ijl_apply_generic (dynamic dispatch) and ijl_gc_pool_alloc (boxing), while the typed version compiles down to a short closed-form computation with no loop at all.

And even the 3 ns of the first benchmark are likely mostly benchmark noise, so the roughly 30,000× performance difference (95 μs vs. 3.2 ns) could be even higher…

But basically, not being able to infer types means that every function call involving an untyped value needs to be dynamically dispatched, which involves searching the method tables and picking the most specific method among those whose signature the call's argument types subtype… All of which takes a lot of time. And this is required even for very basic operations like + or *, since the compiler does not know that the involved values are simply Int, for instance.
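To isolate that dispatch cost on a single +, a classic trick is to read the operands from a Ref{Any}, whose contents inference cannot see through (a minimal sketch; absolute timings depend on the machine, so I only indicate the inferred types in comments):

julia> r_typed = Ref{Int}(1); r_any = Ref{Any}(1);

julia> @btime $r_typed[] + $r_typed[];  # r_typed[] is inferred as Int: + is statically dispatched

julia> @btime $r_any[] + $r_any[];      # r_any[] is inferred as Any: + goes through dynamic dispatch and boxes its result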
Fortunately, duck typing, which I understand as simply not specifying types, usually leads to type-inferrable calls; the main caveats are described at Performance Tips · The Julia Language. Among them are untyped global variables (they thwart type inference as soon as they are encountered) and structs with untyped or abstractly typed fields. I think we can add untyped empty arrays (a plain [] without a type annotation like Int[] or Union{Float64,Missing}[]), and together these should account for a large portion of the performance lost by newcomers to Julia.
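To make those concrete, here is a minimal sketch of each pitfall next to its usual fix (the struct names are made up for illustration; on Julia ≥ 1.8 a global can also be type-annotated, e.g. counter::Int = 0):

julia> counter = 0;               # untyped global: every read is inferred as Any

julia> const COUNTER = 0;         # fix: a const global has a known, fixed type

julia> struct PointLoose          # untyped / abstractly typed fields: p.x is inferred as Any
           x
           y::Number
       end

julia> struct PointTight{T<:Real} # fix: concrete or parametric field types
           x::T
           y::T
       end

julia> a = [];                    # Vector{Any}: element reads defeat inference downstream

julia> b = Int[];                 # fix: a typed empty array

julia> c = Union{Float64,Missing}[];  # or a small Union when missings are expected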