How to implement Lionel Zoubritzky's removetype() function?

I am so impressed by the clarity of @liozou’s presentation of the Bezanson et al. paper “Julia: Dynamism and Performance Reconciled by Design”. Unfortunately I can’t seem to find a PDF of the slides, and I would love to replicate his example of the cost of devirtualization and inlining.

It relies on the function removetype, which I believe must be his own work. As far as I can tell, removetype(x) returns a value that has exactly the same data as x, but whose metadata has been changed so that its type is now Any.

Any idea how one might write such a function?

That is not possible, since all values in Julia have a concrete type and Any is abstract.
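For example:

julia> isconcretetype(typeof(3)), isabstracttype(Any)
(true, true)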

But perhaps you want a function removetype(x) which returns x, but whose return type Julia’s type inference cannot figure out?

If so, the following should work.

removetype(x) = Base.invokelatest(identity, x)
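You can check that inference has indeed given up on the result (a quick sanity check; the exact printing may differ across Julia versions):

julia> Base.return_types(removetype, (Int,))
1-element Vector{Any}:
 Any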
5 Likes

Thanks for the suggestion! Works perfectly.

Hi there! Thank you for the kind words :smile:
That talk is a bit old but still mostly accurate, I believe! The way I implemented it at the time was

function removetype(x)
    anyref = Ref{Any}(x)  # store x in a container whose element type is Any
    return anyref[]       # reading it back is then inferred as Any
end

but there is now the Base.inferencebarrier function which is probably what you are looking for.
The previous answer with Base.invokelatest also works perfectly, but it incurs an additional cost at runtime (it actually has to look up the latest method of identity on every call). Maybe you don’t care about that at all, but it is important to note if you want to measure the impact of lacking accurate type inference alone.
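A quick way to check that a given barrier still works on your Julia version is Test.@inferred, which throws when the inferred and actual return types differ (a small sketch):

julia> using Test

julia> removetype(x) = Base.inferencebarrier(x)
removetype (generic function with 1 method)

julia> @inferred removetype(3)
ERROR: return type Int64 does not match inferred return type Any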

4 Likes

I would actually fairly strongly discourage any method other than Base.inferencebarrier for this, since any other method may break in the future as the compiler gets smarter.

5 Likes

:+1: In particular, the implementation of that function used to be what @Liozou proposed above, but then the compiler got smarter and that implementation was no longer effective. Putting in an appropriately scoped barrier will prevent future versions from breaking your code.

3 Likes

I am still a bit puzzled. If I redefine as follows:
removetype(x) = Base.inferencebarrier(x)
then the result of @code_native (here on a wrapper no_type that just calls removetype) is far less impressively bad:

julia> @code_native debuginfo=:none no_type(3)
	.text
	.file	"no_type"
	.globl	julia_no_type_735               # -- Begin function julia_no_type_735
	.p2align	4, 0x90
	.type	julia_no_type_735,@function
julia_no_type_735:                      # @julia_no_type_735
	.cfi_startproc
# %bb.0:                                # %top
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$j_removetype_737, %rax
	callq	*%rax
	ud2
.Lfunc_end0:
	.size	julia_no_type_735, .Lfunc_end0-julia_no_type_735
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

just two extra instructions, as opposed to maybe 20 when removetype uses invokelatest… or am I missing something?

(the with_type call used only imulq and retq, while no_type uses subq, movabsq, callq and ud2 … I have no clear idea what these are; presumably they are machine instructions?)

callq means it’s calling some other code. You can’t judge the complexity of the machine code without knowing the complexity of the code it’s calling.
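For instance, you can pull up the callee with the same macro to see what’s hiding behind that callq (output omitted here):

julia> @code_native debuginfo=:none removetype(3)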

1 Like

Yes, I wondered about that. But since I need an example that is demonstrably more complex, I will use the one relying on invokelatest at the risk of it going stale.

invokelatest will be at least 10x slower in the best case.
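If you want to measure just that overhead, something like the following should show it (a sketch with hypothetical names; exact timings vary by machine and Julia version):

using BenchmarkTools

via_barrier(x) = Base.inferencebarrier(x)       # returns x, merely inferred as Any
via_latest(x)  = Base.invokelatest(identity, x) # dynamic method lookup on every call

@btime via_barrier($3)   # near-free at runtime: the barrier only blinds inference
@btime via_latest($3)    # noticeably slower: pays for the lookup each time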

Keno, I really appreciate your comments and please excuse me for taking so much of your time!

For the audience I am speaking to, a careful, honest, and detailed account is not as important as very briefly illustrating the effect of not being able to infer types.

But this also illustrates for me how difficult the whole topic of “Why is Julia fast?” (or better, “Why is Julia not slow?”) is for non-experts like me. For example, duck typing (as I understand the term) will probably correctly detect the type of almost all 64-bit values in normal usage. OTOH, I guess duck typing comes with overhead as well.

How about the following example then, without invoking invokelatest’s trickery:

julia> function my_operation_typed(x)
           result = 0
           for i in 1:x
               result += i*x
           end
           return result
       end
my_operation_typed (generic function with 1 method)

julia> function my_operation_untyped(input)
           x = Base.inferencebarrier(input)
           result = 0
           for i in 1:x
               result += i*x
           end
           return result
       end
my_operation_untyped (generic function with 1 method)

julia> @code_native debuginfo=:none dump_module=false my_operation_typed(4)
	.text
	testq	%rdi, %rdi
	jle	L47
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rax
	leaq	-1(%rax), %rdx
	leaq	-2(%rax), %rcx
	mulxq	%rcx, %rcx, %rdx
	shldq	$63, %rcx, %rdx
	leaq	(%rdx,%rax,2), %rax
	decq	%rax
	imulq	%rdi, %rax
	retq
L47:
	xorl	%eax, %eax
	retq
	nopw	%cs:(%rax,%rax)

julia> @code_native debuginfo=:none dump_module=false my_operation_untyped(4)
	.text
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$160, %rsp
	movq	%rdi, %r14
	vxorps	%xmm0, %xmm0, %xmm0
	vmovups	%ymm0, 56(%rsp)
	vmovaps	%ymm0, 32(%rsp)
	movq	%fs:0, %rax
	movq	-8(%rax), %rcx
	movq	$20, 32(%rsp)
	movq	(%rcx), %rax
	movq	%rax, 40(%rsp)
	leaq	32(%rsp), %rax
	movq	%rax, (%rcx)
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rbx
	movq	%rcx, 112(%rsp)
	movq	16(%rcx), %rdi
	movabsq	$ijl_gc_pool_alloc, %rax
	movl	$1440, %esi                     # imm = 0x5A0
	movl	$32, %edx
	vzeroupper
	callq	*%rax
	movq	%rax, %r13
	movabsq	$jl_system_image_data, %rax
	movq	%rax, -8(%r13)
	movq	$1, (%r13)
	movq	%rbx, 8(%r13)
	movq	%r13, 72(%rsp)
	movq	%r13, 16(%rsp)
	movabsq	$ijl_apply_generic, %rax
	movabsq	$jl_system_image_data, %rdi
	leaq	16(%rsp), %rsi
	movl	$1, %edx
	callq	*%rax
	movabsq	$140278254899208, %r15          # imm = 0x7F95138D9008
	cmpq	%r15, %rax
	je	L476
	movq	%rax, %r12
	movabsq	$140278254932000, %rbx          # imm = 0x7F95138E1020
	movq	%r13, 128(%rsp)
	movq	%r14, 120(%rsp)
	nopw	%cs:(%rax,%rax)
L256:
	movq	%rbx, 64(%rsp)
	movq	%r12, 48(%rsp)
	movq	%r12, %rdi
	xorl	%esi, %esi
	movabsq	$ijl_get_nth_field_checked, %r13
	callq	*%r13
	movq	%rax, %r14
	movq	%rax, 56(%rsp)
	movl	$1, %esi
	movq	%r12, %rdi
	callq	*%r13
	movq	%rax, %r12
	movq	%rax, 80(%rsp)
	movq	120(%rsp), %rdi
	movabsq	$ijl_box_int64, %rax
	callq	*%rax
	movq	%rax, 48(%rsp)
	movq	%r14, 16(%rsp)
	movq	%rax, 24(%rsp)
	movabsq	$jl_system_image_data, %rdi
	leaq	16(%rsp), %r13
	movq	%r13, %rsi
	movl	$2, %edx
	movabsq	$ijl_apply_generic, %r14
	callq	*%r14
	movq	%rax, 48(%rsp)
	movq	%rbx, 16(%rsp)
	movq	%rax, 24(%rsp)
	movabsq	$jl_system_image_data, %rdi
	movq	%r13, %rsi
	movl	$2, %edx
	callq	*%r14
	movq	%rax, %rbx
	movq	%rax, 48(%rsp)
	movq	128(%rsp), %rax
	movq	%rax, 16(%rsp)
	movq	%r12, 24(%rsp)
	movabsq	$jl_system_image_data, %rdi
	movq	%r13, %rsi
	movl	$2, %edx
	callq	*%r14
	movq	%rax, %r12
	cmpq	%r15, %rax
	jne	L256
	jmp	L486
L476:
	movabsq	$140278254932000, %rbx          # imm = 0x7F95138E1020
L486:
	movq	40(%rsp), %rax
	movq	112(%rsp), %rcx
	movq	%rax, (%rcx)
	movq	%rbx, %rax
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

julia> using BenchmarkTools

julia> @benchmark my_operation_typed($1000)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.160 ns … 19.701 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.170 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.243 ns ±  0.463 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▄▂                                ▂▅▂                 ▃▃ ▁
  █████▃▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅██ █
  3.16 ns      Histogram: log(frequency) by time      3.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_operation_untyped($1000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   95.082 μs …  3.939 ms  ┊ GC (min … max): 0.00% … 95.81%
 Time  (median):      98.370 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   103.553 μs ± 67.072 μs  ┊ GC (mean ± σ):  2.68% ±  4.20%

   ▂▃▆██▇▆▃▁   ▁▁▂▄▄▄▄▃▂▁▃▃▃▂▁                                 ▂
  ▇██████████▇█████████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆▆▅▅▆▆▆▅▃▄▁▅▅▃ █
  95.1 μs       Histogram: log(frequency) by time       128 μs <

 Memory estimate: 93.44 KiB, allocs estimate: 4979.

I believe it’s fairly illustrative :wink: And even the 3 ns of the typed benchmark are likely mostly benchmark noise, so the ~30000x performance gap could be even larger…

But basically, not being able to infer types means that every function call involving an untyped value must be dynamically dispatched, which involves looking through the method tables and picking the most specific method matching the call… all of which takes a lot of time. And this is required even for very basic operations like + or *, since the compiler does not know that the values involved are simply Int, for instance.
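For completeness, the usual remedy once such an untyped value has entered your code is a type assertion (or a function barrier), after which the compiler can specialize again. A minimal sketch based on my_operation_untyped above (the ::Int assertion is my addition and assumes you know the actual type):

function my_operation_asserted(input)
    x = Base.inferencebarrier(input)::Int   # the assertion restores inference downstream
    result = 0
    for i in 1:x
        result += i*x                       # now compiles just like the typed version
    end
    return result
end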

Fortunately, duck typing, which I understand as simply not specifying types, usually leads to type-inferrable calls; the main caveats are described at Performance Tips · The Julia Language. Among them are untyped global variables (they will thwart type inference as soon as they are encountered), as well as structs with untyped or abstractly typed fields. I think we can add untyped empty arrays (just using [] without a nice type declaration like Int[] or Union{Float64,Missing}[]), and together these should account for a large portion of the performance lost by newcomers to Julia, as sketched below.
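For concreteness, here are minimal sketches of those three pitfalls (hypothetical names; the comments note the usual fixes):

# 1. Untyped global: every use inside a function is inferred as Any.
counter = 0        # fix: const counter = 0, or counter::Int = 0 on Julia ≥ 1.8

# 2. Untyped (or abstractly typed) field: s.x is inferred as Any.
struct Slow
    x              # fix: x::Int, or a type parameter T and x::T
end

# 3. Untyped empty array: eltype Any makes every element access dynamic.
v = []             # fix: Int[] or Union{Float64,Missing}[]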

@liozou, yes, that is an excellent example!

Unfortunately, I go live with my talk in 50 mins, but thanks for sharing so generously!

1 Like

With pleasure, good luck with your presentation then!