Workflow for monitoring native code size

question
performance

#1

A good measure of performance is length of generated machine code. As a package writer, I would like to monitor how concise the native code generated for my functions is. This would improve the performance and load times of my packages in the long term.

For a practical example, consider the following alternatives for computing the maximum of two numbers:

maximum((1,2))
maximum([1,2])

At first, I guessed that the former option would be more efficient because tuples are immutable and the compiler could do all sorts of things with them. However, when I generate the code, this is what I get:

maximum((1,2))

julia> @code_native maximum((1,2))
	.text
Filename: reduce.jl
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rbx
	subq	$64, %rsp
	movq	%rdi, %r15
	movq	%fs:0, %rbx
	addq	$-10888, %rbx           # imm = 0xD578
	leaq	-64(%rbp), %r14
	vxorps	%ymm0, %ymm0, %ymm0
	vmovups	%ymm0, -64(%rbp)
	movq	$10, -88(%rbp)
	movq	(%rbx), %rax
	movq	%rax, -80(%rbp)
	leaq	-88(%rbp), %rax
	movq	%rax, (%rbx)
	movq	$0, -72(%rbp)
Source line: 454
	movabsq	$140402333763728, %r12  # imm = 0x7FB1F73AC890
	leaq	398277240(%r12), %rax
	movq	%rax, -64(%rbp)
	leaq	398277144(%r12), %rax
	movq	%rax, -56(%rbp)
	leaq	398277080(%r12), %rax
	movq	%rax, -48(%rbp)
	movabsq	$jl_gc_pool_alloc, %rax
	movl	$1456, %esi             # imm = 0x5B0
	movl	$32, %edx
	movq	%rbx, %rdi
	vzeroupper
	callq	*%rax
	leaq	397451744(%r12), %rcx
	movq	%rcx, -8(%rax)
	vmovups	(%r15), %xmm0
	vmovups	%xmm0, (%rax)
	movq	%rax, -40(%rbp)
	movabsq	$jl_invoke, %rax
	movl	$4, %edx
	movq	%r12, %rdi
	movq	%r14, %rsi
	callq	*%rax
	movq	%rax, -72(%rbp)
	movq	(%rax), %rax
	movq	-80(%rbp), %rcx
	movq	%rcx, (%rbx)
	addq	$64, %rsp
	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

maximum([1,2])

julia> @code_native maximum([1,2])
	.text
Filename: reduce.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 454
	callq	_mapreduce
	popq	%rbp
	retq
	nopl	(%rax,%rax)

So clearly, I cannot trust my intuition here, and probably not in many other cases either. What workflow do you suggest for tracking these kinds of changes? Is there a package that facilitates such diagnostics? I wonder if something like __precompile__() could be extended to warn package writers whenever a function is re-implemented in a way that causes a large increase in machine code.

Related to this issue, it would be nice if I could start Julia in a “warn_type” mode. That is, every command I type in the REPL would give me a warning if there is type instability. Adding @code_warntype everywhere by hand is not very efficient for someone who is only interested in implementing a cool new feature in a package. I’d rather have the warning from the start than have to go back in a second pass to optimize the code.


#2

A good measure of performance is length of generated machine code

I don’t think it’s a particularly good measure, and the example you’ve given demonstrates that. You claim that the shorter native code of the second method means it’s more “efficient”. But your second example includes a callq instruction, which is just a jump into another function which isn’t shown because it isn’t inlined. The length of the native code tells you nothing at all about how efficient that second call is, because it tells you nothing at all about what happens inside callq _mapreduce.
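To make the point about the hidden callee concrete, here is a rough sketch that counts the lines of native code printed for each call. It assumes a recent Julia 1.x where code_native accepts an IO argument and the debuginfo keyword; the helper name native_line_count is made up for illustration. The count only covers the outermost function, so anything behind a call instruction is invisible to it.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical helper: approximate number of native-code lines that Julia
# generates for `f` called with argument types `argtypes`.
function native_line_count(f, argtypes)
    asm = sprint(io -> code_native(io, f, argtypes; debuginfo=:none))
    count(split(asm, '\n')) do line
        s = strip(line)
        # Skip blank lines, assembler directives (.text) and comment lines (; or #).
        # Labels may still be counted, so treat the result as approximate.
        !isempty(s) && !(first(s) in ('.', ';', '#'))
    end
end

tuple_lines = native_line_count(maximum, (Tuple{Int,Int},))
array_lines = native_line_count(maximum, (Vector{Int},))
println("tuple: ", tuple_lines, " lines, array: ", array_lines, " lines")
```

The numbers depend on the Julia version and CPU, and a short listing that ends in a call says nothing about the cost of the function being called.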

Beyond that, it’s trivial to write an arbitrarily small amount of assembly that will execute forever, and there cannot exist a general way of knowing how many instructions a given computation will require.

This just really seems like the wrong kind of thing to be concerned about. Why not just measure performance instead?


#3

Related to this issue, it would be nice if I could start Julia in a “warn_type” mode.

I have found myself wanting something like that in the past, but I think that enabling it everywhere would be pretty annoying. Just as an example, check out the Expr type (which holds all Julia expressions internally):

julia> dump(Expr)
Expr <: Any
  head::Symbol
  args::Array{Any,1}
  typ::Any

Expr includes an ::Any field and a ::Vector{Any} field, so any usage of Exprs (e.g. parsing, compiling, macros, etc.) would have to trigger type warnings. That’s probably too many warnings.

On the other hand, what about a macro that decorates a given block of code and warns if type instability happens inside that block? That might be similarly useful without as much noise.

Edit: for example, I bet one could write a macro that takes a function definition f(x, y) = <function body> and turns it into:

_f(x, y) = <original function body>

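@generated function f(x, y)
  # inside a generated function, `x` and `y` are bound to the argument *types*
  # check type-stability of `_f` with input types `x` and `y` using code_warntype
  # (`is_type_stable` is a placeholder for the result of that check)
  if is_type_stable
    return :(_f(x, y)) # no `$` here: interpolating would pass the types, not the values
  else
    warn("not type stable") # or you can make this an error,
    # or you can put it inside the quoted expr so that it shows *every* time
    return :(_f(x, y))
  end
end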

(although turning regular functions into @generated functions might have other performance consequences…)
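A lighter-weight way to run the same kind of check, without touching @generated functions, might look like the sketch below. It assumes Julia 1.x; the helper is_type_stable and the two example functions are hypothetical, and Base.return_types is an unexported function whose behavior may change between versions. The @inferred macro from the Test standard library does a similar check on a single call.

```julia
using Test  # standard library; provides @inferred

# Hypothetical helper: true when inference finds only concrete return types
# for `f` called with argument types `argtypes`.
is_type_stable(f, argtypes) = all(isconcretetype, Base.return_types(f, argtypes))

stable(x) = x > 0 ? x : zero(x)   # always returns a value of typeof(x)
unstable(x) = x > 0 ? x : 0.0     # returns an Int or a Float64 when x is an Int

println(is_type_stable(stable, (Int,)))    # prints true
println(is_type_stable(unstable, (Int,)))  # prints false

@inferred stable(1)       # passes silently and returns 1
# @inferred unstable(1)   # would throw, because the inferred type is a Union
```

Such checks can live in a package's test suite, so the warning only fires for the functions one has chosen to keep type-stable.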


#4

That would become very annoying very quickly (e.g., a call to plot would give a lot of warnings).

IMO the key to using Julia well is to strive for type stability for parts of the code which take most of the time, and allow higher level functions to be flexible. Type stability is a tool, not an end in itself.
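One common way to get this split is a function barrier: a flexible outer function hands its data to a small, type-stable kernel. The sketch below is only an illustration of that idea; the names process and sum_squares are made up.

```julia
# Hot kernel: type-stable for any concrete element type T.
function sum_squares(xs::AbstractVector{T}) where {T<:Number}
    acc = zero(T)
    for x in xs
        acc += x * x
    end
    return acc
end

# Flexible entry point: the type of `data` depends on the runtime input,
# so this function is not type-stable, and that is fine.
function process(input)
    data = input isa AbstractString ? parse.(Float64, split(input, ',')) : collect(input)
    return sum_squares(data)  # function barrier: the kernel is compiled per concrete type
end

println(process("1,2,3"))  # prints 14.0
println(process(1:3))      # prints 14
```

The dynamic dispatch happens once at the call to sum_squares, while the loop that does the actual work runs on concretely typed data.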


#5

By the way, I just ran those two examples in Julia v0.6 and got:

julia> @btime maximum($((1,2)))
  27.585 ns (1 allocation: 32 bytes)
2

julia> @btime maximum($([1,2]))
  5.288 ns (0 allocations: 0 bytes)
2

but on nightly, I get:

julia> @btime maximum($((1,2)))
  3.448 ns (0 allocations: 0 bytes)
2

julia> @btime maximum($([1,2]))
  5.175 ns (0 allocations: 0 bytes)
2

So the Tuple version got about 8X faster! Not strictly relevant to the discussion, but just more reasons to be excited for Julia 0.7/1.0 :fireworks:


#6

Not really.

No, you don’t. You want to measure the performance. Use BenchmarkTools.jl and perhaps PkgBenchmark.jl.


#7

Type-unstable code is very handy and there are good use cases for it, so a function that warns of every type instability is just not useful.

Secondly, to paraphrase @yuyichao:

If you are not a compiler or a cpu, don’t look at @code_native.

The right thing for your use case is to use BenchmarkTools and PkgBenchmark to monitor the performance of core functions, as we do with Base Julia.
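As a minimal sketch of what such a setup could look like, assuming BenchmarkTools.jl is installed: PkgBenchmark.jl conventionally reads a benchmark group named SUITE from benchmark/benchmarks.jl in the package. The group and key names below are arbitrary choices for this example.

```julia
using BenchmarkTools  # third-party package; must be installed first

# Benchmark suite, as it might appear in benchmark/benchmarks.jl.
const SUITE = BenchmarkGroup()
SUITE["maximum"] = BenchmarkGroup()
SUITE["maximum"]["tuple"] = @benchmarkable maximum($((1, 2)))
SUITE["maximum"]["array"] = @benchmarkable maximum($([1, 2]))

# Run the suite by hand: tune the evaluation counts, then collect samples.
tune!(SUITE)
results = run(SUITE; seconds=1)

for name in ("tuple", "array")
    est = minimum(results["maximum"][name])  # minimum-time estimate for this benchmark
    println(name, ": ", time(est), " ns, ", allocs(est), " allocations")
end
```

Storing the results of one version and comparing them against the next is then a matter of tooling (BenchmarkTools has judge for this, and PkgBenchmark automates it across commits), which tracks what actually matters: run time and allocations.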


#8

not *

I should add that looking at @code_native for inspiration is fine, but unless you are actually trying to fix a compiler issue it’s basically never helpful.


#9

Or check BenchmarkTools once again. :stuck_out_tongue:


#10

That would be awesome, very useful to have. :+1:


#11

Interesting, thanks for sharing. What about the machine code that is generated: does the ratio stay the same as in my example, or has it changed?


#12

The native code shows a call instruction in both cases on 0.7, so it’s not particularly informative.