Why is [x]or (a function call) so long/slow (not inlined)?


#1

[EDIT: you can ignore A. it was kind of a false alarm. B. is still a little interesting to me, in general, maybe not in the case of A. though.]

A.
I thought I would/could optimize uppercase (and others, I found in [PR] Julia standard library…):

First I tried changing to:

uppercase(c::Char) = isascii(c) ? ('a' <= c <= 'z' ? c | 0x20 : c) : Char(ccall(:utf8proc_toupper, UInt32, (UInt32,), c))

@code_native new_uppercase('p') # much longer, than original below in PR

Minimal code doesn’t help for xor unlike or:

@inline new_uppercase(c) = xor(c, 0x20) # this is still long/slow, but not with *or*, c | 0x20 # operators shouldn't be faster as just syntatic sugar?

@code_native new_uppercase(2)
lowercase(c::Char) = isascii(c) ? ('A' <= c <= 'Z' ? c + 0x20 : c) : Char(ccall(:utf8proc_tolower, UInt32, (UInt32,), c))
uppercase(c::Char) = isascii(c) ? ('a' <= c <= 'z' ? c - 0x20 : c) : Char(ccall(:utf8proc_toupper, UInt32, (UInt32,), c))
titlecase(c::Char) = isascii(c) ? ('a' <= c <= 'z' ? c - 0x20 : c) : Char(ccall(:utf8proc_totitle, UInt32, (UInt32,), c))

B.
For strings, not Char literals, should map be able to inline uppercase?:

new_uppercase(s::AbstractString) = @inbounds map(new_uppercase, s)

#2

Don’t use the length of @code_native as a proxy for performance. It is not representative of the performance of a given function at all. The way to measure this is by actually running it — @benchmark from BenchmarkTools.jl is particularly good.


#3

Please quote code using backticks.


#4

Although benchmarking is the best way to measure performance, it is sometimes useful to look at the compiler output (though @code_llvm is almost always more useful/readable than @code_native). If you do this, you see that the xor function (in Julia 0.6 … previously $) is compiled to a single instruction and is generally inlined when called from other functions:

julia> @code_llvm xor(3,2)
define i64 @jlsys_xor_45254(i64, i64) #0 !dbg !5 {
top:
  %2 = xor i64 %1, %0
  ret i64 %2

julia> @code_native xor(3,2)
	.section	__TEXT,__text,regular,pure_instructions
Filename: int.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 172
	xorq	%rsi, %rdi
	movq	%rdi, %rax
	popq	%rbp
	retq
Source line: 172
	nopl	(%rax)
}

julia> f(x) = xor(x, 0x20)
f (generic function with 1 method)

julia> @code_llvm f(UInt32('x'))
define i32 @julia_f_61856(i32) #0 !dbg !5 {
top:
  %1 = xor i32 %0, 32
  ret i32 %1
}

#5

Note that c | 0x20 won’t actually work (it will throw an exception) for c::Char. You need to do Char(UInt32(c) | 0x20) to convert it to/from an integer in order to do arithmetic like | or xor.


#6

Oh absolutely, @code_llvm and @code_native can be extremely valuable tools. But just counting the lines of assembly or LLVM IR is going to be wildly misleading.


#7

Thanks for answering, I actually got the code down to the one assembly instruction (orq, or xorq depending) instead of addition, as I wanted.

[I realized, xor isn’t in 0.5 only $… should have run the code…]

I however get an extra movq [EDIT: is in non-minimal example anyway] after that I do not want (and will probably figure out, maybe plus/minus was optimal after all then I abandon this part of the plan…):

julia> new_uppercase(c::Int64) = @compat xor(c, 1)

julia> @code_native new_uppercase(0)
	.text
Filename: REPL[29]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 1
	xorq	$1, %rdi
	movq	%rdi, %rax
	popq	%rbp
	retq
	nopl	(%rax)

[deleted slow code, as I posed better]


#8

Thanks for reminding me; but either I did, or somebody fixed my post(?). I’m new to discourse, not sure if possible, at least do not see it.


#9

Yes… but not only, this also applies to plus/minus… I got my code down to optimal, I’m not sure this ceremony should be needed… I guess some convert between Char and UIn32 would be “needed” for that?

OR probably not; that’s only for implicit casts, that was already happening. Shouldn’t [those] implicit casts be as fast as explicit?

I really like Julia’s (theoretical with no fuss [here], often in practice) “zero-overhead types”: I can really live with this, as I got:

julia> new_uppercase(c::Char) = isascii(c) ? ('a' <= c <= 'z' ? Char(xor(UInt32(c), 0x20)) : c) : Char(ccall(:utf8proc_toupper, UInt32, (UInt32,), c))

julia> @code_native new_uppercase('p')
	.text
Filename: REPL[84]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 1
	cmpl	$127, %edi
	jbe	L23
	movabsq	$utf8proc_toupper, %rax
	callq	*%rax
	popq	%rbp
	retq
L23:
	cmpl	$97, %edi
	jb	L33
	cmpl	$123, %edi
	jb	L37
L33:
	movl	%edi, %eax
	popq	%rbp
	retq
L37:
	xorl	$32, %edi
	movl	%edi, %eax
	popq	%rbp
	retq
	nopl	(%rax)


With plus (as in Julia currently) I get an extra (this is I guess what I think it is), not a big problem as not on the critical path(?):

L42:
	movabsq	$jl_throw, %rax
	movabsq	$140102149303264, %rdi  # imm = 0x7F6C12D76BE0
	callq	*%rax

#10

Isn’t this just returning the value? I am pretty sure there is no actual overhead here.


#11

I’m not sure what you’re refering to (original questoin didn’t include the, supposed long assembly).

Anyway I got the core tests down to no jumps (and then mispredictions or branch bubbles):

[Would Julia never be clever enough to see this? I think there might be a way to help it get at least partly three, with straightforward code.]

@inline new_uppercase(c::Char) = begin test1=('a' <= c); test2=(c <= 'z'); res = Char(@compat xor(UInt32(c), (test1&test2)<<5)); end

julia> new_uppercase('p')
'P'

julia> @code_native new_uppercase('p')
	.text
Filename: REPL[58]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 28
	leal	-97(%rdi), %eax
	cmpl	$26, %eax
	sbbl	%eax, %eax
Source line: 1
	andl	$32, %eax
	xorl	%edi, %eax
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

Before the compare for “between” needed:

	cmpl	$97, %edi
	jb	L36
	cmpl	$123, %edi
	jb	L40

#12
julia> @inline new_uppercase(c::Char) = begin test1=('a' <= c); test2=(c <= 'z'); res = Char(@compat xor(UInt32(c), (test1&test2)<<5)); return isascii(c) ? res : Char(ccall(:utf8proc_toupper, UInt32, (UInt32,), c)) end

c = rand(Char, 5000000);

julia> @time map(new_uppercase, c)
  0.005856 seconds (8 allocations: 19.074 MB)

julia> @time map(uppercase, c)
  0.024926 seconds (8 allocations: 19.074 MB) # for a bigger array, not much difference as running on memory speed off-cache

[This function isn’t super-important; this was just a Julia-experiment, I want to know similar can be applied elsewhere (where?) preferally if codegen can be changed to be better.]

Make a PR? Any (obvious) error in my code? Is a more readable (and slower, e.g. original) preferred? Maybe I can fix that, and the annoying “jb L26” (why is the “then” part taken as more likely? !isascii didn’t help…). I want ASCII (English) as the common case for a fast path:

julia> @code_native new_uppercase('p')
	.text
Filename: REPL[91]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 1
	cmpl	$128, %edi
	jb	L26
	movabsq	$utf8proc_toupper, %rax
	callq	*%rax
	popq	%rbp
	retq
Source line: 28
L26:
	leal	-97(%rdi), %eax
	cmpl	$26, %eax
	sbbl	%eax, %eax
Source line: 1
	andl	$32, %eax
	xorl	%edi, %eax
	popq	%rbp
	retq
	nopl	(%rax)

#13

One trivia question:

retq
nopl	(%rax)

Why is there a NOP (no operation) after RET[urn]?!


#14

#15

I’m not sure your link applies (and even if it did; I argue Julia should suppress on output, when at the end, to not confuse readers). I already thought of “to force memory alignment, to prevent hazards, to occupy a branch delay slot”, i.e. alignment (not “prevent hazards” that I’m not sure about…) as in an example there before “_main”:

662e0f1f840000000000    nopw    %cs:(%rax,%rax)
_main:

[Is that a 20-byte NOP, wow? I guess Icache-line alignment best… And it’s then 32-byte cache-lines? Not same as Dcache? Needs not be…]

My point is, for my code no code (of mine) follows the RETq, the NOP, in my case nopl, will never get “executed”… I believe “branch delay slot”; I’m pretty sure x86 doesn’t have them (and the WP article on them, doesn’t say they have).

Delay slots could still be a reason… If LLVM has them by default (generated, not as input), in case another backend needs them - but no currently supported CPUs by Julia have them… (e.g. neither PowerPC or ARM; but MIPS does). LLVM had no NOP after RET, so I guess the bug is in LLVM (it takes up at least Icache space… tracecache matter for x86?)

Compared to @code_native (with nopl), no kind of NOP with @code_llvm:

define i32 @julia_new_uppercase_70503(i32) #0 {
pass:
  %1 = icmp ugt i32 %0, 127
  br i1 %1, label %L, label %if

L:                                                ; preds = %pass
  %2 = call i32 inttoptr (i64 139713638493872 to i32 (i32)*)(i32 %0)
  ret i32 %2

if:                                               ; preds = %pass
  %.off = add i32 %0, -97
  %3 = icmp ult i32 %.off, 26
  %4 = zext i1 %3 to i32
  %5 = shl nuw nsw i32 %4, 5
  %6 = xor i32 %5, %0
  ret i32 %6
}

#16

Why wouldn’t my link apply? You asked why there was a nop after ret. The link answers it.

It would surely be more confusing to lie to the user of what assembly code is actually being generated.


#17

The link applies (or the reasoning), when your own code jumps to the code after the NOP, as your link shows. Ok, maybe the link strictly applies also in my case, explaining why it’s generated (but I believe may be in error) also at the end.

Your logic implies, really, that code_native should also show the code following - some unrelated code - maybe code yet to be compiled… Showing some other, already compiled code, would at least would be confusing… :slight_smile:

In other words, the gap between your function and a completely different one is none of your concern. If you want to see it, you can always grab the next disassembler…

It’s not hugely important for me to get rid of it… And it should be there, for alignment, whether or not shown.


#18

It applies even when there is no instruction after the nop, http://stackoverflow.com/questions/7912464/why-does-gcc-pad-functions-with-nops

No it doesn’t? The nop is a part of the assembly generated for this function, not some other function.


#19

Being concerned about “casual readers” of assembly code being confused seems… um, what’s a nice word for “absurd”? As @kristoffer.carlsson has pointed out, this is part of the assembly generated for the function in question. Anyone who’s really confused could spend a few minutes StackOverflow and figure it out.


#20

I might advise against following this rabbit hole too far. You’ll quickly discover lots of really unexpected corner cases and eventually decide that you’ll just prefer to accept whatever heuristics are in use by your debugger on your platform. Some highlight though anyways:

  • the end of a function isn’t defined on all platforms
  • the meaning of “function” is different depending on the linker being targeted
  • the body of the function doesn’t need be consecutive in memory
  • the body of the function doesn’t need to be entirely code
  • there is not a one-to-one correspondence between the lines of code in the source function and the assembly output. (compilers actually implement heuristics here that get output into the debug info, so they can get parsed by the debugger to help it guess what you meant.)
  • the assembly itself in a function is not entirely unambiguous. For example, x86 can’t really tell whether something is a loop, function call, or what the target of either will be. Other platforms have ambiguous (https://github.com/JuliaLang/julia/blob/794c0ac1068fd3683bb4027899001b9c4b5590d9/src/disasm.cpp#L692) encodings or have instructions whose meanings can differ depending on what coprocessor is present on your machine (https://github.com/JuliaLang/julia/blob/794c0ac1068fd3683bb4027899001b9c4b5590d9/src/disasm.cpp#L678).

…and that’s even before you consider the impact of inlining, out-lining, and other IPO optimizations.

But then again, if that sounds like your cup of tea, PRs for better heuristics are always welcome!