# Performance: replace conditional jump by some shift/and/or?

I work on a data type which stores several fields bit-packed into a container c :: Int64. There are two different storage layouts, depending on the most significant bit. A simplified example: if bit 63 is 0, my bit-packed field f is stored in bits 0…3 of c, otherwise in bits 0…7.

An intuitive extraction code for f could be:

``````julia
f1(c::Int64) = c>=0 ? c&7 : c&255
``````

Alternatively, I can avoid conditional jumps and construct the extraction bit mask by extending the sign bit:

``````julia
f2(c::Int64) = c & ((c>>63) & 255 | 7)
``````
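The trick in f2: `c >> 63` is an arithmetic shift, so it yields `0` for non-negative `c` and `-1` (all bits set) for negative `c`; the mask therefore evaluates to `7` or `255`. A quick sanity check that the two extraction functions agree:

```julia
f1(c::Int64) = c>=0 ? c&7 : c&255
f2(c::Int64) = c & ((c>>63) & 255 | 7)

# c >> 63 sign-extends: 0 for c >= 0, -1 (all ones) for c < 0,
# so the mask evaluates to 7 or 255 respectively.
for c in (Int64(0), Int64(100), Int64(-100), typemax(Int64), typemin(Int64))
    @assert f1(c) == f2(c)
end
```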

I expected the less intuitive f2 to be significantly faster when positive and negative values of c occur randomly with similar probability. The reason: a modern CPU has to rebuild its micro-op instruction queue when branch prediction fails, which costs more than 3 additional elementary shift/and/or operations on registers.

I tried to verify this with simple benchmarking, but failed: no significant results, probably due to loop overhead (1 branch and 1 array access per iteration). Results below.

How could I obtain significant benchmark results? Or am I wrong in my assumption that conditional branches are way more expensive than a couple of elementary integer operations?
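One standard way (a sketch of the methodology, not part of my original benchmark) to expose branch-misprediction cost is to feed the branchy version predictable data versus shuffled data; no difference can appear, of course, if the compiler has already removed the branch:

```julia
f1(c::Int64) = c>=0 ? c&7 : c&255

function loopsum(f, data)
    result = Int64(0)
    for x in data
        result += f(x)
    end
    result
end

data   = rand(Int64, 2_000_000)
sorted = sort(data)   # sign is predictable: all negatives first, then all positives

@time loopsum(f1, sorted)   # a branch (if kept by the compiler) predicts well here
@time loopsum(f1, data)     # and would mispredict roughly half the time here
```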

My test code:

``````julia
using Random

f1(c::Int64) = c>=0 ? c&7 : c&255

f2(c::Int64) = c & ((c>>63) & 255 | 7)

function loopf1(data, cases)
    result::Int64 = 0
    for i in 1:cases
        result += f1(data[i])
    end
    result
end

function loopf2(data, cases)
    result::Int64 = 0
    for i in 1:cases
        result += f2(data[i])
    end
    result
end

function simple_benchmark(cases)
    Random.seed!(1234)
    data = rand(Int64, cases)
    @time loopf1(data, cases)
    @time loopf2(data, cases)
    nothing
end
``````

My test results (quite slow notebook with Pentium 4415U, Windows 10 64-bit, Julia 1.0.5):

``````
julia> simple_benchmark(2000000)
0.000872 seconds
0.001007 seconds

julia> simple_benchmark(2000000)
0.001912 seconds
0.001241 seconds

julia> simple_benchmark(2000000)
0.000886 seconds
0.001525 seconds

julia> simple_benchmark(20000000)
0.008750 seconds
0.013753 seconds

julia> simple_benchmark(20000000)
0.009302 seconds
0.008985 seconds

julia> simple_benchmark(20000000)
0.009369 seconds
0.008960 seconds

julia> simple_benchmark(20000000)
0.009334 seconds
0.009140 seconds

``````

I believe the two functions compile to identical code: both end up using bit-twiddling! Julia/LLVM can often turn branches into bit-twiddles or `ifelse`-style selects when it is obvious and advantageous. I think the differences you're seeing are just caching/memory-locality effects. There's also some bounds-checking overhead mixed in; if you remove the bounds checks with `@inbounds`, then it will SIMD the bit-twiddling, in both versions!
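For completeness, a branch-free select can also be requested explicitly with `ifelse`, which evaluates both arms and picks one without a conditional jump (a sketch; `f3` is my name for the variant):

```julia
f1(c::Int64) = c>=0 ? c&7 : c&255
f3(c::Int64) = ifelse(c >= 0, c & 7, c & 255)   # both arms evaluated; no branch in the source

@assert all(f1(c) == f3(c) for c in rand(Int64, 1_000))
```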


Aside: I recommend using BenchmarkTools.jl
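A minimal usage sketch; the `$` interpolation makes the benchmark measure the call itself rather than global-variable access:

```julia
using BenchmarkTools, Random

f2(c::Int64) = c & ((c>>63) & 255 | 7)

Random.seed!(1234)
data = rand(Int64, 2_000_000)

@btime sum(f2, $data)   # `$data` avoids benchmarking global-variable lookup
```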


They are generating different code, and the shift is definitely slower on my i7. I'm using Julia 1.2, and I altered some of the quotes in the generated code's comments to help the syntax highlighter.

``````
julia> c = 1
1

julia> @code_native f1(c)
.text
; ┌ @ timing.jl:1 within 'f1'
; │┌ @ operators.jl:341 within '>='
; ││┌ @ timing.jl:1 within '<='
testq	%rdi, %rdi
; │└└
js	L12
; │┌ @ int.jl:293 within '&'
andl	$7, %edi
; │└
movq	%rdi, %rax
retq
; │┌ @ int.jl:293 within '&'
L12:
movzbl	%dil, %eax
; │└
retq
nopw	%cs:(%rax,%rax)
; └

julia> @code_native f2(c)
.text
; ┌ @ timing.jl:3 within 'f2'
; │┌ @ int.jl:444 within '>>' @ timing.jl:3
movq	%rdi, %rax
sarq	$63, %rax
; │└
; │┌ @ int.jl:293 within '&'
andl	$248, %eax
; │└
; │┌ @ int.jl:316 within '|'
leaq	7(%rax), %rax
; │└
; │┌ @ int.jl:293 within '&'
andq	%rdi, %rax
; │└
retq
nopw	%cs:(%rax,%rax)
; └
``````

Ah, but check out what happens when you put them in a loop

Example
``````
julia> versioninfo()
Julia Version 1.4.0-DEV.374
Commit f806717255 (2019-10-24 17:01 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> f1(c::Int64) = c>=0 ? c&7 : c&255

f2(c::Int64) = c & ((c>>63) & 255 | 7)

function loop(f, data, cases)
result ::Int64 = 0
for i in 1:cases
result += f(data[i])
end
result
end
loop (generic function with 1 method)
``````
``````
julia> @code_llvm debuginfo=:none loop(f1, Int[], 1)

define i64 @julia_loop_17302(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
%2 = icmp sgt i64 %1, 0
br i1 %2, label %L7.L12_crit_edge, label %L35

L7.L12_crit_edge:                                 ; preds = %top
%5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
%7 = icmp eq i64 %6, 0
br i1 %7, label %oob, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %L7.L12_crit_edge
%8 = select i1 %2, i64 %1, i64 0
br label %idxend

L34:                                              ; preds = %idxend
%11 = add nuw i64 %value_phi414, 1
%12 = icmp ult i64 %value_phi414, %6
br i1 %12, label %idxend, label %oob

L35:                                              ; preds = %idxend, %top
%value_phi10 = phi i64 [ 0, %top ], [ %21, %idxend ]
ret i64 %value_phi10

oob:                                              ; preds = %L34, %L7.L12_crit_edge
%value_phi4.lcssa = phi i64 [ 1, %L7.L12_crit_edge ], [ %11, %L34 ]
%13 = alloca i64, align 8
store i64 %value_phi4.lcssa, i64* %13, align 8
call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %14, i64* nonnull %13, i64 1)
unreachable

idxend:                                           ; preds = %idxend.lr.ph, %L34
%15 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi414, %L34 ]
%value_phi414 = phi i64 [ 1, %idxend.lr.ph ], [ %11, %L34 ]
%value_phi313 = phi i64 [ 0, %idxend.lr.ph ], [ %21, %L34 ]
%16 = getelementptr inbounds i64, i64 addrspace(13)* %10, i64 %15
%18 = ashr i64 %17, 63
%19 = and i64 %18, 248
%20 = or i64 %19, 7
%value_phi6 = and i64 %20, %17
%21 = add i64 %value_phi6, %value_phi313
%22 = icmp eq i64 %value_phi414, %8
br i1 %22, label %L35, label %L34
}

julia> @code_llvm debuginfo=:none loop(f2, Int[], 1)

define i64 @julia_loop_17303(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
%2 = icmp sgt i64 %1, 0
br i1 %2, label %L7.L12_crit_edge, label %L38

L7.L12_crit_edge:                                 ; preds = %top
%5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
%7 = icmp eq i64 %6, 0
br i1 %7, label %oob, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %L7.L12_crit_edge
%8 = select i1 %2, i64 %1, i64 0
br label %idxend

L37:                                              ; preds = %idxend
%11 = add nuw i64 %value_phi413, 1
%12 = icmp ult i64 %value_phi413, %6
br i1 %12, label %idxend, label %oob

L38:                                              ; preds = %idxend, %top
%value_phi9 = phi i64 [ 0, %top ], [ %22, %idxend ]
ret i64 %value_phi9

oob:                                              ; preds = %L37, %L7.L12_crit_edge
%value_phi4.lcssa = phi i64 [ 1, %L7.L12_crit_edge ], [ %11, %L37 ]
%13 = alloca i64, align 8
store i64 %value_phi4.lcssa, i64* %13, align 8
call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %14, i64* nonnull %13, i64 1)
unreachable

idxend:                                           ; preds = %idxend.lr.ph, %L37
%15 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi413, %L37 ]
%value_phi413 = phi i64 [ 1, %idxend.lr.ph ], [ %11, %L37 ]
%value_phi312 = phi i64 [ 0, %idxend.lr.ph ], [ %22, %L37 ]
%16 = getelementptr inbounds i64, i64 addrspace(13)* %10, i64 %15
%18 = ashr i64 %17, 63
%19 = and i64 %18, 248
%20 = or i64 %19, 7
%21 = and i64 %20, %17
%22 = add i64 %21, %value_phi312
%23 = icmp eq i64 %value_phi413, %8
br i1 %23, label %L38, label %L37
}
``````

Thanks for your recommendation. I changed the code to this:

``````julia
using Random

@inline f1(c::Int64) = c>=0 ? c&7 : c&255

@inline f2(c::Int64) = c & ((c>>63) & 255 | 7)

a = 0

@inline function loopf1(data, cases, ret, j)
    result::Int64 = 0
    for i in 1:cases
        @inbounds result += f1(data[i])
    end
    ret[j] += result
    result
end

@inline function loopf2(data, cases, ret, j)
    result::Int64 = 0
    for i in 1:cases
        @inbounds result += f2(data[i])
    end
    ret[j] += result
end

function simple_benchmark(cases)
    Random.seed!(1234)
    data = rand(Int64, cases)
    ret = [0, 0]
    @time loopf1(data, cases, ret, 1)
    @time loopf2(data, cases, ret, 2)
    show(ret)
    println()
    println()
    nothing
end
``````

It has no significant effect on timings (still neither variant is significantly faster), except that I needed to return some result from the loops (see the ret vector): in my first attempt, the optimizer eliminated the whole loop after I put the @inline macros in the code.

Right! They’re compiling down to exactly the same code! In other words, don’t worry about this micro-optimization yourself.

I looked at your example LLVM listing; you are right, the code is identical.

Very good idea to look at the code for the whole loop, not only the f1/f2 functions: the long LLVM listing is a little tedious to read, but it pays off:

(a) it perfectly explains the runtime results, and
(b) it answers my question (to some extent):

LLVM compiles both to the shift version, a strong indication that it is faster.

The following code fragment is taken from your example for loop(f1, Int[], 1). It shows the compiled result for f1; I added my "source code backtranslation" as comments:

``````
%18 = ashr i64 %17, 63           # (c>>63)
%19 = and i64 %18, 248           # &248 corresponds to &255; same result because of the following "| 7"
%20 = or i64 %19, 7              #  |7, %20 is now ((c>>63) & 248 | 7)
%21 = and i64 %20, %17           # c & bitmask ((c>>63) & 248 | 7)

``````
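The `&248` is safe because the trailing `| 7` restores the low three bits either way; a quick check that the two masks agree for both signs:

```julia
# For c >= 0: c >> 63 == 0, so both masks reduce to 7.
# For c < 0:  c >> 63 == -1, and (-1 & 255) | 7 == (-1 & 248) | 7 == 255.
mask255(c::Int64) = (c >> 63) & 255 | 7
mask248(c::Int64) = (c >> 63) & 248 | 7

@assert all(mask255(c) == mask248(c) for c in rand(Int64, 1_000))
```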

You are totally right. Optimization should be the last step in normal application development, and in many cases it turns out that no optimization is necessary at all. Micro-optimization, in particular, should be the domain of the machine and its code optimizer.

But …
I really love to write efficient code, and to write it efficiently. For the latter, I need some feeling for what can be left to machine optimization and what cannot. I am new to Julia, but have a long development history in Java (and older languages), so I have to change my coding style.

In your example, you removed the loopf1/loopf2 redundancy by passing f1/f2 as an argument: clearly more concise, better maintainable code. Important to me: it has no performance penalty; it is all optimized away by LLVM. Doing the same in Java, obeying good coding style (declaring the function parameter as an interface), could significantly impact performance in a nontrivial application with many classes implementing that interface: if the loop method is too large to be inlined, the JVM is stuck with a costly interface dispatch for the Java equivalent of a function pointer call.

You helped me a lot towards a better understanding of Julia, thanks.


I had a first look at it … it seems to need some familiarization. It is not quite clear to me what side effects the initialization by interpolation has. I was able to run a benchmark in the REPL, but putting the @benchmark statement into a test method did not work (no output).

Do you know about a tutorial with some nontrivial examples?

Also, you should be careful: what the compiler does can change considerably once the code is placed in a loop. It didn't in that case, because bounds checking was left on and you didn't use `@simd`, but see how much the code changes:

``````julia
function loop(f, data, cases)
    result::Int64 = 0
    @inbounds @simd for i in 1:cases
        result += f(data[i])
    end
    result
end
@code_llvm debuginfo=:none
``````
``````llvm
define i64 @julia_loop_18052(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
%2 = icmp sgt i64 %1, 0
%3 = select i1 %2, i64 %1, i64 0
%4 = add nsw i64 %3, -1
%5 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %4, i64 1)
%6 = extractvalue { i64, i1 } %5, 1
br i1 %6, label %L16, label %L21

L16:                                              ; preds = %top
call void @julia_throw_overflowerr_binaryop_12551(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140605855187504 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %4, i64 1)
call void @llvm.trap()
unreachable

L21:                                              ; preds = %top
%7 = extractvalue { i64, i1 } %5, 0
%8 = icmp slt i64 %7, 1
br i1 %8, label %L63, label %L28.lr.ph

L28.lr.ph:                                        ; preds = %L21
%min.iters.check = icmp ult i64 %3, 32
br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L28.lr.ph
%n.vec = and i64 %3, 9223372036854775776
br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %36, %vector.body ]
%vec.phi19 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %37, %vector.body ]
%vec.phi20 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %38, %vector.body ]
%vec.phi21 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %39, %vector.body ]
%12 = getelementptr inbounds i64, i64 addrspace(13)* %11, i64 %index
%14 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 8
%16 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 16
%18 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 24
%20 = ashr <8 x i64> %wide.load, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%21 = ashr <8 x i64> %wide.load22, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%22 = ashr <8 x i64> %wide.load23, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%23 = ashr <8 x i64> %wide.load24, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%24 = and <8 x i64> %20, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%25 = and <8 x i64> %21, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%26 = and <8 x i64> %22, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%27 = and <8 x i64> %23, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%28 = or <8 x i64> %24, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%29 = or <8 x i64> %25, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%30 = or <8 x i64> %26, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%31 = or <8 x i64> %27, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%32 = and <8 x i64> %28, %wide.load
%33 = and <8 x i64> %29, %wide.load22
%34 = and <8 x i64> %30, %wide.load23
%35 = and <8 x i64> %31, %wide.load24
%36 = add <8 x i64> %32, %vec.phi
%37 = add <8 x i64> %33, %vec.phi19
%38 = add <8 x i64> %34, %vec.phi20
%39 = add <8 x i64> %35, %vec.phi21
%index.next = add i64 %index, 32
%40 = icmp eq i64 %index.next, %n.vec
br i1 %40, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
%bin.rdx = add <8 x i64> %37, %36
%bin.rdx25 = add <8 x i64> %38, %bin.rdx
%bin.rdx26 = add <8 x i64> %39, %bin.rdx25
%rdx.shuf = shufflevector <8 x i64> %bin.rdx26, <8 x i64> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx27 = add <8 x i64> %bin.rdx26, %rdx.shuf
%rdx.shuf28 = shufflevector <8 x i64> %bin.rdx27, <8 x i64> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx29 = add <8 x i64> %bin.rdx27, %rdx.shuf28
%rdx.shuf30 = shufflevector <8 x i64> %bin.rdx29, <8 x i64> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx31 = add <8 x i64> %bin.rdx29, %rdx.shuf30
%41 = extractelement <8 x i64> %bin.rdx31, i32 0
%cmp.n = icmp eq i64 %3, %n.vec
br i1 %cmp.n, label %L63, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L28.lr.ph
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L28.lr.ph ]
%bc.merge.rdx = phi i64 [ %41, %middle.block ], [ 0, %L28.lr.ph ]
br label %L28

L28:                                              ; preds = %scalar.ph, %L28
%value_phi215 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %49, %L28 ]
%value_phi14 = phi i64 [ %bc.merge.rdx, %scalar.ph ], [ %48, %L28 ]
%42 = getelementptr inbounds i64, i64 addrspace(13)* %11, i64 %value_phi215
%44 = ashr i64 %43, 63
%45 = and i64 %44, 248
%46 = or i64 %45, 7
%47 = and i64 %46, %43
%48 = add i64 %47, %value_phi14
%49 = add nuw nsw i64 %value_phi215, 1
%50 = icmp ult i64 %49, %7
br i1 %50, label %L28, label %L63

L63:                                              ; preds = %L28, %middle.block, %L21
%value_phi5 = phi i64 [ 0, %L21 ], [ %48, %L28 ], [ %41, %middle.block ]
ret i64 %value_phi5
}
``````

The chunk corresponding to the snippet from your post:

``````llvm
%20 = ashr <8 x i64> %wide.load, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%21 = ashr <8 x i64> %wide.load22, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%22 = ashr <8 x i64> %wide.load23, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%23 = ashr <8 x i64> %wide.load24, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
%24 = and <8 x i64> %20, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%25 = and <8 x i64> %21, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%26 = and <8 x i64> %22, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%27 = and <8 x i64> %23, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
%28 = or <8 x i64> %24, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%29 = or <8 x i64> %25, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%30 = or <8 x i64> %26, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%31 = or <8 x i64> %27, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
%32 = and <8 x i64> %28, %wide.load
%33 = and <8 x i64> %29, %wide.load22
%34 = and <8 x i64> %30, %wide.load23
%35 = and <8 x i64> %31, %wide.load24
``````

So it’s definitely worth checking out what happens when your code is placed in a loop.

IMO, it's a lot easier to make code fast if you keep performance in mind from the beginning than it is to optimize code as an afterthought. As you say, while micro-optimization can be left to a compiler, it definitely helps to know what the compiler is capable of and likely to do, and then set up your data structures and code to encourage it.

`loopold` is the version without `@inbounds @simd`, and `loop` is the version with it; an example of using BenchmarkTools:

``````
julia> data = rand(Int, 800);

julia> using BenchmarkTools

julia> @benchmark loopold(f2, $data, 800)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     476.301 ns (0.00% GC)
median time:      479.168 ns (0.00% GC)
mean time:        489.404 ns (0.00% GC)
maximum time:     763.898 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     196

julia> @benchmark loop(f2, $data, 800)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     65.316 ns (0.00% GC)
median time:      65.432 ns (0.00% GC)
mean time:        66.471 ns (0.00% GC)
maximum time:     102.284 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     979
``````

Just `@inbounds` is required; Julia will automatically use SIMD without `@simd`, since reordering this loop has no observable consequences (integer addition is associative). An explicit `@simd` macro is really only needed for cartesian iteration and for allowing floating-point associativity.
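To verify this on a given machine, one can inspect the IR of an `@inbounds`-only loop (a sketch; look for `<8 x i64>` vectors in the output, though the result may depend on the Julia/LLVM version):

```julia
f2(c::Int64) = c & ((c>>63) & 255 | 7)

function loop_inbounds(f, data, cases)
    result = Int64(0)
    @inbounds for i in 1:cases
        result += f(data[i])
    end
    result
end

# If the loop vectorized, the printed IR contains <8 x i64> operations.
@code_llvm debuginfo=:none loop_inbounds(f2, rand(Int64, 64), 64)
```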
