Performance: replace a conditional jump with shift/and/or?

I am working on a data type that stores several fields bit-packed into a container c :: Int64. There are two different storage layouts, selected by the most significant bit. A simplified example: if bit63 is 0, my bit-packed field f is stored in bit0…bit2 of c, otherwise in bit0…bit7.
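For illustration, hypothetical packing helpers for this simplified example (the helper names and the choice to keep f in the low bits are just part of the sketch, not my real data type):

pack_small(f::Int64) = f & 7                       # bit63 == 0: f stored in the low bits
pack_large(f::Int64) = (f & 255) | typemin(Int64)  # bit63 == 1: f stored in bit0…bit7 (typemin(Int64) has only bit63 set)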

Intuitive extraction code for f could look like this:

f1(c::Int64) = c>=0 ? c&7 : c&255

Alternatively, I can avoid conditional jumps and construct the extraction bit mask by extending the sign bit:

f2(c::Int64) = c & ((c>>63) & 255 | 7)

I thought the less intuitive f2 would be significantly faster if positive and negative values of c occur randomly with similar probability. The reason: a modern CPU has to rebuild its micro-op instruction queue when branch prediction fails, and that costs more than 3 additional elementary shift/and/or operations on registers.

I tried to verify this with simple benchmarking, but failed: no significant difference between the two variants, probably because loop overhead dominates (1 branch, 1 array access per iteration). Results below.

How could I obtain significant benchmark results? Or am I wrong in my assumption that conditional branches are way more expensive than a couple of elementary integer operations?

My test code:

f1(c::Int64) = c>=0 ? c&7 : c&255

f2(c::Int64) = c & ((c>>63) & 255 | 7)

function loopf1(data, cases)
    result ::Int64 = 0
    for i in 1:cases
        result += f1(data[i])
    end
    result
end

function loopf2(data, cases)
    result ::Int64 = 0
    for i in 1:cases
        result += f2(data[i])
    end
    result
end

using Random

function simple_benchmark(cases)
    Random.seed!(1234)
    data = rand(Int64,cases)
    @time loopf1(data,cases)
    @time loopf2(data,cases)
    nothing
end

My test results (a fairly slow notebook with a Pentium 4415U, Windows 10 64-bit, Julia 1.0.5):

julia> simple_benchmark(2000000)
  0.000872 seconds
  0.001007 seconds

julia> simple_benchmark(2000000)
  0.001912 seconds
  0.001241 seconds

julia> simple_benchmark(2000000)
  0.000886 seconds
  0.001525 seconds

julia> simple_benchmark(20000000)
  0.008750 seconds
  0.013753 seconds

julia> simple_benchmark(20000000)
  0.009302 seconds
  0.008985 seconds

julia> simple_benchmark(20000000)
  0.009369 seconds
  0.008960 seconds

julia> simple_benchmark(20000000)
  0.009334 seconds
  0.009140 seconds

I believe the two functions are identical; both are using bit-twiddling! Julia/LLVM can often turn branches into bit-twiddles or ifelse selects if it is obvious and advantageous. I think the differences you’re seeing are just caching/memory location effects. There’s also some overhead from bounds-checking mixed in; if you remove bounds-checks with @inbounds then it will SIMD the bit-twiddling — in both versions!


Aside: I recommend using BenchmarkTools.jl

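Something along these lines, for example (a sketch using the loopf1/loopf2 defined above; the $ interpolation keeps global-variable access out of the measurement):

using BenchmarkTools

data = rand(Int64, 2_000_000)
@btime loopf1($data, $(length(data)))
@btime loopf2($data, $(length(data)))

Note that @benchmark returns a Trial object; at the REPL it is displayed automatically, but inside a function you have to call display on the result yourself, or just use @btime.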

They generate different code, and the shift version is definitely slower on my i7. I'm using Julia 1.2, and I altered some of the comment quotes in the generated code to help the syntax highlighter.

julia> c = 1
1

julia> @code_native f1(c)
	.text
; ┌ @ timing.jl:1 within 'f1'
; │┌ @ operators.jl:341 within '>='
; ││┌ @ timing.jl:1 within '<='
	testq	%rdi, %rdi
; │└└
	js	L12
; │┌ @ int.jl:293 within '&'
	andl	$7, %edi
; │└
	movq	%rdi, %rax
	retq
; │┌ @ int.jl:293 within '&'
L12:
	movzbl	%dil, %eax
; │└
	retq
	nopw	%cs:(%rax,%rax)
; └

julia> @code_native f2(c)
	.text
; ┌ @ timing.jl:3 within 'f2'
; │┌ @ int.jl:444 within '>>' @ timing.jl:3
	movq	%rdi, %rax
	sarq	$63, %rax
; │└
; │┌ @ int.jl:293 within '&'
	andl	$248, %eax
; │└
; │┌ @ int.jl:316 within '|'
	leaq	7(%rax), %rax
; │└
; │┌ @ int.jl:293 within '&'
	andq	%rdi, %rax
; │└
	retq
	nopw	%cs:(%rax,%rax)
; └

Ah, but check out what happens when you put them in a loop 🙂

Example
julia> versioninfo()
Julia Version 1.4.0-DEV.374
Commit f806717255 (2019-10-24 17:01 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> f1(c::Int64) = c>=0 ? c&7 : c&255

       f2(c::Int64) = c & ((c>>63) & 255 | 7)

       function loop(f, data, cases)
           result ::Int64 = 0
           for i in 1:cases
               result += f(data[i])
           end
           result
       end
loop (generic function with 1 method)
julia> @code_llvm debuginfo=:none loop(f1, Int[], 1)

define i64 @julia_loop_17302(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
  %2 = icmp sgt i64 %1, 0
  br i1 %2, label %L7.L12_crit_edge, label %L35

L7.L12_crit_edge:                                 ; preds = %top
  %3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = icmp eq i64 %6, 0
  br i1 %7, label %oob, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %L7.L12_crit_edge
  %8 = select i1 %2, i64 %1, i64 0
  %9 = bitcast %jl_value_t addrspace(11)* %3 to i64 addrspace(13)* addrspace(11)*
  %10 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %9, align 8
  br label %idxend

L34:                                              ; preds = %idxend
  %11 = add nuw i64 %value_phi414, 1
  %12 = icmp ult i64 %value_phi414, %6
  br i1 %12, label %idxend, label %oob

L35:                                              ; preds = %idxend, %top
  %value_phi10 = phi i64 [ 0, %top ], [ %21, %idxend ]
  ret i64 %value_phi10

oob:                                              ; preds = %L34, %L7.L12_crit_edge
  %value_phi4.lcssa = phi i64 [ 1, %L7.L12_crit_edge ], [ %11, %L34 ]
  %13 = alloca i64, align 8
  store i64 %value_phi4.lcssa, i64* %13, align 8
  %14 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(12)*
  call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %14, i64* nonnull %13, i64 1)
  unreachable

idxend:                                           ; preds = %idxend.lr.ph, %L34
  %15 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi414, %L34 ]
  %value_phi414 = phi i64 [ 1, %idxend.lr.ph ], [ %11, %L34 ]
  %value_phi313 = phi i64 [ 0, %idxend.lr.ph ], [ %21, %L34 ]
  %16 = getelementptr inbounds i64, i64 addrspace(13)* %10, i64 %15
  %17 = load i64, i64 addrspace(13)* %16, align 8
  %18 = ashr i64 %17, 63
  %19 = and i64 %18, 248
  %20 = or i64 %19, 7
  %value_phi6 = and i64 %20, %17
  %21 = add i64 %value_phi6, %value_phi313
  %22 = icmp eq i64 %value_phi414, %8
  br i1 %22, label %L35, label %L34
}

julia> @code_llvm debuginfo=:none loop(f2, Int[], 1)

define i64 @julia_loop_17303(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
  %2 = icmp sgt i64 %1, 0
  br i1 %2, label %L7.L12_crit_edge, label %L38

L7.L12_crit_edge:                                 ; preds = %top
  %3 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = icmp eq i64 %6, 0
  br i1 %7, label %oob, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %L7.L12_crit_edge
  %8 = select i1 %2, i64 %1, i64 0
  %9 = bitcast %jl_value_t addrspace(11)* %3 to i64 addrspace(13)* addrspace(11)*
  %10 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %9, align 8
  br label %idxend

L37:                                              ; preds = %idxend
  %11 = add nuw i64 %value_phi413, 1
  %12 = icmp ult i64 %value_phi413, %6
  br i1 %12, label %idxend, label %oob

L38:                                              ; preds = %idxend, %top
  %value_phi9 = phi i64 [ 0, %top ], [ %22, %idxend ]
  ret i64 %value_phi9

oob:                                              ; preds = %L37, %L7.L12_crit_edge
  %value_phi4.lcssa = phi i64 [ 1, %L7.L12_crit_edge ], [ %11, %L37 ]
  %13 = alloca i64, align 8
  store i64 %value_phi4.lcssa, i64* %13, align 8
  %14 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(12)*
  call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %14, i64* nonnull %13, i64 1)
  unreachable

idxend:                                           ; preds = %idxend.lr.ph, %L37
  %15 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi413, %L37 ]
  %value_phi413 = phi i64 [ 1, %idxend.lr.ph ], [ %11, %L37 ]
  %value_phi312 = phi i64 [ 0, %idxend.lr.ph ], [ %22, %L37 ]
  %16 = getelementptr inbounds i64, i64 addrspace(13)* %10, i64 %15
  %17 = load i64, i64 addrspace(13)* %16, align 8
  %18 = ashr i64 %17, 63
  %19 = and i64 %18, 248
  %20 = or i64 %19, 7
  %21 = and i64 %20, %17
  %22 = add i64 %21, %value_phi312
  %23 = icmp eq i64 %value_phi413, %8
  br i1 %23, label %L38, label %L37
}

Thanks for your recommendation. I changed the code to this:

@inline f1(c::Int64) = c>=0 ? c&7 : c&255

@inline f2(c::Int64) = c & ((c>>63) & 255 | 7)

a = 0

@inline function loopf1(data, cases, ret,j)
    result ::Int64 = 0
    for i in 1:cases
        @inbounds result += f1(data[i])
    end
    ret[j] += result
    result
end

@inline function loopf2(data, cases, ret, j)
    result ::Int64 = 0
    for i in 1:cases
        @inbounds result += f2(data[i])
    end
    ret[j] += result
end


function simple_benchmark(cases)
    Random.seed!(1234)
    data = rand(Int64,cases)
    ret = [0,0]
    @time loopf1(data,cases,ret,1)
    @time loopf2(data,cases,ret,2)
    show(ret)
    println()
    println()
    nothing
end

This has no significant effect on the timings (still neither variant is significantly faster). One thing I did have to change: I need to return some result from the loops (see the ret vector), because in my first attempt the optimizer eliminated the whole loop after I put the @inline macros in the code.
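A sketch of the same idea with the original two-argument loops (simple_benchmark2, r1 and r2 are just illustrative names): since @time returns the value of the timed expression, consuming that value also keeps the optimizer from dropping the loops:

function simple_benchmark2(cases)
    Random.seed!(1234)
    data = rand(Int64, cases)
    r1 = @time loopf1(data, cases)   # two-argument versions from my first post
    r2 = @time loopf2(data, cases)
    println(r1 + r2)                 # results are used, so the loops cannot be eliminated
    nothing
end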

Right! They’re compiling down to exactly the same code! In other words, don’t worry about this micro-optimization yourself.

I looked at your example LLVM listing; you are right, the code is identical.

Very good idea to look at the code for the whole loop, not only at the f1/f2 functions: the long LLVM listing is a little tedious to read, but it pays off:

(a) it perfectly explains the runtime results, and
(b) it answers my question (to some extent):

LLVM compiles both functions to the shift version, a strong indication that it is faster 🙂

The following code fragment is taken from your example for loop(f1, Int[], 1); it shows the compiled result for f1. I added my "source code back-translation" as comments:


  %18 = ashr i64 %17, 63           # (c>>63)
  %19 = and i64 %18, 248           # & 248; same result as & 255 because of the following | 7
  %20 = or i64 %19, 7              # | 7; %20 is now ((c>>63) & 248 | 7)
  %21 = and i64 %20, %17           # c & bitmask ((c>>63) & 248 | 7)
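A quick REPL check that the two masks agree for both possible values of c>>63, which is either 0 or -1:

julia> ((0 & 248) | 7) == ((0 & 255) | 7)
true

julia> ((-1 & 248) | 7) == ((-1 & 255) | 7)
true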

don’t worry about this micro-optimization yourself.

You are totally right. Optimization should be the last step in normal application development, and in many cases it turns out that no optimization is necessary at all. Micro-optimization, in particular, should be the domain of the machine and its code optimizer.

But …
I really love to write efficient code, and to write it efficiently. For the latter, I need a feel for what can be left to machine optimization and what cannot. I am new to Julia, but have a long development history in Java (and older languages), so I have to change my coding style.

In your example, you removed the loopf1/loopf2 redundancy by passing f1/f2 as an argument: clearly more concise, more maintainable code. Important to me: it has no performance penalty, because it is all optimized away by LLVM. Doing the same in Java with good coding style (declaring the function parameter as an interface), in a nontrivial application with many functions implementing that interface, could significantly hurt performance: if the loop method is too large to get inlined, the JVM is stuck with a costly interface dispatch for the Java equivalent of a function pointer call.

You helped me a lot towards a better understanding of Julia, thanks.


I had a first look at it … it seems to need some familiarization. It is not quite clear to me what side effects the initialization by interpolation has. I was able to run a benchmark in the REPL, but putting the @benchmark statement into a test method did not work (no output).

Do you know of a tutorial with some nontrivial examples?

Also, you should be aware that the compiler may transform the code quite differently inside a loop. It didn't in that case, because bounds checking was left on and you didn't use @simd, but see how much the code changes:

function loop(f, data, cases)
    result ::Int64 = 0
    @inbounds @simd for i in 1:cases
        result += f(data[i])
    end
    result
end
@code_llvm debuginfo=:none
define i64 @julia_loop_18052(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
  %2 = icmp sgt i64 %1, 0
  %3 = select i1 %2, i64 %1, i64 0
  %4 = add nsw i64 %3, -1
  %5 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %4, i64 1)
  %6 = extractvalue { i64, i1 } %5, 1
  br i1 %6, label %L16, label %L21

L16:                                              ; preds = %top
  call void @julia_throw_overflowerr_binaryop_12551(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140605855187504 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %4, i64 1)
  call void @llvm.trap()
  unreachable

L21:                                              ; preds = %top
  %7 = extractvalue { i64, i1 } %5, 0
  %8 = icmp slt i64 %7, 1
  br i1 %8, label %L63, label %L28.lr.ph

L28.lr.ph:                                        ; preds = %L21
  %9 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %10 = bitcast %jl_value_t addrspace(11)* %9 to i64 addrspace(13)* addrspace(11)*
  %11 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %10, align 8
  %min.iters.check = icmp ult i64 %3, 32
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L28.lr.ph
  %n.vec = and i64 %3, 9223372036854775776
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %36, %vector.body ]
  %vec.phi19 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %37, %vector.body ]
  %vec.phi20 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %38, %vector.body ]
  %vec.phi21 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %39, %vector.body ]
  %12 = getelementptr inbounds i64, i64 addrspace(13)* %11, i64 %index
  %13 = bitcast i64 addrspace(13)* %12 to <8 x i64> addrspace(13)*
  %wide.load = load <8 x i64>, <8 x i64> addrspace(13)* %13, align 8
  %14 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 8
  %15 = bitcast i64 addrspace(13)* %14 to <8 x i64> addrspace(13)*
  %wide.load22 = load <8 x i64>, <8 x i64> addrspace(13)* %15, align 8
  %16 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 16
  %17 = bitcast i64 addrspace(13)* %16 to <8 x i64> addrspace(13)*
  %wide.load23 = load <8 x i64>, <8 x i64> addrspace(13)* %17, align 8
  %18 = getelementptr inbounds i64, i64 addrspace(13)* %12, i64 24
  %19 = bitcast i64 addrspace(13)* %18 to <8 x i64> addrspace(13)*
  %wide.load24 = load <8 x i64>, <8 x i64> addrspace(13)* %19, align 8
  %20 = ashr <8 x i64> %wide.load, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %21 = ashr <8 x i64> %wide.load22, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %22 = ashr <8 x i64> %wide.load23, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %23 = ashr <8 x i64> %wide.load24, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %24 = and <8 x i64> %20, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %25 = and <8 x i64> %21, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %26 = and <8 x i64> %22, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %27 = and <8 x i64> %23, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %28 = or <8 x i64> %24, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %29 = or <8 x i64> %25, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %30 = or <8 x i64> %26, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %31 = or <8 x i64> %27, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %32 = and <8 x i64> %28, %wide.load
  %33 = and <8 x i64> %29, %wide.load22
  %34 = and <8 x i64> %30, %wide.load23
  %35 = and <8 x i64> %31, %wide.load24
  %36 = add <8 x i64> %32, %vec.phi
  %37 = add <8 x i64> %33, %vec.phi19
  %38 = add <8 x i64> %34, %vec.phi20
  %39 = add <8 x i64> %35, %vec.phi21
  %index.next = add i64 %index, 32
  %40 = icmp eq i64 %index.next, %n.vec
  br i1 %40, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <8 x i64> %37, %36
  %bin.rdx25 = add <8 x i64> %38, %bin.rdx
  %bin.rdx26 = add <8 x i64> %39, %bin.rdx25
  %rdx.shuf = shufflevector <8 x i64> %bin.rdx26, <8 x i64> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx27 = add <8 x i64> %bin.rdx26, %rdx.shuf
  %rdx.shuf28 = shufflevector <8 x i64> %bin.rdx27, <8 x i64> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx29 = add <8 x i64> %bin.rdx27, %rdx.shuf28
  %rdx.shuf30 = shufflevector <8 x i64> %bin.rdx29, <8 x i64> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bin.rdx31 = add <8 x i64> %bin.rdx29, %rdx.shuf30
  %41 = extractelement <8 x i64> %bin.rdx31, i32 0
  %cmp.n = icmp eq i64 %3, %n.vec
  br i1 %cmp.n, label %L63, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L28.lr.ph
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L28.lr.ph ]
  %bc.merge.rdx = phi i64 [ %41, %middle.block ], [ 0, %L28.lr.ph ]
  br label %L28

L28:                                              ; preds = %scalar.ph, %L28
  %value_phi215 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %49, %L28 ]
  %value_phi14 = phi i64 [ %bc.merge.rdx, %scalar.ph ], [ %48, %L28 ]
  %42 = getelementptr inbounds i64, i64 addrspace(13)* %11, i64 %value_phi215
  %43 = load i64, i64 addrspace(13)* %42, align 8
  %44 = ashr i64 %43, 63
  %45 = and i64 %44, 248
  %46 = or i64 %45, 7
  %47 = and i64 %46, %43
  %48 = add i64 %47, %value_phi14
  %49 = add nuw nsw i64 %value_phi215, 1
  %50 = icmp ult i64 %49, %7
  br i1 %50, label %L28, label %L63

L63:                                              ; preds = %L28, %middle.block, %L21
  %value_phi5 = phi i64 [ 0, %L21 ], [ %48, %L28 ], [ %41, %middle.block ]
  ret i64 %value_phi5
}

The chunk corresponding to the snippet from your post:

  %20 = ashr <8 x i64> %wide.load, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %21 = ashr <8 x i64> %wide.load22, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %22 = ashr <8 x i64> %wide.load23, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %23 = ashr <8 x i64> %wide.load24, <i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63, i64 63>
  %24 = and <8 x i64> %20, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %25 = and <8 x i64> %21, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %26 = and <8 x i64> %22, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %27 = and <8 x i64> %23, <i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248, i64 248>
  %28 = or <8 x i64> %24, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %29 = or <8 x i64> %25, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %30 = or <8 x i64> %26, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %31 = or <8 x i64> %27, <i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7, i64 7>
  %32 = and <8 x i64> %28, %wide.load
  %33 = and <8 x i64> %29, %wide.load22
  %34 = and <8 x i64> %30, %wide.load23
  %35 = and <8 x i64> %31, %wide.load24

So it’s definitely worth checking out what happens when your code is placed in a loop.

IMO, it’s a lot easier to make code fast if you keep performance in mind from the beginning than it is to optimize code as an afterthought. Like you say, while micro-optimization can be left to the compiler, it definitely helps to know what the compiler is capable of and likely to do, and then to set up your data structures and code to encourage it.

Here loopold is the version without @inbounds @simd and loop is the version with it; an example of using BenchmarkTools:

julia> data = rand(Int, 800);

julia> using BenchmarkTools

julia> @benchmark loopold(f2, $data, 800)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     476.301 ns (0.00% GC)
  median time:      479.168 ns (0.00% GC)
  mean time:        489.404 ns (0.00% GC)
  maximum time:     763.898 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     196

julia> @benchmark loop(f2, $data, 800)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     65.316 ns (0.00% GC)
  median time:      65.432 ns (0.00% GC)
  mean time:        66.471 ns (0.00% GC)
  maximum time:     102.284 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     979

Just @inbounds is required; Julia will automatically use SIMD without @simd, since there are no observable consequences to reordering this (integer) loop. An explicit @simd macro is really only needed for Cartesian iteration and for allowing floating-point reassociation.
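For example, a minimal sketch (loop_inbounds is just an illustrative name; inspecting it with @code_llvm should show the same <8 x i64> vector instructions as the @inbounds @simd version above):

function loop_inbounds(f, data, cases)
    result ::Int64 = 0
    @inbounds for i in 1:cases
        result += f(data[i])
    end
    result
end

@code_llvm debuginfo=:none loop_inbounds(f2, rand(Int64, 800), 800)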
