Branch elimination in Julia

I have always been fascinated by the effect of branch prediction demonstrated in this StackOverflow question, and I wanted to include this example in my course to illustrate how hard it is to predict the performance of modern hardware.

Unfortunately, it seems that this example no longer works in Julia. Running the code from @Tamas_Papp’s blog post in Julia 1.4.2, I obtain the following.

"Sum elements if ≥ 128."
function sumabove_if(x)
    s = zero(eltype(x))
    for elt in x
        if elt ≥ 128
            s += elt
        end
    end
    s
end

x_rand = rand(1:256, 32768) # original example on stackoverflow, except using Int
x_sorted = sort(x_rand)

@btime sumabove_if($x_rand)     # 4.139 μs (0 allocations: 0 bytes)
@btime sumabove_if($x_sorted)   # 3.857 μs (0 allocations: 0 bytes)

As you can see, sorting the array gives only a very small improvement in performance, and certainly not the roughly 5x speedup reported by @Tamas_Papp.

I assume what is happening here is that Julia / LLVM has become clever enough to recognise that the branch is unnecessary and to eliminate it for me. So my questions are:

  1. Is my assumption correct?
  2. Can I work around the smartness of Julia / LLVM somehow?

For completeness:

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "/usr/share/code/code"
  JULIA_NUM_THREADS = 

Looking at @code_llvm and @code_native, I would assume the branch is still there?

Then I realized that Spectre (security vulnerability) - Wikipedia was not patched until 2018, which is after that blog post. In theory this could be related to the benchmark above, but I'm not sure.
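
In case it is useful, this is roughly what I ran (a minimal sketch; InteractiveUtils is loaded automatically in the REPL, but needs an explicit import in a script):

using InteractiveUtils   # provides @code_llvm and @code_native outside the REPL

# If the branch has been if-converted, the hot loop should contain vectorized
# `select` instructions rather than a conditional `br` on the `elt ≥ 128` test.
@code_llvm sumabove_if(x_rand)
@code_native sumabove_if(x_rand)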

The loop has no branches on my computer:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %12, %vector.ph ], [ %31, %vector.body ]
  %vec.phi27 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %32, %vector.body ]
  %vec.phi28 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %33, %vector.body ]
  %vec.phi29 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %34, %vector.body ]
  %offset.idx = add i64 1, %index
  %13 = add i64 %offset.idx, 0
  %14 = getelementptr inbounds i64, i64* %6, i64 %13
  %15 = getelementptr inbounds i64, i64* %14, i32 0
  %16 = bitcast i64* %15 to <8 x i64>*
  %wide.load = load <8 x i64>, <8 x i64>* %16, align 8
  %17 = getelementptr inbounds i64, i64* %14, i32 8
  %18 = bitcast i64* %17 to <8 x i64>*
  %wide.load37 = load <8 x i64>, <8 x i64>* %18, align 8
  %19 = getelementptr inbounds i64, i64* %14, i32 16
  %20 = bitcast i64* %19 to <8 x i64>*
  %wide.load38 = load <8 x i64>, <8 x i64>* %20, align 8
  %21 = getelementptr inbounds i64, i64* %14, i32 24
  %22 = bitcast i64* %21 to <8 x i64>*
  %wide.load39 = load <8 x i64>, <8 x i64>* %22, align 8
  %23 = icmp slt <8 x i64> %wide.load, <i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128>
  %24 = icmp slt <8 x i64> %wide.load37, <i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128>
  %25 = icmp slt <8 x i64> %wide.load38, <i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128>
  %26 = icmp slt <8 x i64> %wide.load39, <i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128, i64 128>
  %27 = select <8 x i1> %23, <8 x i64> zeroinitializer, <8 x i64> %wide.load
  %28 = select <8 x i1> %24, <8 x i64> zeroinitializer, <8 x i64> %wide.load37
  %29 = select <8 x i1> %25, <8 x i64> zeroinitializer, <8 x i64> %wide.load38
  %30 = select <8 x i1> %26, <8 x i64> zeroinitializer, <8 x i64> %wide.load39
  %31 = add <8 x i64> %vec.phi, %27
  %32 = add <8 x i64> %vec.phi27, %28
  %33 = add <8 x i64> %vec.phi28, %29
  %34 = add <8 x i64> %vec.phi29, %30
  %index.next = add i64 %index, 32
  %35 = icmp eq i64 %index.next, %n.vec
  br i1 %35, label %middle.block, label %vector.body

Both the sorted and unsorted versions ran at exactly the same speed:

julia> @btime sumabove_if($x_sorted)
  2.639 μs (0 allocations: 0 bytes)
3173525

julia> @btime sumabove_if($x_rand)
  2.638 μs (0 allocations: 0 bytes)
3173525

Since you saw a difference in performance, maybe that is not the case on your computer. I would check the @code_llvm output.
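
If you'd rather grep the IR than read the whole dump, something along these lines should work (a sketch; code_llvm lives in InteractiveUtils):

using InteractiveUtils

io = IOBuffer()
code_llvm(io, sumabove_if, Tuple{Vector{Int}})
ir = String(take!(io))

# An if-converted loop sums via `select`; a data-dependent branch would show
# up as an extra conditional `br` inside the loop body.
occursin("select", ir)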

Right, my @code_llvm output leads me to believe I still have a branch:

vector.body:                                      ; preds = %vector.body, %vector.ph
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %vec.phi = phi <4 x i64> [ %12, %vector.ph ], [ %29, %vector.body ]
    %vec.phi23 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %30, %vector.body ]
    %vec.phi24 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %31, %vector.body ]
    %vec.phi25 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %32, %vector.body ]
    %offset.idx = or i64 %index, 1
    %13 = getelementptr inbounds i64, i64 addrspace(13)* %7, i64 %offset.idx
; └└
  %14 = bitcast i64 addrspace(13)* %13 to <4 x i64> addrspace(13)*
  %wide.load = load <4 x i64>, <4 x i64> addrspace(13)* %14, align 8
  %15 = getelementptr inbounds i64, i64 addrspace(13)* %13, i64 4
  %16 = bitcast i64 addrspace(13)* %15 to <4 x i64> addrspace(13)*
  %wide.load33 = load <4 x i64>, <4 x i64> addrspace(13)* %16, align 8
  %17 = getelementptr inbounds i64, i64 addrspace(13)* %13, i64 8
  %18 = bitcast i64 addrspace(13)* %17 to <4 x i64> addrspace(13)*
  %wide.load34 = load <4 x i64>, <4 x i64> addrspace(13)* %18, align 8
  %19 = getelementptr inbounds i64, i64 addrspace(13)* %13, i64 12
  %20 = bitcast i64 addrspace(13)* %19 to <4 x i64> addrspace(13)*
  %wide.load35 = load <4 x i64>, <4 x i64> addrspace(13)* %20, align 8
;  @ REPL[1]:4 within `sumabove_if'
; ┌ @ operators.jl:341 within `>='
; │┌ @ int.jl:410 within `<='
    %21 = icmp slt <4 x i64> %wide.load, <i64 128, i64 128, i64 128, i64 128>
    %22 = icmp slt <4 x i64> %wide.load33, <i64 128, i64 128, i64 128, i64 128>
    %23 = icmp slt <4 x i64> %wide.load34, <i64 128, i64 128, i64 128, i64 128>
    %24 = icmp slt <4 x i64> %wide.load35, <i64 128, i64 128, i64 128, i64 128>
; └└
  %25 = select <4 x i1> %21, <4 x i64> zeroinitializer, <4 x i64> %wide.load
  %26 = select <4 x i1> %22, <4 x i64> zeroinitializer, <4 x i64> %wide.load33
  %27 = select <4 x i1> %23, <4 x i64> zeroinitializer, <4 x i64> %wide.load34
  %28 = select <4 x i1> %24, <4 x i64> zeroinitializer, <4 x i64> %wide.load35
  %29 = add <4 x i64> %25, %vec.phi
  %30 = add <4 x i64> %26, %vec.phi23
  %31 = add <4 x i64> %27, %vec.phi24
  %32 = add <4 x i64> %28, %vec.phi25
  %index.next = add i64 %index, 16
  %33 = icmp eq i64 %index.next, %n.vec
  br i1 %33, label %middle.block, label %vector.body

Though the performance is about the same for both on my machine (not quite as close as your result).

The LLVM IR doesn't show a branch (the only br is the loop back-edge), which makes the performance difference you observe curious. Are you sure it isn't just noise?
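
If it helps to rule out noise, one option (just a sketch) is to compare the full timing distributions instead of @btime's minimum:

using BenchmarkTools

# @benchmark reports the whole distribution (minimum, median, mean, maximum),
# which makes it easier to see whether two runs genuinely differ or just overlap.
@benchmark sumabove_if($x_rand)
@benchmark sumabove_if($x_sorted)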

Benchmarking on my laptop:

julia> @btime sumabove_if($x_rand)
  10.587 μs (0 allocations: 0 bytes)
3193867

julia> @btime sumabove_if($x_sorted)
  11.898 μs (0 allocations: 0 bytes)
3193867

This is fun:

julia> @btime sumabove_if($x_sorted)
  11.913 μs (0 allocations: 0 bytes)
3193867

julia> @btime sumabove_if($x_rand)
  10.618 μs (0 allocations: 0 bytes)
3193867

julia> copyto!(x_rand, x_sorted);

julia> @btime sumabove_if($x_rand)
  10.625 μs (0 allocations: 0 bytes)
3193867

The speed difference depended on which vector (x_rand vs x_sorted) was passed, rather than on their actual contents ;). This also meant that benchmarking the same vector over and over again gave consistent timings on this laptop. Try using the same vector, but with Random.shuffle!ed or sort!ed contents.
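
For example, something like this keeps the memory layout fixed and only changes the contents (a sketch, reusing sumabove_if from above):

using BenchmarkTools, Random

x = rand(1:256, 32768)
@btime sumabove_if($x)   # random contents

sort!(x)                 # same vector object, contents now sorted in place
@btime sumabove_if($x)

shuffle!(x)              # same vector object, contents shuffled again
@btime sumabove_if($x)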

I am not really sure why that would be, so I’m not going to speculate. FWIW:

julia> reinterpret(Int, pointer(x_rand)) % 4096
768

julia> reinterpret(Int, pointer(x_sorted)) % 4096
2880

Any chance I can persuade the compiler not to eliminate the branch, so that I can still illustrate the effect?

This version shows the difference, although it is a bit ugly (and needs a new name):

julia> function sumabove_if(x,y)
           s = zero(eltype(x))
           for i = 1:length(x)
              s += x[i] >= 128 ? x[i] : y[i]
           end
           s
       end
sumabove_if

julia> @btime sumabove_if($x_rand, $x_rand)
  344.804 ms (0 allocations: 0 bytes)
12849273531

julia> @btime sumabove_if($x_sorted, $x_sorted)
  80.943 ms (0 allocations: 0 bytes)
12849273531

julia> @btime sumabove_if($x_sorted, $x_rand)
  89.936 ms (0 allocations: 0 bytes)
16049723715

julia> @btime sumabove_if($x_rand, $x_sorted)
  347.906 ms (0 allocations: 0 bytes)
16049408163
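
For the course demo, one way to use this (a sketch) is to combine it with the in-place sort!/shuffle! idea from above, so that only the ordering of the data changes between runs:

using BenchmarkTools, Random

x = rand(1:256, 32768)
y = copy(x)                   # the second array is the trick above that keeps a real branch

@btime sumabove_if($x, $y)    # random order: the branch is hard to predict
sort!(x); sort!(y)
@btime sumabove_if($x, $y)    # sorted order: the branch predictor does well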


Perfect, exactly what I was looking for. Thanks to everyone for their help!