Branch prediction: why is the predictable case slower?

Hi,

Following the Parallel Computing course on JuliaAcademy, I wanted to try out the hardware effects of branch prediction (https://juliaacademy.com/courses/parallel-computing/lectures/10036462, at about 33:00).

The idea is that in the second case the data is sorted, so the hardware should be able to easily predict which way the branch will go (true for the first half of the array and false for the second half), and the second case should therefore perform better.

In the JuliaAcademy recorded lesson the performance is basically the same in both versions, since he is using a cloud service where this sort of prediction is disabled, but on my computer the second case is actually worse (as you can see below). Any idea why this could be? (I’m running this on a 6-core Intel W3690 chip.)

Thanks

using BenchmarkTools
data = rand(5000)

function findclosest2(data, point)
    bestval = first(data)
    bestdist = abs(bestval - point)
    for elt in data
        dist = abs(elt - point)
        if dist < bestdist
            bestval = elt
            bestdist = dist
        end
    end
    return bestval
end

@benchmark findclosest2($data, $0.5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.812 μs (0.00% GC)
  median time:      6.814 μs (0.00% GC)
  mean time:        6.875 μs (0.00% GC)
  maximum time:     33.603 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

sorteddata = sort(data)
@benchmark findclosest2($sorteddata, $0.5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.060 μs (0.00% GC)
  median time:      8.062 μs (0.00% GC)
  mean time:        8.154 μs (0.00% GC)
  maximum time:     44.750 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

I think branch prediction doesn’t have a lot to do with it:

julia> numupdates(data, point) = begin
           bestval = first(data)
           bestdist = abs(bestval - point)
           result = 0
           for elt in data
               dist = abs(elt - point)
               if dist < bestdist
                   bestval = elt
                   bestdist = dist
                   result += 1
               end
           end
           return result
       end
numupdates (generic function with 1 method)

julia> numupdates(data, 0.5)
13

julia> numupdates(sorteddata, 0.5)
2506

(You can probably also see this if you use a profiler to find out where the extra 10% of running time is being spent.)

I’m not an expert on the topic, but I’m not sure whether this is a great example of where branch prediction is useful. Usually, branch prediction is useful because it allows the processor to speculatively take an action before the branch condition has been evaluated. The one example that’s top of mind for me is memory access: the processor can already load memory that it might use if the condition ends up being true. (This is at the heart of the Spectre vulnerability.) But in the example from your post, there is no memory access that depends on the branch condition.
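To illustrate the kind of situation I mean, here is a hypothetical sketch of my own (not code from the course): the load of `table[idx[i]]` happens only on the taken path, so a correctly predicted branch lets the processor start that load speculatively before the comparison resolves.

```julia
# Hypothetical sketch: the indirect load `table[idx[i]]` is only
# executed on the taken path, so correct branch prediction lets the
# CPU issue that load speculatively, ahead of the comparison result.
function sum_selected(flags, idx, table)
    s = 0.0
    for i in eachindex(flags)
        if flags[i]                 # branch the predictor must guess
            s += table[idx[i]]      # data-dependent load, taken path only
        end
    end
    return s
end
```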


Realized that it wasn’t branch prediction, it was cache locality in my example. Never mind :flushed:

OK, it is true that the findclosest2 function was probably not the best option to illustrate this, so I came up with my own function, which now does indeed nicely show the effect of branch prediction:

using Random
using BenchmarkTools

function test_branch_pred(n)
    @assert n % 2 == 0 "n has to be even"
    data = rand(n)
    mask_pred = vcat(trues(n ÷ 2), falses(n ÷ 2))
    mask_rand = shuffle(mask_pred)
    return data, mask_pred, mask_rand
end

function branch_pred(data, idx)
    result = 0
    for i in eachindex(idx)
        if idx[i]
            result += data[i]
        end
    end
    return result
end

So the idea is to do the same amount of work, but in one case the elements to add are randomly scattered, while in the other case the numbers to add are all in the first half of the data array (so the branch should become predictable and faster to compute):

julia> data,mask_pred,mask_rand = test_branch_pred(10_000_000);

julia> @btime branch_pred(data,mask_rand)
  64.559 ms (1 allocation: 16 bytes)
2.499738638877518e6

julia> @btime branch_pred(data,mask_pred)
  30.814 ms (1 allocation: 16 bytes)
2.499611161282873e6

You are almost certainly affected by memory locality and prefetching here.

Hi,

do you mean that the effect is due to cache issues rather than branch prediction? I don’t see how: the array data is always accessed in the same order, and the mask only says which data will be added.

Also, I see the effect even when the number of elements in data is very low:

julia> data,mask_pred,mask_rand = test_branch_pred(100);

julia> @btime branch_pred(data,mask_pred)
  364.587 ns (1 allocation: 16 bytes)
23.95964518146636

julia> @btime branch_pred(data,mask_rand)
  647.211 ns (1 allocation: 16 bytes)
22.66112571590415

No, the mask also controls which data is accessed.

(Also, I don’t know which effect is more important, but one is certainly more prefetch- and cache-friendly than the other.)

I am not so sure, @yuyichao. I have slightly changed @angelv’s code to do an equivalent computation in the other branch (the missing else), thereby forcing access to all the data: a big difference remains.

using Random
using BenchmarkTools

function test_branch_pred(n)
    @assert n % 2 == 0 "n has to be even"
    data = rand(n)
    mask_pred = vcat(trues(n ÷ 2), falses(n ÷ 2))
    mask_rand = shuffle(mask_pred)
    return data, mask_pred, mask_rand
end

function branch_pred(data, idx)
    result = zero(eltype(data))
    for i in eachindex(idx)
        if idx[i]
            result += data[i]
        else
            result -= data[i]
        end
    end
    return result
end

gives

julia> data,mask_pred,mask_rand = test_branch_pred(10_000_000);

julia> @btime branch_pred(data,mask_rand)
  50.318 ms (1 allocation: 16 bytes)
1484.1824331343212

julia> @btime branch_pred(data,mask_pred)
  13.271 ms (1 allocation: 16 bytes)
747.2835738745185

Yes, this is a much better benchmark than the original one above.


I was also modifying my function to get rid of the different accesses to data, which, it is true, would be accessed a bit differently depending on the mask.

So now I get rid of data altogether:

function branch_pred(idx)
    result = 0
    for i in eachindex(idx)
        if idx[i]
            result += 1.0
        end
    end
    return result
end

julia> data,mask_pred,mask_rand = test_branch_pred(100_000);

julia> @btime branch_pred($mask_rand)
  652.528 μs (0 allocations: 0 bytes)
50000.0

julia> @btime branch_pred($mask_pred)
  308.123 μs (0 allocations: 0 bytes)
50000.0

And if I check the cache misses and branch mispredicts with LinuxPerf, the cache_misses are almost the same (4012 vs. 3776), but the branch_mispredicts go from 124_943 to 476_370_176:

julia> @btime branch_pred($mask_pred)                
  331.900 μs (0 allocations: 0 bytes)           
50000.0                                         
                                                
julia> counters(bench)                                                  
hw:cycles:                                      
                 18682758098 (69.0 %)                
hw:cache_access:                                
                       26783 (68.9 %)                
hw:cache_misses:                                
                        4012 (68.9 %)                
hw:branches:                                    
                 11313994172 (67.7 %)                
hw:branch_mispredicts:                          
                      124943 (67.7 %)                
hw:instructions:                                
                 50104659688 (67.7 %)                
sw:ctx_switches:                                
                           0 (100.0 %)               
sw:page_faults:                                 
                           0 (100.0 %)               
sw:minor_page_faults:                           
                           0 (100.0 %)               
sw:major_page_faults:                                
                           0 (100.0 %)               
sw:cpu_migrations:                                   
                           0 (100.0 %)
                                                
                        
julia> reset!(bench)                    


julia> @btime branch_pred($mask_rand)
  673.435 μs (0 allocations: 0 bytes)
50000.0

julia> counters(bench)
hw:cycles: 
                 23242790416 (68.9 %)
hw:cache_access: 
                       34866 (68.8 %)
hw:cache_misses: 
                        3776 (68.8 %)
hw:branches: 
                  6681805539 (67.6 %)
hw:branch_mispredicts: 
                   476370176 (67.6 %)
hw:instructions: 
                 29590594824 (67.6 %)
sw:ctx_switches: 
                           0 (100.0 %)
sw:page_faults: 
                           0 (100.0 %)
sw:minor_page_faults: 
                           0 (100.0 %)
sw:major_page_faults: 
                           0 (100.0 %)
sw:cpu_migrations: 
                           0 (100.0 %)

And this is a better benchmark than mine, XD.

I think the fact that the difference in cycles and cache accesses/misses is small compared to the difference in branch mispredicts is a strong indicator. However, why is the number of instructions about 66% bigger in the code that runs fastest (i.e., the one that predicts right)? The two computations do the same work, just in a different order; if anything, the one with more branch mispredicts could have a larger instruction count, since it could be counting the wasted instructions that were executed in the pipeline and then thrown away when the mispredict was detected. Can someone with a better profiling background explain this?


Yes, I was also wondering about that (the bigger number of instructions in the ‘predictable’ run), but I don’t understand what could cause it.

Hmmm, strange: the difference disappears in the code below:

using Random
using BenchmarkTools

function test_branch_pred(n)
    @assert n % 2 == 0 "n has to be even"
    mask_pred = vcat(trues(n ÷ 2), falses(n ÷ 2))
    mask_rand = shuffle(mask_pred)
    return mask_pred, mask_rand
end

function branch_pred(idx)
    result = 0.0
    for i in eachindex(idx)
        if idx[i]
            result += 1.0
        else
            result -= 1.0
        end
    end
    return result
end

gives

julia> mask_pred,mask_rand = test_branch_pred(10_000_000);

julia> @btime branch_pred($mask_rand)
  10.560 ms (0 allocations: 0 bytes)
0.0

julia> @btime branch_pred($mask_pred)
  10.560 ms (0 allocations: 0 bytes)
0.0

It is possible that the compiler is optimizing the branch away; I would need to understand assembly better to know for sure. I wanted an if-else test that reproduced the difference, just to be sure.

As far as I understand, the assembly code will not help you understand what is going on there, since this is happening at the hardware level.

In any case, are you sure those numbers are legit? It looks a bit suspicious that you get exactly the same running time, down to the microsecond…

I tried with exactly the change you made (adding the else branch result -= 1.0) and my timings are:

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           W3690  @ 3.47GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, westmere)

julia> data,mask_pred,mask_rand = test_branch_pred(10_000_000);

julia> @btime branch_pred($mask_rand)
  53.112 ms (0 allocations: 0 bytes)
0.0

julia> @btime branch_pred($mask_pred)
  16.317 ms (0 allocations: 0 bytes)
0.0

:slight_smile: Just realized that my code was NOT exactly like yours. I get the performance boost for the predictable branch when I initialize result as result = 0. If I initialize it as result = 0.0, as you did, then I also get the same results as you: no improvement between the ‘predictable’ and the ‘random’ masks:

julia> @btime branch_pred($mask_rand)
  21.449 ms (0 allocations: 0 bytes)
0.0

julia> @btime branch_pred($mask_pred)
  21.449 ms (0 allocations: 0 bytes)
0.0

But in the version without the else branch, there is a big difference regardless of whether result is initialized as 0 or as 0.0.

I’m clueless about why the initialization of result can impact the number of branch mispredicts…
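One plausible explanation (my own addition, not confirmed in the discussion): with result = 0, the accumulator starts as an Int and becomes a Float64 after the first update, so inside the loop its type is a Union and the compiler union-splits it into real branches instead of a single select. The type instability can be seen from the inferred return type (the function name here is my own):

```julia
# Same as the if/else version above, but with the accumulator
# starting as an Int instead of a Float64.
function branch_pred_intinit(idx)
    result = 0              # Int here; becomes Float64 after one update
    for i in eachindex(idx)
        if idx[i]
            result += 1.0
        else
            result -= 1.0
        end
    end
    return result
end

# The inferred return type is a Union: Int64 for an empty input,
# Float64 otherwise, evidence of the type instability in the loop.
Base.return_types(branch_pred_intinit, (Vector{Bool},))
```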

OK, and my last one on this :slight_smile:

Contrary to what I said above, looking at the generated code certainly DOES help to understand what is going on:

If we have the following code:

function test_branch_pred_bool(n)
    @assert n % 2 == 0 "n has to be even"
    mask_pred = vcat(trues(n ÷ 2), falses(n ÷ 2))
    mask_rand = shuffle(mask_pred)
    return mask_pred, mask_rand
end

function branch_pred_bool(idx)
    result = 0.0
    for i in eachindex(idx)
        if idx[i]
            result += 1.0
        else
            result -= 1.0
        end
    end
    return result
end

julia> mask_pred_bool,mask_rand_bool = test_branch_pred_bool(10_000_000);

julia> @code_llvm branch_pred_bool(mask_rand_bool)

This generates the following code:

[...]

      %26 = icmp eq i64 %25, 0
; └└└└
  %value_phi6.v = select i1 %26, double -1.000000e+00, double 1.000000e+00
  %value_phi6 = fadd double %value_phi3, %value_phi6.v

so the value to add is chosen with the select instruction, with no branching involved, hence the same number of branch mispredicts and the same performance for the ‘predictable’ and ‘random’ mask runs.
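Incidentally, this branchless form can also be requested explicitly at the source level with Julia’s ifelse, which evaluates both arguments and picks one value, and typically lowers to the same select (the function name here is my own):

```julia
function branch_free(idx)
    result = 0.0
    for i in eachindex(idx)
        # `ifelse` evaluates both arms and selects one value,
        # so there is no branch for the hardware to (mis)predict
        result += ifelse(idx[i], 1.0, -1.0)
    end
    return result
end
```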

If we generate the mask as Ints and use elseif to select between 1 and 0 (if we use just else, the choice can again be computed with select):

function test_branch_pred_int(n)
    @assert n % 2 == 0 "n has to be even"
    mask_pred = vcat(ones(Int, n ÷ 2), zeros(Int, n ÷ 2))
    mask_rand = shuffle(mask_pred)
    return mask_pred, mask_rand
end

function branch_pred_int(idx)
    result = 0.0
    for i in eachindex(idx)
        if idx[i] == 1
            result += 1.0
        elseif idx[i] == 0
            result -= 1.0
        end
    end
    return result
end

julia> mask_pred_int,mask_rand_int = test_branch_pred_int(10_000_000);

julia> @code_llvm branch_pred_int(mask_rand_int)

Then, the generated code is:

[...]

L19:                                              ; preds = %idxend                                           
; └                                                                                                           
;  @ /home/angelv/JuliaAcademy_Parallel_Computing/test.jl:38 within `branch_pred_int'
; ┌ @ float.jl:401 within `+'      
   %15 = fadd double %value_phi3, 1.000000e+00
; └ 
;  @ /home/angelv/JuliaAcademy_Parallel_Computing/test.jl:36 within `branch_pred_int'
; ┌ @ range.jl:593 within `iterate'                                                                       
   br label %L25  

and there is a similar block for the result -= 1.0 statement. So in this version there is a real branch where the hardware can apply its magic and predict which way things will go.