IF...ELSEIF...ELSE performance

I need to use an IF…ELSEIF…ELSE statement in a large loop and I therefore did the following test to find the fastest implementation. In the following code a random number between 0 and 5 is chosen. Based on this outcome, the outcome is multiplied by 10. I know this is a silly procedure, but it is purely meant for test purposes. The 0 has the largest occurrence, the 5 the smallest. In the code I implement 3 ways to do so: by a dictionary, by an IF statement that starts testing for the 0 and ends with 5 and an IF statement that starts testing for the 5 and ends with the 0. My idea is that, since the 0 is created most often, the first IF statement would be faster than the second. The code looks like this:

function test()

    d = Dict(0 => 0, 1 => 10, 2 => 20, 3 => 30, 4 => 40, 5 => 50)

    a = Float64[]

    for i = 1:30000
        a = Float64[]
        for j = 1:10000
            g = rand(Float64)[1]*6 
            g *= g^2/36
            g = trunc(Int64,g)

            #Option 1
            g = d[g]

            #Option 2
            if g == 0
                g = 0
            elseif g == 1 
                g = 10
            elseif g == 2
                g = 20
            elseif g == 3
                g = 30
            elseif g == 4
                g = 40
            else
                g = 50
            end

            #Option 3
            if g == 5
                g = 50
            elseif g == 4 
                g = 40
            elseif g == 3
                g = 30
            elseif g == 2
                g = 20
            elseif g == 1
                g = 10
            else
                g = 0
            end

            push!(a,g)
        end
    end
    return a
end

a = test()


The push! statement is there since Julia needs to do something with the created numbers. Otherwise, Julia will just skip everything since nothing is done with those number.

For each test I comment two options so that I only use one option. The outcomes are as follows:

Without the three options (as a “zero” measurement")

julia> @time test()

  6.045474 seconds (420.01 k allocations: 7.343 GiB, 10.85% gc time)

With option 1 (dictionary)

julia> @time test()

  9.202082 seconds (420.01 k allocations: 7.343 GiB, 7.29% gc time)

With option 2 (IF with incremental numbers)

julia> @time test()

 10.113319 seconds (420.01 k allocations: 7.343 GiB, 7.89% gc time)

This shows that using a dictionary is faster that using IF statements. Now the test with option 3 (IF with decremental numbers, starting with 5):

julia> @time test()

  9.147968 seconds (420.01 k allocations: 7.343 GiB, 6.56% gc time)

Shorter?!! This is counter intuitive for me because the ‘0’ occurs much more often and, therefore, you would expect that starting checking for zeros in the IF statement would be faster that starting with ‘5’. But the opposite is true!

I have two questions:

  1. Why is this?
  2. What would be the best option if, based on the random number, a specific function needs to be called? I guess you cannot make a dictionary of functions …

Thanks a lot for your help.

PS: I performed the @time at least two times before I marked the outcome.

1 Like

Please ensure that the code is runnable.

1 Like

My post contains the correct code now.

1 Like

I can reproduce those results on my machine.

Note that as written, there’s a type instability with g, but it looks like the compiler is able to avoid costs associated with it. To get cleaner benchmark results, I’d also try to reduce the influence of the garbage collector a bit by pre-allocating a only once, and seeding the random number generator.

I’m not exactly sure why the difference between option 2 and 3 occurs (and it does seem to be reproducible and significant). But note that it’s not necessarily as simple as ‘fewer comparisons == better performance’, e.g. since modern CPUs are quite amazing at predicting which branches are likely to be taken.

2 Likes

I think what you are seeing might partially be due to GC times.
if I use ‘sum’ instead of ‘push!’ I get the timings below where option 3 is slower.
Admittedly, I would have expected the dict to be slower, but it is not.

using Random 
using BenchmarkTools 

function testifelse3()
    Random.seed!(221)
    a_sum=0

    for i = 1:3000
        for j = 1:1000
            g = rand()*6             
            g = trunc(Int64,g*(g^2/36))
            #Option 3
            if g == 5
                a_sum+=50
            elseif g == 4 
                a_sum+= 40
            elseif g == 3
                a_sum+= 30
            elseif g == 2
                a_sum+= 20
            elseif g == 1
                a_sum+= 10
            else
                a_sum+= 0
            end
        end
    end
    return a_sum
end


function testifelse2()
    Random.seed!(221)
    a_sum=0
    for i = 1:3000
        for j = 1:1000
            g = rand()*6             
            g = trunc(Int64,g*(g^2/36))
        #Option 2
            if g == 0
                a_sum+= 0
            elseif g == 1 
                a_sum+= 10
            elseif g == 2
                a_sum+= 20
            elseif g == 3
                a_sum+= 30
            elseif g == 4
                a_sum+= 40
            else
                a_sum+= 50
            end
          
        end
    end
    return a_sum
end

function testdict()
    Random.seed!(221)
    d = Dict(0 => 0, 1 => 10, 2 => 20, 3 => 30, 4 => 40, 5 => 50)
    a_sum=0

    for i = 1:3000
        for j = 1:1000
            g = rand()*6             
            g = trunc(Int64,g*(g^2/36))
            a_sum+= d[g]
        end
    end
    return a_sum
end

a = testifelse3()
b = testifelse2()
d = testdict()

@btime testifelse3() #52ms
@btime testifelse2() #44ms
@btime testdict() #39ms

@benchmark testifelse3() #52ms
@benchmark testifelse2() #44ms
@benchmark testdict() #39ms

1 Like

I was also able to reproduce the not expected difference in performance for option#2 and option#3.
You can see, what happens, when you look at the llvm and native code with @code_llvm and @code_native. I am not so good in interpreting this codes but you can clearly spot the comparisson and that it is not just register operation, compare (cmp) and jumps (jmp). Espcially what i was surprised of was, that there is not 5 compares and an else case (like in the example) but it seems to be optimized to 4 compares, an else case and a branch which is done for all cases. As far as I was able to read the native code (which is not very deep) I think that there lies the reason for the differences.

Here’s another way to write the benchmark

The benchmark loop:

using Random

function dosum(lookup)
    a_sum=0

    for i = 1:3000
        for j = 1:1000
            g = rand()*6             
            g = trunc(Int64,g*(g^2/36))
            a_sum += lookup(g)
        end
    end
    return a_sum
end

Various lookup kernels which have been discussed above:

const d = Dict(0 => 0, 1 => 10, 2 => 20, 3 => 30, 4 => 40, 5 => 50)
function lookup1(g)
    d[g]
end

function lookup2(g)
    if g == 0
        0
    elseif g == 1 
        10
    elseif g == 2
        20
    elseif g == 3
        30
    elseif g == 4
        40
    else
        50
    end
end

function lookup3(g)
    if g == 5
        50
    elseif g == 4 
        40
    elseif g == 3
        30
    elseif g == 2
        20
    elseif g == 1
        10
    else
        0
    end
end

Because the branches here are naturally rather unpredictable and the lookup table consists of fixed values, I thought ifelse might work better. It turns out it does:

function lookup4(g)
    ifelse(g == 5, 50, ifelse(g == 4, 40, ifelse(g == 3, 30, ifelse(g == 2, 20, ifelse(g == 1, 10, 0)))))
end

Results:

julia> using BenchmarkTools 
       @btime dosum(lookup1)
       @btime dosum(lookup2)
       @btime dosum(lookup3)
       @btime dosum(lookup4)

  30.975 ms (0 allocations: 0 bytes) # Dict
  35.940 ms (0 allocations: 0 bytes) # 0 first
  32.290 ms (0 allocations: 0 bytes) # 5 first
  14.177 ms (0 allocations: 0 bytes) # ifelse

ifelse wins by a large margin, presumably because it has removed all the conditional branches. These become a set of conditional moves (cmoveq and cmovneq) on my laptop i7.

Honestly I’m really impressed to see just how well the generic dictionary code does here.

I believe the reason that the expected speed of lookup2 and lookup3 are inverted is as @oheil says: the generated assembly does not happen in the order you wrote, though it’s functionally equivalent. Instead, @code_llvm lookup3(1) shows you that LLVM has reorganized things into a multiway switch for 5,4,3,2, with the cases 1,0 handled in the default clause (and the default case further turns into a conditional move in the native code).

Now, all this messing around with selecting integers may not tell you much about your real use case if you want to dispatch on g to call a set of functions. For that, it would be better to post a more complete example closer to your real code.

10 Likes

You can certainly make a dictionary of functions and looking an individual function up in the dictionary is probably quite fast. Unfortunately, actually calling that function will be relatively slow because you’d have to go through the generic dispatch machinery.

I think we need some more detail about what these functions will contain, whether they need to be updated dynamically, whether the set of values for g are statically known, etc. Without those, it’s hard to give good advice on how to structure this code.

FunctionWrappers should be able to help with that.

2 Likes

Just wanted to mention another approach (which may or may not be useful for your intended use case):

function lookup5(g)
    (0, 10, 20, 30, 40, 50)[g + 1]
end
julia> @btime dosum(lookup4);
  12.960 ms (0 allocations: 0 bytes)

julia> @btime dosum(lookup5);
  8.253 ms (0 allocations: 0 bytes)

Regarding FunctionWrappers, be aware of cfunction issue with FunctionWrappers · Issue #28669 · JuliaLang/julia · GitHub.

6 Likes

I tried all the different methods, and the fastest is definitely using
(0, 10, 20, 30, 40, 50)[g + 1]
which takes about 50 μs compared to about 120 μs for the other methods. I’m also surprised that putting 0 first in the elseif list took longer than putting it last, and the nested ifelse()s was fastest of that category.

But be careful because the slowest method is the similar looking
[0, 10, 20, 30, 40, 50][g + 1]
which takes about 380 μs so bear that in mind!

Also for code golf purposes, it’s nice to define
gen_g() = trunc(Int, 6rand()^3)
and
const d = Dict(i => 10i for i in 0:5)

2 Likes

This will be quite fast if you construct the array once only outside the lookup function. If you make a temporary array each time like written here then it’ll be very slow.

4 Likes

I can confirm the last reply of Chris Foster. If I add the fourth option to the original code:

t = range(0, step = 10, stop = 50) #defined outside the loops

#Option 4 (placed inside the loops)
g = t[g+1]

I measure

julia> @time test()
  7.478701 seconds (420.01 k allocations: 7.343 GiB, 11.25% gc time)

which is much faster than the other three options. Of course the GC time can be reduced to zero, but for this test this is not of importance. Working with an array instead of a tuple is easier since you can easily construct a large array by performing calculations on this array. With tuples this is more cumbersome/impossible (as far as I know).

I still need to find out how to select a method based on the value g. Perhaps with meta programming?

Thanks a lot everybody for the very helpful replies. It is good to know not to be left alone with Julia :grinning:

In julia, code is data (in many different senses). But in this case it applies in a very simple way:

julia> function foo(x)
           x^2
       end

julia> function bar(x)
           x^3
       end

julia> fxns = [foo, bar]
2-element Array{Function,1}:
 foo
 bar

julia> fxns[1](100)
10000

julia> fxns[2](100)
1000000

Applied like this, calling fxns[g](x) is unlikely to be fast, but as I mentioned above, we would need to know several more things about you are trying to do to know whether this will work well enough for you. And if not, what to replace it with.

Yes, I can confirm the speed difference. Simply doing (func0, … are defined somewhere else as function func0(x) return x*x end):

          if g == 0
                g=func0(3)
            elseif g == 1 
                g=func1(3)
            elseif g == 2
                g=func2(3)
            elseif g == 3
                g=func3(3)
            elseif g == 4
                g=func4(3)
            else
                g=func5(3)
            end
            push!(a,g)

Gives:

julia> @time test()
  0.720541 seconds (42.01 k allocations: 751.877 MiB, 11.23% gc time)

and

f = Function[func0,func1,func2,func3,func4,func5] #defined outside the loops

push!(a,f[g+1](3)) #inside the loops

which does the same gives

julia> @time test()
  2.400811 seconds (42.01 k allocations: 751.877 MiB, 3.37% gc time)

Thus 3x slower. Thus the trick of using an array of functions does not give faster results (as compared to the above). I need to use IF statements OR to change my code so that I can avoid the IF statements while going through the loops to make it even faster. I think this issue is not specifically related to Julia.

It’s really hard to know what to suggest without knowing more detail. You’ve abstracted away the real problem so we don’t know

  • What kind and how much work will be done inside func0func5? If it’s a lot of work, no amount of optimizing the branch will make a practical difference.
  • Does g actually take consecutive integer values in practice? (If it were, we could have jumped immediately to the solution of using an array or tuple, without the detour into benchmarking Dict etc).
  • Will the list of functions be constructed on the fly or is it known when writing the code itself?

etc etc.

If you’re at liberty to post the real code, please do so and we can give you relevant advice.

3 Likes

I think a few high-level lessons can be gleaned from this thread:

  1. Dicts are impressively fast.
  2. What LLVM does with an if/else nest is opaque and unpredictable. If that’s the only way to express your computation, so be it, but if there’s a more structured way, you’re probably better off preserving and exposing the structure.
  3. Tuples really rule for performance.

To a large extent, tuples exist to hint to the compiler that it should reason about the individual types and number of elements in an immutable collection. Otherwise we’d just use Vector{Any}—but that type serves as a hint of precisely the opposite: it tells the compiler not to worry about how many or what the specific types in a collection are.

7 Likes