Confusing performance results from simple conditional code

I am writing a toy game that uses very simple keyboard commands. I need to do a lot of checking to see if certain keys are pressed and I want to ignore case so I do a lot of things like this:

if keycode == 'c' || keycode == 'C'
    blah blah
end

That is ugly and a lot of typing if I have more than a few keys I want to check against. So I tried this for brevity:

if keycode in Set(['c', 'C'])
     blah blah
end

They both work fine but I was curious about performance so I ran a simple speed check (full code at the bottom)

I found that the “||” version was 6-7 times faster than the “Set” version ( I controlled for the || case exiting early, see code below)

So, I decided to try to code a macro to write the “||” version a bit more succinctly. Well that didn’t end well, so I decided to make a “multi_or” function that took in the test variable and the values to check against and returned a Bool (again, see below). That version was by far the fastest of the three, ~800 times faster than the “Set” version and ~140 times faster than the “||” version.

That really confused me. Can anyone hazard a guess at why the function multi_or is so much faster than the plain “||” code?

Here are test results:

julia> x=rand(100_000_000);
julia> include("speedcheck.jl")

julia> @time set_test(x)
 17.177306 seconds (500.00 M allocations: 40.233 GiB, 18.02% gc time)
49999593

julia> @time or_test(x)
  2.799505 seconds
49999593

julia> @time multi_or_test(x)
  0.020058 seconds
49999593

speedcheck.jl:

"""Call these tests with a random vector of floats to prevent the compiler 
from trying to outsmart the test"""

function set_test(vec::Vector{Float64})
    c::Char = 'g' 
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if c in Set(['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C']) 
            sum += 1
        end
    end
    return sum
end



function or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if c == 'c' || c == 'a' || c == 'b' || c == 'd' || c == 'e' || c == 'f' || c == 'F' || c == 'C' 
            sum += 1
        end
    end
    return sum
end

function multi_or(val, choices...)::Bool
    for choice in choices
        if val == choice
            return true
        end
    end
    return false
end


function multi_or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if multi_or(c, 'c', 'a', 'b', 'd', 'e', 'f', 'F', 'C') 
            sum += 1
        end
    end
    return sum
end
'''
1 Like

The “Set” version allocates memory, the other one does not.

julia> @allocations if keycode == 'c' || keycode == 'C'
           0
       end
0

julia> @allocations if keycode in Set(['c','C'])
           0
       end
31

julia> 
1 Like

An easier comparison that does not allocate and is compact:

if c in "cabdefFC"
  sum += 1
end

Potentially even easier: first convert c to lower case.

1 Like

Just to say that I tested the OP codes and what he asks is not related to the allocations. You can replace the set by a tuple and the result is the same. It was not obvious for me why one version is much faster, but it is.

I think it boils down to in being slower than a simple search for the first element of a collection (why?):

julia> @btime in('c',('a','b','c','d','e'))
  2.801 ns (0 allocations: 0 bytes)
true

julia> @btime findfirst(==('c'),('a','b','c','d','e'))
  1.692 ns (0 allocations: 0 bytes)
3
3 Likes

It doesn’t look like the provided multi_or('a','b','c') is really different from the implementation of in('a',('b','c')), so I’m surprised it performs differently. Nevertheless, I do see a massive performance difference between the two.

I also tried any(==(c), ('c','a','b','d','e','f','F','C')), which did not perform any better. But any(c .== ('c','a','b','d','e','f','F','C')) was faster than anything offered yet. This is because, for relatively small sets, it’s faster to just evaluate all comparisons and check for any to be true rather than to check each one sequentially and exit when you find one. This is because the “do all comparisons first” method can use SIMD while the early exit version requires extensive branching.

1 Like

is slower as it also handles missing while findfirst doesn’t. Specifically:

julia> in('c',('a','b',missing,'d','e'))
missing
4 Likes

Yes, although it is impossible for isequal(::Char,::Char) to produce a ::Missing result, so all processing of missings should compile away as dead code. It’s only when a missing is possible is that any missing handling is required.

Interesting. I will admit not having run OP’s code. But I’m currently working on a somewhat complex calculation that I run millions of time, and a lot of the slowdown comes from allocating memory. To illustrate the difference with a simple example:

julia> @btime for i in 1:10_000
           if keycode in ['c','C']
               0
           end
       end
  831.126 Îźs (10000 allocations: 625.00 KiB)

julia> @btime for i in 1:10_000
           if keycode == 'c' || keycode == 'C'
               0
           end
       end
  395.448 Îźs (0 allocations: 0 bytes)

julia> 
1 Like

Sorry for the multi-post. The below seems to indicate that “in” is not itself the issue.

julia> myset = ['c','C']
2-element Vector{Char}:
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'C': ASCII/Unicode U+0043 (category Lu: Letter, uppercase)

julia> @btime for i in 1:10_000
           if keycode in myset
               0
           end
       end
  217.037 Îźs (0 allocations: 0 bytes)

julia> ```

This is benchmarking in global scope. If you could hoist the ['c','C'] outside of the loop you would only allocate it once.

julia> @btime let cC = ['c','C']
           for i in 1:10_000
               if 'x' in cC
                   0
               end
           end
       end
  21.565 ns (1 allocation: 64 bytes)

But in general you should benchmark things inside of functions. In a function it wouldn’t (shouldn’t?) keep reallocating the vector for every loop iteration still would allocate on every iteration (see countC_bad), which is unfortunate. Also notice that this function “ran” 10000 iterations in 22ns, which is totally bogus. This loop was compiled away to nothing because it didn’t actually do anything. All it does is allocate and then drop the cC vector.

Here’s a better test

julia> function countC_bad(x)
           numc = 0
           for xx in x
               numc += in(xx,['c','C'])
           end
           return numc
       end
countC_bad (generic function with 1 method)

julia> function countC(x)
           numc = 0
           cC = ['c','C']
           for xx in x
               numc += in(xx,cC)
           end
           return numc
       end
countC (generic function with 1 method)

julia> function countC_tuple(x)
           numc = 0
           cC = ('c','C')
           for xx in x
               numc += in(xx,cC)
           end
           return numc
       end
countC_tuple (generic function with 1 method)

julia> x = rand('A':'z',10000);

julia> @btime countC_bad($x)
  218.100 Îźs (10000 allocations: 625.00 KiB)
297

julia> @btime countC($x)
  12.600 Îźs (1 allocation: 64 bytes)
297

julia> @btime countC_tuple($x)
  1.200 Îźs (0 allocations: 0 bytes)
297

Note that the reason the tuple version is so much faster is that it knows its checking 2 things and can compile for that specially, whereas the vector version has to be able to loop through an arbitrary number of things because the compiler “forgets” how long the vector is.

I stand corrected, and you are correct that the content of the loop gets optimized away. I realized as I saw you replying.

Still interesting that the allocation occurs at every iteration of the loop.

This website reminds us how costly allocations can be compared to simple operations: http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

1 Like

I think the benchmarking is just being done incorrectly, I get the following results:

  30.923 s (500000000 allocations: 40.23 GiB)
  32.961 ms (0 allocations: 0 bytes)
  32.918 ms (0 allocations: 0 bytes)

For the following code:

using BenchmarkTools

"""Call these tests with a random vector of floats to prevent the compiler
from trying to outsmart the test"""

function set_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if c in Set(['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'])
            sum += 1
        end
    end
    return sum
end



function or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if c == 'c' || c == 'a' || c == 'b' || c == 'd' || c == 'e' || c == 'f' || c == 'F' || c == 'C'
            sum += 1
        end
    end
    return sum
end

function multi_or(val, choices...)::Bool
    for choice in choices
        if val == choice
            return true
        end
    end
    return false
end


function multi_or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        #put the successful test value at the end to makes things worse
        if multi_or(c, 'c', 'a', 'b', 'd', 'e', 'f', 'F', 'C')
            sum += 1
        end
    end
    return sum
end

x = rand(100_000_000)

@btime set_test($x)
@btime or_test($x)
@btime multi_or_test($x)

Reusing your code I get (as a reference):

julia> @btime set_test($x)
  46.904 s (500000000 allocations: 40.23 GiB)
49995927

julia> @btime or_test($x)
  41.695 ms (0 allocations: 0 bytes)
49995927

julia> @btime multi_or_test($x)
  45.049 ms (0 allocations: 0 bytes)
49995927

Now putting the Set creation outside of the function:

julia> myset = Set(['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'])
Set{Char} with 8 elements:
  'C'
  'f'
  'c'
  'a'
  'd'
  'e'
  'b'
  'F'

julia> function set_test2(vec::Vector{Float64})
           c::Char = 'g'
           sum::Int64 = 0
           for val in vec
               if val > 0.5
                   c = 'C'
               else
                   c = 'K'
               end
               #put the successful test value at the end to makes things worse
               if c in myset
                   sum += 1
               end
           end
           return sum
       end
set_test2 (generic function with 1 method)

julia> @btime set_test2($x)
  4.513 s (0 allocations: 0 bytes)
49995927

Thank you for all the replies, it looks like the slowdown from my set_test was from naively letting the Set be created each time the conditional was executed (this seems like an optimization the compiler should have seen, but hey, I am a newb). Any variant of the in test is slower than using ||.

The difference between the || and the multi_or seems to be an artifact of how @time works. Using @btime gives matching results for both methods.

I don’t yet understand the difference between @time and @btime but I think this mystery is solved.

Now, back to trying to make my @multi_or macro…(just because it is a good exercise, I am starting to like Julia as much as I used to like Lisp)…

5 Likes

I was a little confounded by the performance difference I was seeing. I eventually chased it down to this:

or_chain(c) = c == 'c' || c == 'a' || c == 'b' || c == 'd' || c == 'e' || c == 'f' || c == 'F' || c == 'C'

any_chain(c) = any(==(c), ('c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'))

reduce_chain(c) = mapreduce(==(c), |, ('c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'))

@code_llvm debuginfo=:none or_chain('C')
define i8 @julia_or_chain_5162(i32 zeroext %0) #0 {
top:
  %1 = add i32 %0, -1627389952
  %2 = call i32 @llvm.fshl.i32(i32 %1, i32 %1, i32 8)
  %switch = icmp ult i32 %2, 6
  %3 = icmp eq i32 %0, 1174405120
  %4 = icmp eq i32 %0, 1124073472
  %spec.select = or i1 %3, %4
  %narrow = select i1 %switch, i1 true, i1 %spec.select
  %common.ret.op.in = zext i1 %narrow to i8
  ret i8 %common.ret.op.in
}

Notice that this is not doing a series of comparisons and early exits. The compiler decided it could do better. I’ll paraphrase the assembly above:

%1 = c - 1627389952
%2 = circularly shift %1 8 bits to the left # top byte now at bottom
%switch = %2 < 6 # 'a' <= c <= 'f'
%3 = c == 'F'
%4 = c == 'C'
%spec.select = %3 | %4
%narrow = %switch | %spec.select
%common.ret.op.in = Bool(%narrow)
return %common.ret.op.in

So we see that the code is not doing a branch tree. It checks whether the input c is in the range ‘a’:‘f’ or is equal to ‘F’ or ‘C’. So while this was great for this example, it may not be able to be so clever for other sets of targets.

I won’t go into it here, but if you look at the @code_native any_chain('C') you will see that it does make an if-chain (this is less clear from the @code_llvm but you can look at that too). Meanwhile, @code_llvm reduce_chain('C') shows that it checks each value individually but does not branch, instead or-ing the checks as it goes and simply returning at the end. Branching is expensive, so for somewhat small target sets or ones where you expect to need to check a significant fraction of the values (i.e., won’t usually find a match early in the set), reductions require many fewer branches so can be much faster.

While the any version was slow when I made a comparable test function (probably because of all the branches), the reduce version was comparable to the or_chain version (that was compiled to something much more specialized). So I wouldn’t get too caught up in the @multi_or macro (except as a personal exercise). One of the any or reduce solutions should be a close competitor except when your target is entire ranges of values whose structure can be exploited to hyper-optimize the code.

3 Likes

Thank you for the insight. I hadn’t seen the @code_... macros. Those are handy.

The @multi_or macro has become a quest. I am going to that stupid thing working. Then probably never use it…

1 Like

One last post to tie this up and put a bow on it. I got the @multi_or macro working and implemented all of the various methods described in this post.

Using a 100,000,000 element Float64 vector as input here are the results for the various methods using @benchmark

Method @benchmark results
hand coded chained || 15.8ms
function multi_or 15.8ms
@multi_or 15.8ms
mapreduce() 15.8ms
any() 244ms
in [...] 661ms
in "<chars>" 1400ms
in Set([...]) 17000ms

That is a pretty huge disparity. I would never have though of mapreduce in this context, but it is the clear winner given that it has the least code written equal to the best performance.

Syntactically I like using the @multi_or macro I wrote, but since it took me half a life time to get it working it was only worth the education value and not the time saved in the code.

Thank you to all that contributed. I learned a huge amount and hopefully this thread can help other moving forward.

All of the code in one place:



using BenchmarkTools
x = rand(100_000_000)


# 15.8ms
function or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if c ==
           'c' || c == 'a' || c == 'b' || c == 'd' || c == 'e' || c == 'f' || c == 'F' || c == 'C'
            sum += 1
        end
    end
    return sum
end

function multi_or(val, choices...)::Bool
    for choice in choices
        if val == choice
            return true
        end
    end
    return false
end

# 15.816ms
function multi_or_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if multi_or(c, 'c', 'a', 'b', 'd', 'e', 'f', 'F', 'C')
            sum += 1
        end
    end
    return sum
end


macro multi_or(val, choices...)
    expr = :($val == $(choices[1]))
    for choice in choices[2:end]
        expr = Expr(:(||), :($val == $(choice)), expr)
    end
    return esc(expr)
end

# 15.831ms
function multi_or_macro_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if @multi_or c 'c' 'a' 'b' 'd' 'e' 'f' 'F' 'C'
            sum += 1
        end
    end
    return sum
end

# 661ms
function set_test3(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    t_set = ['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C']
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if c in t_set
            sum += 1
        end
    end
    return sum
end


# 989ms
function set_test2(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    t_set = Set(['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'])
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if c in t_set
            sum += 1
        end
    end
    return sum
end

# 1.395s
function string_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if c in "cabdefFC"
            sum += 1
        end
    end
    return sum
end

# 17s
function set_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if c in Set(['c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'])
            sum += 1
        end
    end
    return sum
end

#244ms
function any_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if any(==(c), ('c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'))
            sum += 1
        end
    end
    return sum
end


#15.833ms
function mapreduce_test(vec::Vector{Float64})
    c::Char = 'g'
    sum::Int64 = 0
    for val in vec
        if val > 0.5
            c = 'C'
        else
            c = 'K'
        end
        if mapreduce(==(c), |, ('c', 'a', 'b', 'd', 'e', 'f', 'F', 'C'))
            sum += 1
        end
    end
    return sum
end
4 Likes