If control flow 40X slow down


#1

Hi,

By adding some work in the if statement (test2 function), even though the evaluation always false and so nothing has been done in the if block, I got 40X slow down compare to real nothing (test) in the if blocks.

function test(m::Array{Int8,2})
    n = size(m)[1]
    progress = 0
    @inbounds for i in 1:n, j in i:n
        nonoverlap = true
        progress+=1
        if progress%100000000==0
            #do nothing
        end
    end
end
function test2(m::Array{Int8,2})
    n = size(m)[1]
    progress = 0
    @inbounds for i in 1:n, j in i:n
        nonoverlap = true
        progress+=1
        if progress%100000000==0
            @printf("%d\n",progress)
        end
    end
end

m=convert(Array{Int8,2}, rand(0:1, 22, 1000))
benchmark test(m) and test(m2) results are attached. I am confused what are the extra running time comes from. Thanks.
image
image


#2

Probably because the compiler is smart enough to elide most of test, but not test2:

julia> @code_native test(a)
	.section	__TEXT,__text,regular,pure_instructions
; Function test {
; Location: REPL[1]:2
	decl	%eax
	movl	%esi, -8(%esp)
; Location: REPL[1]:8
	decl	%eax
	movl	$174055432, %eax        ## imm = 0xA5FE008
	addl	%eax, (%eax)
	addb	%al, (%eax)
	retl
;}

vs test2:
https://pastebin.com/Z2JmTd1L (was too long to paste here - also I changed @printf to printf.)


#3

I see that you have tried to quote your code, but it’s not quite right. Please take a look at this post to see how to format code for discourse: PSA: how to quote code with backticks

You should also remember to interpolate variables when using BenchmarkTools: @benchmark test($m), instead of @benchmark test(m). It may not make a difference here, but in general, it does, and it removes the need to double-check if performance issues are due to the lack of interpolation.


#4

When benchmarking it is important to be able to do some napkin math to figure out if the results one get are at all reasonable.

For example, here you have a double loop going doing 22 * 22 ~ 500 iterations. As a very rough estimate, we can say that a CPU can do 1 integer addition per cycle. So a 3 GHz CPU should be able to do 500 iterations in

500 / (3 * 10^9) * 10^9 ns which is approximately 170 ns. So clearly, getting a result of 4 ns is just not reasonable.


#5

Thank you for the link! I was looking for that.

function test(m::Array{Int8,2})
    n = size(m)[1]
    progress = 0
    @inbounds for i in 1:n, j in i:n
        nonoverlap = true
        progress+=1
        if progress%100000000==0
            #do nothing
        end
    end
end
function test2(m::Array{Int8,2})
    n = size(m)[1]
    progress = 0
    @inbounds for i in 1:n, j in i:n
        nonoverlap = true
        progress+=1
        if progress%100000000==0
            @printf("%d\n",progress)
        end
    end
end

#6

There is nothing going on inside this block, therefore it is not necessary to evaluate the condition. And therefore it is not necessary to evaluate the loop either. The whole function is basically optimized away.

Even though it’s alway false, you don’t know that a priori, so you still have to evaluate the condition at every step of the loop.


#7

I change the block to

if progress%100000000==0
    1+1
end
'''
it still gives me ~5ns result.

#8

Further problems:

  1. progress is not returned, so test2(m::Array{Int8, 2}) = nothing is an equivalent function (if the @printf is removed).

  2. The optimizer knows math so it can just figure out what the end result of progress from the loop ranges.


#9

1+1 also doesn’t do anything visible from the caller. A compiler is free to optimize as much as it wants as long as it doesn’t change any observable result. This is known as the “as if” rule.


#10

so it does not run the for loop at all?


#11

The @printf macro generates a ton of inline code (which is not great and should be fixed), you could try putting it in its own function and see if that helps at all.


#12

A trivial loop can easily be optimized away entirely. This can happen even in non-trivial cases:

function range_sum(a::Int, b::Int)
    s = 0
    for i = min(a,b):max(a,b)
        s += i
    end
    return s
end
julia> @code_llvm range_sum(-10, 10)

define i64 @julia_range_sum_220933050(i64, i64) {
  %2 = icmp slt i64 %1, %0
  %3 = select i1 %2, i64 %0, i64 %1
  %4 = select i1 %2, i64 %1, i64 %0
  %5 = sub i64 %3, %4
  %6 = add i64 %3, -1
  %7 = sub i64 %6, %4
  %8 = zext i64 %7 to i65
  %9 = zext i64 %5 to i65
  %10 = mul i65 %8, %9
  %11 = lshr i65 %10, 1
  %12 = trunc i65 %11 to i64
  %13 = add i64 %4, 1
  %14 = mul i64 %5, %13
  %15 = add i64 %4, %14
  %16 = add i64 %15, %12
  ret i64 %16
}

The function is computed in closed form, no loops. Compilers are very good at integer math and know how to simplify basic integer summation problems.