Why does exporting a single Float64 from a function take as long as 2,000,000 addition steps

#1

Hi Forum,

Can someone enlighten me as to why a single Float64 export from inside a function takes as much computation time as 2,000,000 additions? See code below from Julia 1.1.

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.008286 seconds (11.14 k allocations: 527.958 KiB)
  0.000303 seconds
  0.000302 seconds
  0.000302 seconds

and compare that computation time with the computation time seen below:

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
       data.x[1]=s
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.012986 seconds (11.54 k allocations: 548.084 KiB)
  0.000886 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)

and compare the above computation time to the computation time below:

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
       return s
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.012986 seconds (11.54 k allocations: 548.084 KiB)
  0.000886 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)

I also tried another version wherein I swapped out “return s” for “a=1.0”, and the computation time was back down to 0.0003 seconds. Therefore the function sum_arg() cannot output its value without incurring a time penalty greater than 2,000,000 addition steps? Please enlighten me.

#2

Your first function doesn’t do anything, so the optimizer can freely replace the whole body with “return nothing”. I’m not sure this is exactly what is happening here, but you need to be careful when you benchmark.
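The two situations being compared can be sketched as follows (the helper names sum_noret and sum_ret are hypothetical, just for illustration):

```julia
# A summing loop whose result is never used and has no side effects:
# the compiler is allowed to optimize the whole body away.
function sum_noret(x)
    s = 0.0
    for v in x
        s += v
    end
end

# The same loop, but the result is returned, so the work is observable
# and must actually be performed (or replaced by something equivalent).
function sum_ret(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end
```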

1 Like
#3

I think this simply isn’t a very useful benchmark. The first version of your function returns nothing and has no side effects. Thus, there is no reason that the compiler needs to actually execute it, so the amount of time it takes is not informative.

1 Like
#4

FYI: with BenchmarkTools, use @btime sum_arg($data). Also, assigning macro calls to functions that way may not do what you expect; just write the macro expression directly.
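A minimal, self-contained sketch of the suggested pattern, assuming BenchmarkTools.jl is installed (the struct is renamed TestStruct here only to keep the snippet independent of the original module):

```julia
using BenchmarkTools

struct TestStruct
    x::Vector{Float64}
end

const data = TestStruct(rand(1000))

function sum_arg(data)
    s = 0.0
    for _ in data.x      # outer pass: 1000 iterations
        for v in data.x  # inner pass: 1000 additions each
            s += v
        end
    end
    return s
end

# Interpolating `data` with `$` tells BenchmarkTools to treat it as a
# local variable rather than a non-constant global, and @btime runs
# many samples instead of timing a single call:
@btime sum_arg($data)
```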

1 Like
#5

About formatting Julia source code:

three backticks
lines of code indented in your text editor
(or look at / read about Juno – an Atom-based IDE)
three backticks

https://docs.julialang.org/en/v1.1/manual/style-guide/#Style-Guide-1


1 Like
#6

@kristoffer.carlsson that’s a confusing response, Kristoffer.

  1. the “first function” executes 1 million addition steps. So why do you say it “doesn’t do anything”?

  2. Why would “return s” cost 0.6 ms while 1 million addition steps costs 0.3 ms?

#7

Hi @rdeits,

Why do you say “there is no reason that the compiler needs to actually execute it” when my (maybe naive) understanding is:

  1. the compiler compiles – it doesn’t “execute”.

  2. I thought the @btime macro measures execution duration, not compiling duration. So why are you mentioning the compiler?

  3. maybe you mean to say “there is no reason that the CPU needs to actually execute it” … but then how does the 1 million additions get executed?

Can you please explain clearly why you think it’s not a good benchmark time measurement? That method of “timer() = @time sum_arg(data)” was recommended on this forum by another user. That’s why I used it. I’m happy to use a different benchmark method.

#8

What about this?

function i_crash_your_computer()
    s = 0
    for i in 1:1_000_000_000_000_000_000_000_000_000_000
        s += 1
    end
    return 23
end
julia> i_crash_your_computer()
23

julia> @btime i_crash_your_computer()
  1.301 ns (0 allocations: 0 bytes)

and the machine code:

julia> @code_native i_crash_your_computer()

	.text
	movl	$23, %eax
	retq
	nopw	%cs:(%rax,%rax)

As you can see, there is a difference between what you write down in your source file and what comes out when the code is turned into machine code by LLVM.

Another example:

function sumup(N)
    s = 0
    for i in 1:N
        s += i
    end
    return s
end

This should do a loop of N iterations and sum up the numbers. However, LLVM figures out that it can compute N*(N+1)/2 instead :wink:

julia> @code_native sumup(1_000_000_000)

	.text
	testq	%rdi, %rdi
	jle	L32
	leaq	-1(%rdi), %rdx
	leaq	-2(%rdi), %rax
	mulxq	%rax, %rax, %rcx
	shldq	$63, %rax, %rcx
	leaq	(%rcx,%rdi,2), %rax
	addq	$-1, %rax
	retq
L32:
	xorl	%eax, %eax
	retq
	nopw	%cs:(%rax,%rax)
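A quick sanity check that the loop really agrees with the closed form (repeating the definition so the snippet is self-contained):

```julia
function sumup(N)
    s = 0
    for i in 1:N
        s += i
    end
    return s
end

# The loop result matches Gauss's formula N*(N+1)/2:
@assert sumup(100) == 100 * 101 ÷ 2
@assert sumup(1_000_000) == 1_000_000 * 1_000_001 ÷ 2
```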
11 Likes
#9

@tamasgal I don’t understand how your example codes are relevant. I think maybe you’re missing the point: I executed the exact same code in my two code examples; the only difference between them was a “return s” at the end of the function. The “return s” creates a LONG increase in execution time. Why?

#10

The example is relevant because it shows that when the result of a computation is unused and has no effect, the compiler can remove it completely. The same thing is happening in your benchmark.

4 Likes
#11

Nope, I understand what you don’t understand; you just have to wrap your head around it. :wink:

The compiler is clever. If you write a function which does not return anything and has no side effects, the compiler will create machine code which does exactly nothing. Just like Kristoffer told you in the first reply.
In my examples you see the exact same behaviour. Julia is a high-level language and you usually don’t tell your computer what it has to do “step-by-step”. You write your algorithm and let the compiler do the rest.

1 Like
#12

Just to keep belaboring the point, the observable external effects of the following functions are the same:

function f1()
  x = 0
  for i in 1:1000
    x = x + i
  end
end

function f2()
end

f1 returns nothing and doesn’t modify any external state, and so does f2. The compiler is thus free to transform f1 into f2 if it sees fit to do so, since there is no difference in their effects.

Whether or not your CPU actually executed some particular number of addition instructions is considered to be an implementation detail, not a measurable effect. This is not just a cop-out; it’s actually crucial in Julia or any other programming language: Without that assumption, every single performance improvement in the language would have to be considered a breaking change. That’s obviously not something anyone wants.

Note also that this is a general challenge when benchmarking code in languages with optimizing compilers. You will see similar behavior in C++, for example (depending perhaps on your -O settings).
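The premise above (that f1 and f2 have identical observable behavior) can be verified directly:

```julia
function f1()
    x = 0
    for i in 1:1000
        x = x + i
    end
end

function f2()
end

# Both functions return `nothing` and modify no external state, so the
# compiler is free to compile them to the same machine code:
@assert f1() === nothing
@assert f2() === nothing
```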

4 Likes
#13

Aha. Thank you Jeff and Tamasgal for explaining it clearly.

#14

Aha… thanks to Jeff and Tamasgal for explaining it clearly. So you’re saying the compiler decides NOT to create code to execute the summation that I clearly told it to perform.

Ok, now I understand. I’m a newbie to high performance code, so thank you for enlightening me.

2 Likes
#15

For your reference, in today’s world the hardware also won’t do exactly what you clearly tell it to do, and it has been that way for decades.

2 Likes
#16

Got it – thanks.

#17

You probably don’t mean it this way, but in a written form this comes across as making a point of not thanking @tamasgal. He actually went through considerable effort to give examples illustrating what’s going on. Credit to both Jeff and Tamas (and others in the thread) seems in order :slight_smile:

1 Like
#18

Ya, when I saw his suggested code of “i_crash_your_computer()”, I got a little bit worried that he was trying to get me to enter some dangerous code into my computer… so his response weirded me out a little bit. Then his quick jump to machine code was confusing because I haven’t worked with machine code since 1998, so it was all so confusing that I couldn’t gain any insight from it. Now that I understand better, I definitely appreciate his post, and will edit my responses accordingly.

2 Likes
#19

I agree, the function name might sound scary, but that’s why I executed it on my machine and showed that it’s ok :wink:

No worries, all good! Have fun with Julia :slight_smile:

2 Likes
#20

BTW, in case you missed it: your code actually uses the @time macro, not the @btime macro. The latter is located in BenchmarkTools.jl, while the former is in Base. You should use @btime.
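The difference in practice, assuming BenchmarkTools.jl is installed, looks like this:

```julia
using BenchmarkTools  # provides @btime; @time lives in Base

xs = rand(1000)

@time sum(xs)    # times a single run; the first call also includes
                 # compilation time and one-off noise
@btime sum($xs)  # runs many samples, reports the minimum, and
                 # excludes compilation overhead
```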