Why does exporting a single Float64 from a function take as long as 2,000,000 addition steps

#1

Hi Forum,

Can someone enlighten me as to why a single Float64 export from inside a function takes as much computation time as 2,000,000 additions? See code below from Julia 1.1.

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.008286 seconds (11.14 k allocations: 527.958 KiB)
  0.000303 seconds
  0.000302 seconds
  0.000302 seconds

and compare that computation time with the computation time seen below:

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
       data.x[1]=s
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.012986 seconds (11.54 k allocations: 548.084 KiB)
  0.000886 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)

and compare the above computation time to the computation time below:

module test
   struct teststruct
       x::Array{Float64,1}
   end
   data=teststruct(rand(1000))
   function sum_arg(data)
       s = 0.0
       for i in data.x
           for i in data.x
               s += i
           end
       end
       return s
   end
   using BenchmarkTools
   timer() = @time sum_arg(data)
   timer()
   timer()
   timer()
   timer()
end
OUTPUTS:
  0.012986 seconds (11.54 k allocations: 548.084 KiB)
  0.000886 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)
  0.000884 seconds (2 allocations: 32 bytes)

I also tried another version wherein I swapped out “return s” for “a=1.0”, and the computation time was back down to 0.0003 seconds. Therefore the function sum_arg() cannot output its value without incurring a time penalty greater than 2,000,000 addition steps? Please enlighten me.

#2

Your first function doesn’t do anything, so the optimizer can freely replace the whole body with “return nothing”. I’m not sure this is exactly what is happening here, but you need to be careful when you benchmark.
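The two situations being compared can be sketched as follows (the helper names sum_noret and sum_ret are hypothetical, just for illustration):

```julia
# A summing loop whose result is never used and has no side effects:
# the compiler is allowed to optimize the whole body away.
function sum_noret(x)
    s = 0.0
    for v in x
        s += v
    end
end

# The same loop, but the result is returned, so the work is observable
# and must actually be performed (or replaced by something equivalent).
function sum_ret(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end
```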

1 Like
#3

I think this simply isn’t a very useful benchmark. The first version of your function returns nothing and has no side effects. Thus, there is no reason that the compiler needs to actually execute it, so the amount of time it takes is not informative.

1 Like
#4

FYI: with BenchmarkTools, use @btime sum_arg($data). Also, assigning macro calls to functions that way may not do what you expect; just write the macro expression directly.
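A minimal, self-contained sketch of the suggested pattern, assuming BenchmarkTools.jl is installed (the struct is renamed TestStruct here only to keep the snippet independent of the original module):

```julia
using BenchmarkTools

struct TestStruct
    x::Vector{Float64}
end

const data = TestStruct(rand(1000))

function sum_arg(data)
    s = 0.0
    for _ in data.x      # outer pass: 1000 iterations
        for v in data.x  # inner pass: 1000 additions each
            s += v
        end
    end
    return s
end

# Interpolating `data` with `$` tells BenchmarkTools to treat it as a
# local variable rather than a non-constant global, and @btime runs
# many samples instead of timing a single call:
@btime sum_arg($data)
```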

1 Like
#5

About formatting Julia source code:

three backticks
lines of code indented in your text editor
(or look at / read about Juno – an Atom-based IDE)
three backticks

https://docs.julialang.org/en/v1.1/manual/style-guide/#Style-Guide-1


1 Like
#6

@kristoffer.carlsson that’s a confusing response, Kristoffer.

  1. the “first function” executes 1 million addition steps. So why do you say it “doesn’t do anything”?

  2. Why would “return s” cost 0.6 ms while 1 million addition steps costs 0.3 ms?

#7

Hi @rdeits,

Why do you say “there is no reason that the compiler needs to actually execute it” when my (maybe naive) understanding is:

  1. the compiler compiles – it doesn’t “execute”.

  2. I thought the @btime macro measures execution duration, not compiling duration. So why are you mentioning the compiler?

  3. maybe you mean to say “there is no reason that the CPU needs to actually execute it” … but then how does the 1 million additions get executed?

Can you please explain clearly why you think it’s not a good benchmark time measurement? That method of “timer() = @time sum_arg(data)” was recommended on this forum by another user. That’s why I used it. I’m happy to use a different benchmark method.

#8

What about this?

function i_crash_your_computer()
    s = 0
    for i in 1:1_000_000_000_000_000_000_000_000_000_000
        s += 1
    end
    return 23
end
julia> i_crash_your_computer()
23

julia> @btime i_crash_your_computer()
  1.301 ns (0 allocations: 0 bytes)

and the machine code:

julia> @code_native i_crash_your_computer()

	.text
	movl	$23, %eax
	retq
	nopw	%cs:(%rax,%rax)

As you can see, there is a difference between what you write down in your source file and what comes out when the code is turned into machine code by LLVM.

Another example:

function sumup(N)
    s = 0
    for i in 1:N
        s += i
    end
    return s
end

This should do a loop of N iterations and sum up the numbers. However, LLVM figures out that it can compute N*(N+1)/2 instead :wink:

julia> @code_native sumup(1_000_000_000)

	.text
	testq	%rdi, %rdi
	jle	L32
	leaq	-1(%rdi), %rdx
	leaq	-2(%rdi), %rax
	mulxq	%rax, %rax, %rcx
	shldq	$63, %rax, %rcx
	leaq	(%rcx,%rdi,2), %rax
	addq	$-1, %rax
	retq
L32:
	xorl	%eax, %eax
	retq
	nopw	%cs:(%rax,%rax)
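A quick sanity check that the loop really agrees with the closed form (repeating the definition so the snippet is self-contained):

```julia
function sumup(N)
    s = 0
    for i in 1:N
        s += i
    end
    return s
end

# The loop result matches Gauss's formula N*(N+1)/2:
@assert sumup(100) == 100 * 101 ÷ 2
@assert sumup(1_000_000) == 1_000_000 * 1_000_001 ÷ 2
```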
11 Likes
#9

@tamasgal I don’t understand how your example codes are relevant. I think maybe you’re missing the point: I executed the exact same code in my two code examples; the only difference between them was a “return s” at the end of the function. The “return s” creates a LONG increase in execution time. Why?

#10

The example is relevant because it shows that when the result of a computation is unused and has no effect, the compiler can remove it completely. The same thing is happening in your benchmark.

4 Likes
#11

Nope, I understand what you don’t understand; you just have to wrap your head around it. :wink:

The compiler is clever. If you write a function which does not return anything and has no side effects, the compiler will create machine code which does exactly nothing. Just like Kristoffer told you in the first reply.
In my examples you see the exact same behaviour. Julia is a high-level language and you usually don’t tell your computer what it has to do “step-by-step”. You write your algorithm and let the compiler do the rest.

1 Like
#12

Just to keep belaboring the point, the observable external effects of the following functions are the same:

function f1()
  x = 0
  for i in 1:1000
    x = x + i
  end
end

function f2()
end

f1 returns nothing and doesn’t modify any external state, and so does f2. The compiler is thus free to transform f1 into f2 if it sees fit to do so, since there is no difference in their effects.

Whether or not your CPU actually executed some particular number of addition instructions is considered to be an implementation detail, not a measurable effect. This is not just a cop-out; it’s actually crucial in Julia or any other programming language: Without that assumption, every single performance improvement in the language would have to be considered a breaking change. That’s obviously not something anyone wants.

Note also that this is a general challenge when benchmarking code in languages with optimizing compilers. You will see similar behavior in C++, for example (depending perhaps on your -O settings).
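The premise above (that f1 and f2 have identical observable behavior) can be verified directly:

```julia
function f1()
    x = 0
    for i in 1:1000
        x = x + i
    end
end

function f2()
end

# Both functions return `nothing` and modify no external state, so the
# compiler is free to compile them to the same machine code:
@assert f1() === nothing
@assert f2() === nothing
```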

4 Likes
#13

Aha. Thank you Jeff and Tamasgal for explaining it clearly.

#14

Aha… thanks to Jeff and Tamasgal for explaining it clearly. So you’re saying the compiler decides NOT to create code to execute the summation that I clearly told it to perform.

Ok, now I understand. I’m a newbie to high performance code, so thank you for enlightening me.

2 Likes
#15

For your reference, in today’s world the hardware also won’t do exactly what you clearly tell it to do, and it has been that way for decades.

2 Likes
#16

Got it – thanks.

#17

You probably don’t mean it this way, but in a written form this comes across as making a point of not thanking @tamasgal. He actually went through considerable effort to give examples illustrating what’s going on. Credit to both Jeff and Tamas (and others in the thread) seems in order :slight_smile:

1 Like
#18

Ya, when I saw his suggested code of “i_crash_your_computer()”, I got a little bit worried that he was trying to get me to enter some dangerous code into my computer… so his response weirded me out a little bit. Then his quick jump to machine code was confusing because I haven’t worked with machine code since 1998, so it was all so confusing that I couldn’t gain any insight from it. Now that I understand better, I definitely appreciate his post, and will edit my responses accordingly.

2 Likes
#19

I agree, the function name might sound scary, but that’s why I executed it on my machine and showed that it’s ok :wink:

No worries, all good! Have fun with Julia :slight_smile:

2 Likes
#20

BTW, in case you missed it: your code actually uses the @time macro, not the @btime macro. The latter is located in BenchmarkTools.jl, while the former is in Base. You should use @btime.
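The difference in practice, assuming BenchmarkTools.jl is installed, looks like this:

```julia
using BenchmarkTools  # provides @btime; @time lives in Base

xs = rand(1000)

@time sum(xs)    # times a single run; the first call also includes
                 # compilation time and one-off noise
@btime sum($xs)  # runs many samples, reports the minimum, and
                 # excludes compilation overhead
```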