Why isn't 10^6 evaluated at compile time?

I was optimizing some code in the hot path and was surprised to see that constants written as 10^6 were being evaluated at run-time. Isn’t evaluating constants like this one of the simplest optimizations? Why wouldn’t Julia simplify it?

julia> function f(x)
        x*10^6
       end
f (generic function with 1 method)

julia> @code_native f(1)
        .text
; ┌ @ REPL[27]:1 within `f'
        pushq   %rbx
        movq    %rdi, %rbx
; │ @ REPL[27]:2 within `f'
; │┌ @ none within `literal_pow'
; ││┌ @ none within `macro expansion'
; │││┌ @ intfuncs.jl:273 within `^'
        movabsq $power_by_squaring, %rax
        movl    $10, %edi
        movl    $6, %esi
        callq   *%rax
; │└└└
; │┌ @ int.jl:87 within `*'
        imulq   %rbx, %rax
; │└
        popq    %rbx
        retq
; └

Because 1e6 exists.

Sorry, I’m not familiar with this. If one uses 1e6 instead of 10^6, is that number then precomputed at compile time, and if so, does that speed up the running of the code?

This isn’t the answer, and it would lead to a function with different behavior: it would always produce floats with at least 64-bit precision.
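To make the behavior difference concrete, here is a quick sketch of the literal types (these lines are mine, not from the thread; `Int` is `Int64` on a 64-bit system):

```julia
# 1e6 is a Float64 literal; 10^6 and 1_000_000 are Int
# (Int64 on a 64-bit system).
@show typeof(1e6)        # Float64
@show typeof(10^6)       # Int64
@show typeof(1_000_000)  # Int64

# Multiplying an Int by 1e6 promotes the result to Float64,
# which is the behavior change mentioned above:
@show typeof(2 * 1e6)        # Float64
@show typeof(2 * 1_000_000)  # Int64
```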

I’m afraid I don’t know the answer, @Ian_Slagle, but you could use 1_000_000 instead (the underscores are optional).

BTW it may not actually make a difference at runtime, appearances notwithstanding. You should benchmark with BenchmarkTools.jl to tell for sure.

I think the assembler code is evidence enough, but here you go:

julia> using BenchmarkTools

julia> function f1(x)
           x*10^6
       end
f1 (generic function with 1 method)

julia> function f2(x)
           x*1e6
       end
f2 (generic function with 1 method)

julia> @btime f1(x)  setup=(x=rand()) evals=1;
  32.000 ns (0 allocations: 0 bytes)

julia> @btime f2(x)  setup=(x=rand()) evals=1;
  29.000 ns (0 allocations: 0 bytes)

1e6 is a float; you want to test x*10^6 against x*1_000_000.

You’re right…

No difference though:

julia> function f2(x)
           x*1_000_000
       end
f2 (generic function with 1 method)

julia> @btime f2(x)  setup=(x=rand()) evals=1;
  29.000 ns (0 allocations: 0 bytes)

It seems like there is no purity guarantee for that method.

Hmmm, you’re calling it with the output of rand(), which is a float, so it’s all going to be converted to float anyway. If you’re using floats anyway, 1e6 is your best bet.

I didn’t tell the full story: it’s (almost) the same when calling with Int64 :wink: I just assumed the usual call of the original function is with a float…

julia> @btime f1(x)  setup=(x=rand(Int64)) evals=1;
  32.000 ns (0 allocations: 0 bytes)

julia> @btime f2(x)  setup=(x=rand(Int64)) evals=1;
  28.000 ns (0 allocations: 0 bytes)

So using a fixed constant basically saves you ~10% of the function eval, when the function is a trivial multiply. But of course most people write functions where much more occurs than multiplying by a constant. When the function has, say, 11 steps, computing the constant at the beginning will account for maybe 1% instead of 10%.

This doesn’t make a lot of sense to me. Raising an integer to the 6th power takes ~10% as long as a single multiplication? Really?

I am also wondering but I think the way I benchmarked it is ok. Any ideas?

There is definitely something “more” happening in the first function, not sure what :wink:

Okay, so what is the upside of losing that 1%? I mean, what does the compiler gain by not converting it to a constant? Granted, that assumes this was done for a reason. Maybe the compiler developers feel that we should multiply our own *** ***** constants? :slight_smile:

I don’t have an explanation. I just wanted to point out that reading the assembly can be a bit confusing sometimes, so benchmarking is a good idea, no matter how it looks.


It’s a good question about why it’s not statically precalculated. It seems likely this is just a kind of optimization that doesn’t yet exist or the compiler didn’t catch it in this case.
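One way to see why `^` with a literal exponent is a special case (a sketch; the exact lowered form varies by Julia version): a literal integer exponent is lowered to a call to `Base.literal_pow`, so any folding has to happen in the compiler, not in the parser.

```julia
# Lower the expression by hand: the literal exponent 6 becomes a
# Val type passed to Base.literal_pow, which in turn dispatches to
# power_by_squaring for exponents it doesn't special-case.
lowered = Meta.lower(Main, :(x * 10^6))
println(lowered)  # the printed CodeInfo mentions Base.literal_pow
```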

I’ve seen this before. There’s simply a limit to how far this literal is calculated:

julia> foo(x) = x * 10^5;

julia> bar(x) = x * 10^4;

julia> @code_native foo(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[54]:1 within `foo'
	movl	$10, %eax
	movl	$2, %ecx
	nopw	(%rax,%rax)
; │┌ @ none within `literal_pow'
; ││┌ @ none within `macro expansion'
; │││┌ @ intfuncs.jl:238 within `^'
; ││││┌ @ intfuncs.jl:226 within `power_by_squaring'
; │││││┌ @ int.jl:54 within `*'
L16:
	imulq	%rax, %rax
; │││││└
; │││││ @ intfuncs.jl:225 within `power_by_squaring'
; │││││┌ @ operators.jl:341 within `>='
; ││││││┌ @ int.jl:410 within `<='
	addq	$-1, %rcx
; │││││└└
	jg	L16
	imulq	%rdi, %rax
; │└└└└
; │┌ @ int.jl:54 within `*'
	addq	%rax, %rax
	leaq	(%rax,%rax,4), %rax
; │└
	retq
	nopw	%cs:(%rax,%rax)
; └

julia> @code_native bar(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[55]:1 within `bar'
; │┌ @ REPL[55]:1 within `*'
	imulq	$10000, %rdi, %rax      ## imm = 0x2710
; │└
	retq
	nopl	(%rax,%rax)
; └

So it works for 10^4 but not 10^5 and up.
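If you want to check where the folding stops on your own machine (a sketch; the cutoff depends on the Julia/LLVM version), you can scan the LLVM IR for a leftover runtime call to `power_by_squaring`:

```julia
using InteractiveUtils  # for code_llvm

# Define one function per literal exponent (so literal_pow applies),
# then check whether the IR still calls power_by_squaring at runtime.
for p in 3:8
    f = @eval x -> x * 10^$p
    ir = sprint(code_llvm, f, (Int,))
    println("10^$p folded to a constant: ", !occursin("power_by_squaring", ir))
end
```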


Also, I’m not sure about the benchmarking here. I tried the Ref trick instead of evals=1, since evals=1 will probably be limited by some minimum timing quantization.

julia> x = 3
3

julia> @btime foo($(Ref(x))[]);
  3.180 ns (0 allocations: 0 bytes)

julia> @btime bar($(Ref(x))[]);
  1.456 ns (0 allocations: 0 bytes)

This seems more reasonable for a simple multiplication.

So the answer, I think, is: yes, there is a difference in performance, and it’s due to a limitation in how far constant propagation goes for certain operations like power_by_squaring.

And you can solve it by using a different literal:

julia> baz(x) = x * 1_000_000;

julia> @btime baz($(Ref(x))[])
  1.456 ns (0 allocations: 0 bytes)
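Another workaround (a sketch; `MILLION` and `baz2` are names I made up): hoist the power into a `const` global, so it is evaluated once when the code is loaded and the compiler can inline the value.

```julia
# The power is computed exactly once, at definition/load time...
const MILLION = 10^6

# ...so the function body only ever sees an inlinable constant.
baz2(x) = x * MILLION

baz2(3)  # == 3_000_000
```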

Yep, good catch. I switched to evals=1 since it has less noise, but now I revise :wink:

Here’s some more fun:

julia> foo(x) = x * 10^8;

julia> @code_native foo(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[95]:1 within `foo'
; │┌ @ REPL[95]:1 within `*'
	imulq	$100000000, %rdi, %rax  ## imm = 0x5F5E100
; │└
	retq
	nopl	(%rax,%rax)
; └

julia> foo(x) = x * 10^16;

julia> @code_native foo(1)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[97]:1 within `foo'
	movabsq	$10000000000000000, %rax ## imm = 0x2386F26FC10000
; │┌ @ int.jl:54 within `*'
	imulq	%rdi, %rax
; │└
	retq
	nop
; └

julia> foo(x) = x * 10^32;

julia> @code_native foo(1)   # ooooops. overflow
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[99]:1 within `foo'
	movabsq	$-8814407033341083648, %rax ## imm = 0x85ACEF8100000000
; │┌ @ int.jl:54 within `*'
	imulq	%rdi, %rax
; │└
	retq
	nop
; └

Yeah, this is sometimes a little annoying, but one option you always have is to write a macro that just evals an expression inside the macro body; then never use it on expressions involving local variables and/or impure operations:

macro eval_at_parse_time(ex)
    # Evaluate `ex` in this module at macro-expansion time; the
    # result is spliced into the caller as a compile-time constant.
    eval(ex)
end

let x = Ref(4)
    f1(x) = x * 10^6
    f2(x) = x * 1_000_000
    f3(x) = x * @eval_at_parse_time 10^6

    @btime f1($x[])
    @btime f2($x[])
    @btime f3($x[])
end
3.359 ns (0 allocations: 0 bytes)
1.299 ns (0 allocations: 0 bytes)
1.299 ns (0 allocations: 0 bytes)