# Power function with Integer ( x^4 much slower than x*x*x*x )

#1

Hey there!
I have a question. I need to calculate the following in a hot-loop:
`ω(d, dmax) = (1-(d/dmax)^2)^4`

I benchmarked different versions of this
and found the following

``````using BenchmarkTools

d = rand(10000)
dm = rand(10000)+1

ω1(d, dmax) = (1-(d/dmax)^2)^4
@btime ω1.($d, $dm)

ω2(d, dmax) = (x= d/dmax; (1-x^2)^4 )
@btime ω2.($d, $dm)

ω3(d, dmax) = (x= d/dmax; y = 1-x^2; y^4 )
@btime ω3.($d, $dm)

ω4(d, dmax) = (x= d/dmax; y = 1-x^2; y*y*y*y)
@btime ω4.($d, $dm)
``````

``````679.700 μs (2 allocations: 78.20 KiB)
682.825 μs (2 allocations: 78.20 KiB)
703.061 μs (2 allocations: 78.20 KiB)
10.637 μs (2 allocations: 78.20 KiB)
``````

I looked at the `@code_native` output (Julia 0.6.2, started with `-O3`):

• The `x^2` is unrolled to `x*x` automatically
• `y^4` actually calls the `pow` function, which is why the manually unrolled version is so much faster, I assume.

Can anyone tell me why this is the case?
Shouldn’t the compiler do this by itself, or is there a reason why this would generally not be a good idea?

Best,
Jonas

#2

The unrolled multiplication is less accurate than the pow function. If you try some random values you can probably find values where they disagree.
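One way to check this claim empirically (a sketch in modern Julia syntax; the function names here are made up for illustration):

```julia
# `^` with a literal exponent and the hand-unrolled product round their
# intermediates differently, so they can disagree in the last bit(s).
pow4(y)    = y^4
unroll4(y) = y * y * y * y

ys = 1 .- rand(10_000) .^ 2   # same range of values as the ω example
ndiff = count(y -> pow4(y) != unroll4(y), ys)
println("disagreements: ", ndiff, " out of ", length(ys))
```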

#3

And indeed:

``````ω3_fastmath(d, dmax) = (x= d/dmax; y = 1-x^2; @fastmath y^4 )
@btime ω3_fastmath.($d, $dm)
``````

is as fast as `y*y*y*y`.

#4

From the documentation Performance Tips:

• Use @fastmath to allow floating point optimizations that are correct for real numbers, but lead to differences for IEEE numbers. Be careful when doing this, as this may change numerical results. This corresponds to the -ffast-math option of clang.

Indeed, the `@fastmath` version disagrees with the plain `^` version more often than the unrolled version does.

#5

On my computer `x^4` is almost 20 times(!) as slow as `x*x*x*x`. That’s quite dramatic. Is that sort of slowdown expected? Is it documented anywhere, and should it be in the performance tips?

I see from @Iagoba_Apellaniz that it is sort of documented, but the amount of slowdown is to me quite astonishing.

#6

Please post how you benchmarked and which Julia version you used. I get a factor of 3 difference on master.

#7

Of course:

Version 0.6.2, with the `-O3` flag.

``````julia> r = rand(1000);

julia> @btime ($r).^4;
24.955 μs (1 allocation: 7.94 KiB)

julia> @btime ($r) .* ($r) .* ($r) .* ($r);
1.306 μs (1 allocation: 7.94 KiB)
``````

For scalar:

``````julia> r = rand();

julia> @btime ($r)^4;
23.066 ns (0 allocations: 0 bytes)

julia> @btime ($r) * ($r) * ($r) * ($r);
1.739 ns (0 allocations: 0 bytes)
``````

3x slowdown sounds much more reasonable.

#8

On 0.6 for me:

``````julia> f(x) = x*x*x*x
f (generic function with 1 method)

julia> f2(x) = x^4
f2 (generic function with 1 method)

julia> @btime f2(4)
4.585 ns (0 allocations: 0 bytes)
256

julia> @btime f(4)
1.597 ns (0 allocations: 0 bytes)
256
``````

#9

I think the example deals with `Float64` numbers. I see no issues with `Int` numbers either.

UPDATE: Indeed, when using `Int`s the results are the same for all cases: pow, unrolled, and fastmath. The performance is still not the same, though:

julia> x = 2; @btime ($x)^4; @btime ($x)*($x)*($x)*($x); @btime @fastmath ($x)^4

4.044 ns (0 allocations: 0 bytes)
1.736 ns (0 allocations: 0 bytes)
3.743 ns (0 allocations: 0 bytes)
16
``````

#10

For integers, unrolling is equivalent: integer `^` is computed by exact repeated multiplication, so the results must match.
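This can be spot-checked (a sketch; `@assert` will throw on the first disagreement):

```julia
# Integer ^ is exact (power-by-squaring on machine integers), so it must
# agree with the hand-unrolled product, including wrap-around on overflow,
# since multiplication modulo 2^64 is associative.
for x in -1000:1000
    @assert x^4 == x * x * x * x
end
println("all integer cases agree")
```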

#11

Yes, there is clearly a smaller difference for `Int`s, as expected.

But there is also a surprising difference here:

``````julia> foo(x) = x^4
foo (generic function with 1 method)

julia> r = rand();

julia> @btime foo($r)
24.099 ns (0 allocations: 0 bytes)
0.831660353551182

julia> @btime foo(4.0)
10.595 ns (0 allocations: 0 bytes)
256.0
``````

#12

Oh, actually, there’s a difference here:

``````julia> @btime foo(0.4)
24.348 ns (0 allocations: 0 bytes)
0.025600000000000005

julia> @btime foo(4.0)
10.604 ns (0 allocations: 0 bytes)
256.0

``````

#13

Amusingly, it is not as fast for me on 0.7:

``````versioninfo()
Julia Version 0.7.0-DEV.5021
Commit bfb1c1b91d* (2018-05-06 00:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
``````

So let’s see the reproducer:

``````using BenchmarkTools

d = rand(10000);
dm = rand(10000)+1;

omega_3(d, dmax) = (x= d/dmax; y = 1-x^2; y^4 );
omega_3fm(d, dmax) = (x= d/dmax; y = 1-x^2; @fastmath y^4 );
omega_4(d, dmax) = (x= d/dmax; y = 1-x^2; y*y*y*y);
begin
@btime omega_3.($d,$dm);
@btime omega_3fm.($d,$dm);
@btime omega_4.($d,$dm);
end;
#  799.657 μs (2 allocations: 78.20 KiB)
#  25.416 μs (2 allocations: 78.20 KiB)
#  18.386 μs (2 allocations: 78.20 KiB)

``````

So what’s happening? Let’s look at the `@code_native`, first the good one:

``````@code_native omega_4(2.3,4.5)
.text
; Function omega_4 {
vdivsd	%xmm1, %xmm0, %xmm0

vmulsd	%xmm0, %xmm0, %xmm0
movabsq	$140486950354952, %rax  # imm = 0x7FC5AAC58008

vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
vsubsd	%xmm0, %xmm1, %xmm0

vmulsd	%xmm0, %xmm0, %xmm1
vmulsd	%xmm1, %xmm0, %xmm1

vmulsd	%xmm1, %xmm0, %xmm0

retq
nopw	(%rax,%rax)
;}
``````

``````@code_native omega_3fm(2.3,4.5)
.text
; Function omega_3fm {

vdivsd	%xmm1, %xmm0, %xmm0

vmulsd	%xmm0, %xmm0, %xmm0
movabsq	$140486950351792, %rax  # imm = 0x7FC5AAC573B0

vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
vsubsd	%xmm0, %xmm1, %xmm0
vmovsd	%xmm0, -8(%rsp)

vmulsd	%xmm0, %xmm0, %xmm0
vmulsd	%xmm0, %xmm0, %xmm0
retq
nopl	(%rax)
;}
``````

Who ordered the stupid `vmovsd %xmm0, -8(%rsp)`? We save an entire multiply but waste it on a spurious store? WTF? I’d say that is an LLVM bug / missed optimization opportunity.

But this does not matter in isolation:

``````begin
@btime omega_3fm($1.23, $1.456);
@btime omega_4($1.23, $1.456);
end;
#  2.650 ns (0 allocations: 0 bytes)
#  2.649 ns (0 allocations: 0 bytes)
``````

So let’s look at the code in context:

``````foo_3fm(dst,d,dm) = (dst .= omega_3fm.(d,dm); nothing);
foo_4(dst,d,dm) = (dst .= omega_4.(d,dm); nothing);
dst = similar(d);
@btime foo_3fm($dst,$d,$dm)
@btime foo_4($dst,$d,$dm)
#24.188 μs (0 allocations: 0 bytes)
#17.738 μs (0 allocations: 0 bytes)
#17.547 μs (0 allocations: 0 bytes)
``````

So we see that the extra multiplication does not matter at all; we are probably constrained by the division only.
We now have a simpler reproducer and can look at the code (oh my god, the new `@code_native` is annoyingly verbose; about 2/3 of it is endless, unhelpful comments about the provenance of every instruction). We continue to see all the bad writes to the stack (`vmovsd %xmm1, 8(%rsp)`), and it is clear that they don’t matter unless either the division or a write faults.

I have no idea what LLVM is thinking; is there a way to inspect the stack-unwinding code for a div-by-zero? Maybe that’s why LLVM does not simply kill the instruction during generation of native code: the optimization would have to be pretty global, changing both the exception handling and the local code, so it is more difficult to spot.

I don’t really understand why these moves are so expensive; they should never hit main memory (thanks to write combining and register renaming). Maybe someone could fire up `IACA.jl`? Or someone with a better understanding of the uarch than me could take a look?

``````@code_native foo_3fm(dst,d,dm)
.text
; Function foo_3fm {
; Location: REPL[35]:1
pushq	%rbp
pushq	%r15
pushq	%r14
pushq	%r13
pushq	%r12
pushq	%rbx
subq	$56, %rsp
vxorpd	%xmm0, %xmm0, %xmm0
vmovapd	%xmm0, 16(%rsp)
movq	$0, 32(%rsp)
movq	%rsi, 48(%rsp)
movq	%fs:0, %rax
movq	$2, 16(%rsp)
movq	-10920(%rax), %rcx
movq	%rcx, 24(%rsp)
leaq	16(%rsp), %rcx
movq	%rcx, -10920(%rax)
leaq	-10920(%rax), %rdi
movq	(%rsi), %r12
movq	8(%rsi), %rbx
movq	16(%rsi), %r15

movq	24(%r12), %rbp
movq	%rbp, %rax
sarq	$63, %rax
andnq	%rbp, %rax, %r13

movq	24(%rbx), %r14

movq	%r14, %rax
sarq	$63, %rax
andnq	%r14, %rax, %rax

cmpq	%rax, %r13
je	L146

cmpq	$1, %rax
jne	L665

L146:
movq	24(%r15), %rax

movq	%rax, %rcx
sarq	$63, %rcx
andnq	%rax, %rcx, %rax

cmpq	%rax, %r13

je	L177
cmpq	$1, %rax
jne	L734

L177:
cmpq	%rbx, %r12
je	L219

movq	(%r12), %rax
cmpq	(%rbx), %rax

jne	L219

movabsq	$jl_array_copy, %rax
movq	%rdi, %r14
movq	%rbx, %rdi
callq	*%rax
movq	%r14, %rdi
movq	%rax, %rbx
movq	24(%rbx), %r14

L219:
cmpq	%r15, %r12
je	L266

movq	(%r12), %rax
cmpq	(%r15), %rax
jne	L266

movq	%rbx, 32(%rsp)
movabsq	$jl_array_copy, %rax
movq	%rdi, 40(%rsp)

movq	%r15, %rdi
callq	*%rax
movq	40(%rsp), %rdi
movq	%rax, %r15

L266:
testq	%r13, %r13
jle	L632

testq	%rbp, %rbp
jle	L632

movq	24(%r15), %rax
cmpq	$1, %r14
jne	L379

cmpq	$1, %rax
jne	L476

xorl	%eax, %eax
movabsq	$140486950383608, %rcx  # imm = 0x7FC5AAC5EFF8
vmovsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
L320:
movq	(%rbx), %rcx
vmovsd	(%rcx), %xmm1           # xmm1 = mem[0],zero

movq	(%r15), %rcx

vdivsd	(%rcx), %xmm1, %xmm1

vmulsd	%xmm1, %xmm1, %xmm1

vsubsd	%xmm1, %xmm0, %xmm1
vmovsd	%xmm1, 8(%rsp)

vmulsd	%xmm1, %xmm1, %xmm1
vmulsd	%xmm1, %xmm1, %xmm1

movq	(%r12), %rcx
vmovsd	%xmm1, (%rcx,%rax,8)

cmpq	%rax, %r13

jne	L320
jmp	L632

L379:
cmpq	$1, %rax

jne	L553
xorl	%eax, %eax
movabsq	$140486950383608, %rcx  # imm = 0x7FC5AAC5EFF8
vmovsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
nopw	%cs:(%rax,%rax)

L416:
movq	(%rbx), %rcx
vmovsd	(%rcx,%rax,8), %xmm1    # xmm1 = mem[0],zero

movq	(%r15), %rcx

vdivsd	(%rcx), %xmm1, %xmm1

vsubsd	%xmm1, %xmm0, %xmm1
vmovsd	%xmm1, 8(%rsp)

vmulsd	%xmm1, %xmm1, %xmm1
vmulsd	%xmm1, %xmm1, %xmm1

movq	(%r12), %rcx
vmovsd	%xmm1, (%rcx,%rax,8)

cmpq	%rax, %r13

jne	L416
jmp	L632
L476:
xorl	%eax, %eax
movabsq	$140486950383608, %rcx  # imm = 0x7FC5AAC5EFF8
vmovsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
nopl	(%rax)

L496:
movq	(%rbx), %rcx
vmovsd	(%rcx), %xmm1           # xmm1 = mem[0],zero

movq	(%r15), %rcx
vdivsd	(%rcx,%rax,8), %xmm1, %xmm1

vmulsd	%xmm1, %xmm1, %xmm1

vsubsd	%xmm1, %xmm0, %xmm1
vmovsd	%xmm1, 8(%rsp)

vmulsd	%xmm1, %xmm1, %xmm1
vmulsd	%xmm1, %xmm1, %xmm1

movq	(%r12), %rcx
vmovsd	%xmm1, (%rcx,%rax,8)

cmpq	%rax, %r13

jne	L496
jmp	L632
L553:
xorl	%eax, %eax
movabsq	$140486950383608, %rcx  # imm = 0x7FC5AAC5EFF8
vmovsd	(%rcx), %xmm0           # xmm0 = mem[0],zero
nopl	(%rax)

L576:
movq	(%rbx), %rcx
vmovsd	(%rcx,%rax,8), %xmm1    # xmm1 = mem[0],zero

vdivsd	(%rcx,%rax,8), %xmm1, %xmm1

vmulsd	%xmm1, %xmm1, %xmm1

vsubsd	%xmm1, %xmm0, %xmm1
vmovsd	%xmm1, 8(%rsp)

vmulsd	%xmm1, %xmm1, %xmm1
vmulsd	%xmm1, %xmm1, %xmm1

movq	(%r12), %rcx
vmovsd	%xmm1, (%rcx,%rax,8)

jne	L576

L632:
movq	24(%rsp), %rax
movq	%rax, (%rdi)
movabsq	$140486995693576, %rax  # imm = 0x7FC5AD795008
popq	%rbx
popq	%r12
popq	%r13
popq	%r14
popq	%r15
popq	%rbp
retq

L665:
movabsq	$jl_gc_pool_alloc, %rax
movl	$1424, %esi             # imm = 0x590
movl	$16, %edx
callq	*%rax
movabsq	$jl_system_image_data, %rcx
movq	%rcx, -8(%rax)
movabsq	$jl_system_image_data, %rcx
movq	%rcx, (%rax)
movq	%rax, 32(%rsp)

movabsq	$jl_throw, %rcx
movq	%rax, %rdi
callq	*%rcx

L734:
movabsq	$jl_gc_pool_alloc, %rax
movl	$1424, %esi             # imm = 0x590
movl	$16, %edx
callq	*%rax
movabsq	$jl_system_image_data, %rcx
movq	%rcx, -8(%rax)
movabsq	$jl_system_image_data, %rcx
movq	%rcx, (%rax)
movq	%rax, 32(%rsp)

movabsq	$jl_throw, %rcx
movq	%rax, %rdi
callq	*%rcx
nopw	%cs:(%rax,%rax)
;}
``````

Postscript:
The better way is of course
`omega_5(d, dmax) = ( x= d/dmax; y = 1.0-x*x; (y*y)*(y*y));`
which saves us the extra multiply. It doesn’t matter here, though: we appear to be constrained by the divisions, unless you are hyperthreading and someone else can use the sweet execution ports we waste.
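For what it’s worth, here is a quick check (a sketch) that `omega_5` stays within rounding distance of the unrolled `omega_4`:

```julia
omega_4(d, dmax) = (x = d/dmax; y = 1 - x^2; y*y*y*y)
omega_5(d, dmax) = (x = d/dmax; y = 1.0 - x*x; (y*y)*(y*y))

d  = rand(1000)
dm = rand(1000) .+ 1
a = omega_4.(d, dm)
b = omega_5.(d, dm)
# The two only associate the multiplies differently, so any disagreement
# should be at most a few ulps of the result.
maxulps = maximum(abs.(a .- b) ./ eps.(abs.(a)))
println("max difference: ", maxulps, " ulps")
```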