What is faster, abs(x) or x^2?

I need to compare many values to one. What would be faster:

Option A:

for i = 1 : N 
    if abs(arr[i]) > someValue
        doSomething(i);
    end
end

Option B:

someValueSquared = someValue^2;

for i = 1 : N 
    if arr[i] ^ 2 > someValueSquared
        doSomething(i);
    end
end

In my case this is an inner loop over the same big array, and the comparison runs WAY more often than doSomething(i). Which one should I choose?

The operations themselves are too fast to benchmark directly with BenchmarkTools.
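For reference, here is a minimal sketch of how the two whole-loop variants could be timed (the 5M-element vector and the counting stand-in for doSomething are placeholders, not my real data):

using BenchmarkTools

arr = randn(5_000_000)   # placeholder data
someValue = 2.0          # placeholder threshold

optionA(arr, someValue) = count(x -> abs(x) > someValue, arr)

function optionB(arr, someValue)
    someValueSquared = someValue^2
    return count(x -> x^2 > someValueSquared, arr)
end

@btime optionA($arr, $someValue)
@btime optionB($arr, $someValue)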

Hi, @azadoroz, welcome to this forum!

Actually, with a large enough array it can be measured:

using BenchmarkTools

function main() # It is good to measure performance in functions
    M = rand(5_000_000); # Random vector of 5M values
    result = similar(M); # Reserve memory for the output
    @btime map!(abs, $result, $M)     # Time for abs (local variables must be interpolated with $)
    @btime map!(x -> x^2, $result, $M)  # Time for ^2
    return # Same as `return nothing`, to avoid returning the whole vector
end
main()

Actually, the times are very similar, so it does not matter. Anyway, each of them does a completely different operation. If you are considering this for measuring errors or distances, it is better to use specific packages (though doing it yourself can be a great way to learn).

4 Likes

If you are working with complex numbers then it’s better to use abs2
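Roughly, for a complex z, abs2(z) is real(z)^2 + imag(z)^2 without the square root that abs needs, so for threshold tests you can compare against the squared threshold. A small illustrative snippet:

z = 3.0 + 4.0im

abs(z)    # 5.0  (needs a square root / hypot)
abs2(z)   # 25.0 (just real(z)^2 + imag(z)^2, no square root)

# so you can square the threshold once instead:
someValue = 4.5
abs2(z) > someValue^2    # same answer as abs(z) > someValue, but cheaper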

6 Likes

Hi, @dmolina, thank you!

You are right, it is all about measuring distances, but my question is not about calculating them (there is pretty much nothing new to be done there, squares are squares); the part I was concerned about is the comparison.

Still, it seems weird to me. Squaring means calling a function that computes a power, while x*x would be a plain multiplication; OK, Julia knows that my exponent is an integer, so those two could be similar. But abs should just check a certain bit and flip it if it's set (or simply clear it, if that's faster), so it ought to be a completely different story. I did run a similar benchmark, but I assumed something else was eating the time, like memory access, and that this was why the timings came out so similar.

function calculateAbs(arr)
    c = zeros(length(arr))
    for i in eachindex(arr)
        c[i] = abs(arr[i])
    end
    return c
end

function calculateSquare(arr)
    c = zeros(length(arr))
    for i in eachindex(arr)
        c[i] = arr[i]^2
    end
    return c
end

julia> using Statistics  # for mean

julia> abs_res = @benchmark calculateAbs(a);

julia> mean(abs_res.times)
5.469088572e8

julia> sqr_res = @benchmark calculateSquare(a);

julia> mean(sqr_res.times)
5.373044501e8

Less than 2% difference… It’s just suspicious.

Hi, @Skoffer, no, these are just real numbers, but thank you, I haven’t seen abs2 before. There are complex numbers in the project, so probably I’ll use it.

abs(x) and x*x actually both compile down to a simple assembly instruction, and both are implemented in hardware, so neither of them should take longer than a single clock cycle. There might have to be an additional mov instruction for the multiplication, since you have to fill two registers instead of one, but level 1 cache is so much faster than your main system memory, and modern CISC architectures are really efficient at handling that sort of memory access, so I wouldn't expect this to really make a difference. Generally, for simple O(n) loops like this, you are mostly limited by memory access, so the potential for optimization is pretty limited if all your data is stored in a contiguous array.
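For instance, a rough way to see the memory-bound effect (the array sizes here are arbitrary placeholders): the same reduction typically costs noticeably more per element once the data no longer fits in cache.

using BenchmarkTools

sumsq(a) = sum(x -> x^2, a)

small = rand(1_000)        # ~8 KB, fits comfortably in L1 cache
large = rand(10_000_000)   # ~80 MB, has to stream from main memory

@btime sumsq($small)
@btime sumsq($large)
# dividing each time by the number of elements usually shows a clearly higher
# per-element cost for the large array -- that part is memory access, not abs or ^2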

8 Likes

@simeonschaub, oh, right. I forgot that squaring is implemented in the processor. Thanks, now it all makes sense. One clock cycle for each, 64-bit architecture and 64-bit floats, so there is no reason for them to differ.

Oh, I forgot to explain that literal integer powers are actually special-cased when Julia lowers the code: powers of 0, 1, 2, and 3 are handled directly (2 and 3 become plain multiplications) and don't fall back to the generic power-by-squaring algorithm, which would be slower for these small powers. So x^2 in Julia is actually equivalent to x*x.
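A quick way to check this (Base.literal_pow is an internal, shown here only for illustration):

x = 2.3

# with a literal exponent, `x^2` is lowered to a call like this,
# whose Val(2) method is defined as x*x:
Base.literal_pow(^, x, Val(2)) == x * x   # true

# the lowered form itself can be inspected with:
#   Meta.@lower x^2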

7 Likes

You are measuring mostly memory access and loop overhead with these benchmarks. Here is an example, which “proves” that abs and ^2 are free:

a = rand(1000)

function fs(a)
  r = 0;
  for x in a
    r += x^2
  end
  return r
end

function fa(a)
  r = 0;
  for x in a
    r += abs(x)
  end
  return r
end

function fn(a)
  r = 0;
  for x in a
    r += x
  end
  return r
end

julia> @benchmark fs($a)
  minimum time:     1.091 μs (0.00% GC)

julia> @benchmark fa($a)
  minimum time:     1.075 μs (0.00% GC)

julia> @benchmark fn($a)
  minimum time:     1.079 μs (0.00% GC)

These three functions produce virtually identical timings on my PC. When you add a bit more work inside the loop, you will start seeing that ^2 is more expensive than abs:

function fs(a)
  r = 0;
  for x in a
    r += ((x^2-r)^2-x)^2
  end
  return r
end

function fa(a)
  r = 0;
  for x in a
    r += abs(abs(abs(x)-r)-x)
  end
  return r
end

julia> @benchmark fs($a)
  minimum time:     5.404 μs (0.00% GC)

julia> @benchmark fa($a)
  minimum time:     4.062 μs (0.00% GC)

On Intel 9th Gen. CPUs, the FABS instruction (floating-point absolute value) is a bit less computationally expensive than FMUL (floating-point multiply). The difference heavily depends on context, but it shows in the benchmarks above.

13 Likes

I’m not sure that’s a fair benchmark, since fs returns Inf, whereas fa returns 0.

My comment assumes a relatively recent x86 CPU with AVX2.

Note this isn’t actually needed:

; julia> @code_native debuginfo=:none syntax=:intel abs2(2.3)
        .text
        vmulsd  xmm0, xmm0, xmm0
        ret
        nop     word ptr cs:[rax + rax]
        nop

It multiplies the contents of xmm0 (second argument) with xmm0 (third argument), storing the result in xmm0 (the first argument). That is, instructions can reuse the same register as often as they want.

but level 1 cache is so much faster than your main system memory, and modern CISC architectures are really efficient at handling that sort of memory access

However, sometimes copies are in fact needed. E.g., some instructions don’t give a choice and have to overwrite one of the arguments they use. In that case, L1 cache is still unnecessary, because vmovapd can move the contents of one register into another register directly (making a copy), without touching memory.

To reinforce what @pauljurczak said, here is a large table with assembly instructions for different x86 CPUs. It includes latency and (reciprocal) throughput.
You would need to consider both of these, as well as whether your application can take advantage of SIMD.

If you check the @code_native on most CPUs, you'll see that they actually use vandps rather than fabs for floating-point abs, even in the scalar case.
Similarly, they'll use vmulsd for the scalar multiply (and vmulpd for SIMD).
The multiplication instructions typically have a latency of 4 clock cycles and a reciprocal throughput of 0.5 (to 1). This means it takes 4 clock cycles for an instruction to complete after it starts, but if you're doing a lot of them, on average one completes every half cycle (meaning 2 are completed per cycle).

vandps, on the other hand, has a latency of only 1 clock cycle and a reciprocal throughput of 0.33. That means three of them can finish every clock cycle, and it only takes a single cycle after starting one before you can use its result as an argument to another instruction.
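To make the latency/throughput distinction concrete, here is a toy sketch of my own (not taken from the instruction tables): a single dependency chain is limited by latency, while several independent chains can overlap and approach the throughput limit.

using BenchmarkTools

# one long dependency chain: each multiply must wait for the previous result,
# so the loop runs at roughly one multiply per *latency* (~4 cycles)
function chain(x, n)
    acc = 1.0
    for _ in 1:n
        acc *= x
    end
    return acc
end

# four independent chains: the CPU can keep several multiplies in flight,
# so the loop approaches the *throughput* limit instead
function chains4(x, n)
    a = b = c = d = 1.0
    for _ in 1:(n ÷ 4)
        a *= x; b *= x; c *= x; d *= x
    end
    return a * b * c * d
end

# same number of multiplies in both, e.g.:
#   @btime chain(1.0000001, 4_000)
#   @btime chains4(1.0000001, 4_000)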

EDIT:
But to dpsanders’ point, benchmarking is tricky.

7 Likes

I was afraid you would stumble onto this thread and destroy my half-witted beliefs about how a processor works :grin:. Thanks for the very enlightening reply; as always, an awesome answer!

1 Like

That is a great answer. On my machine, there appears to be a 9% penalty for x^2 compared to abs(x), both multi-threaded and not.

It could be. I just wrote an expression complex enough to prevent the compiler from eliding it. Sometimes, it can be surprisingly clever about analyzing expressions.

Here is a version where both functions return zero, but the compiler is not clever enough to elide it:

a = rand(1000)

function fs(a)
  r = 0;
  for x in a
    r += x^2 - (x-r)^2 
  end
  return r
end

function fa(a)
  r = 0;
  for x in a
    r += abs(x) - abs(x-r)
  end
  return r
end

The timing is:

julia> @benchmark fs($a)
  minimum time:     4.408 μs (0.00% GC)

julia> @benchmark fa($a)
  minimum time:     3.584 μs (0.00% GC)

2 Likes

I would rather use abs, at least for floating point, and especially for Float16. While I time abs2 as fast as abs, the former has more assembly instructions, so I would be suspicious of my timing. Strangely, I time the multiply as just as fast, even though it produces almost two screenfuls of assembly.

It made me think this could be exploited; see B below.

A.
Using Float16 is as fast for me (timing with @btime) as Float64, despite more instructions (and all integer, not floating-point, ones). But this alone would be deceiving: since you're using a quarter of the memory bandwidth, it should be 4x faster when going through memory, your real bottleneck:

julia> @code_native debuginfo=:none syntax=:intel abs(Float16(1.0))  # thanks Elrod! While the debug lines can be helpful, I had no idea you could disable them, very useful to know!
	.text
	mov	rax, qword ptr fs:[0]
	and	di, 32767
	mov	qword ptr [rsp - 8], rax
	mov	ax, di
	ret

# Even with this giving 16 instructions, including a call,
# and without Float16 only 7, I time it just as fast:

julia> @code_native debuginfo=:none syntax=:intel Float16(2.0) > 1.0

# Longer, with an additional call (but not slower...), so keep in mind:

julia> @code_native Float16(2.0) > Float16(1.0)

julia> abs(typemin(Int8))
-128

B.
Since abs is inherently very simple for floating point (not so much for integers, where it doesn't work for typemin), just clearing the top bit, I think you could do something clever. That includes the storage-format Float16, which I think gets converted to Float64 when you read it (except as above) and may need to be converted back, unless you're careful, as above, to have different types on each side of >.

It seems to me you could go through your array in chunks of 4 Float16 numbers and do 4 abs at once by clearing 4 sign bits at a time (maybe many more with vectorized instructions?); I'm sure of that part at least. You could then eliminate 4 doSomething calls at a time on the fast path (and when the fast path fails for one or more of them, fall back to four additional checks). This might all be too complex, and not faster, since you're limited by memory bandwidth. But maybe not, depending on the size of your array; it would probably only help for small arrays that fit in cache.
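Something like this rough sketch of the masking part (my own illustration; the compiler may already vectorize a plain abs. just as well, and memory bandwidth can dominate either way):

# abs of a single Float16 by clearing its sign bit:
abs16(h::Float16) = reinterpret(Float16, reinterpret(UInt16, h) & 0x7fff)

# the same trick on four packed Float16s at once, through a UInt64 view:
function abs4!(out::Vector{Float16}, v::Vector{Float16})
    @assert length(out) == length(v) && length(v) % 4 == 0
    vu = reinterpret(UInt64, v)     # four Float16 values per UInt64 word
    ou = reinterpret(UInt64, out)
    @inbounds for i in eachindex(vu)
        ou[i] = vu[i] & 0x7fff_7fff_7fff_7fff   # clear the four sign bits
    end
    return out
end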

Even if not, Float16 might help.

With vectorizing, it's important to know how many floating-point (or integer) units you have, i.e. how many instructions you can run at a time. I'm just not sure how many you have, e.g. for multiply (it depends on the CPU), and it's probably more for floats than for ints. For (float) abs, I'm not sure you can have as many in flight as multiplies. When you can do abs with integer instructions, I'm pretty confident you could do just as many; you then only need to make sure you're not limited by converting back to floats, by not doing that.

What I found in a comment here:

Integer multiply is at least 3c latency on all recent x86 CPUs (and higher on some older CPUs). On many CPUs it’s fully pipelined, so throughput is 1 per clock, but you can only achieve that if you have three independent multiplies in flight. (FP multiply on Haswell is 5c latency, 0.5c throughput, so you need 10 in flight to saturate throughput).

1 Like

Did anyone look at what Julia actually generates? I would expect floating point abs to generate a MOV and an AND. I presume floating point square (or multiply) would be slower but I am certainly no expert.

Here

julia> @code_native (x -> x^2)(1.0)
	.text
; ┌ @ REPL[3]:1 within `#9'
; │┌ @ intfuncs.jl:261 within `literal_pow'
; ││┌ @ REPL[3]:1 within `*'
	vmulsd	%xmm0, %xmm0, %xmm0
; │└└
	retq
	nopw	%cs:(%rax,%rax)
	nop
; └

julia> @code_native (x -> abs(x))(1.0)
	.text
; ┌ @ REPL[4]:1 within `#11'
	movabsq	$140061467028288, %rax  # imm = 0x7F6299FD2B40
; │┌ @ float.jl:528 within `abs'
	vandps	(%rax), %xmm0, %xmm0
; │└
	retq
	nop
; └

2 Likes