I would rather do abs, at least for floating point, and especially for Float16. While I time abs2 as fast as abs, the former compiles to more assembly instructions, so I would be suspicious of my timing. Strangely, the multiply times just as fast, even though it produces almost two screenfuls of assembly. It made me think this could be exploited; see B.
Using Float16 is as fast for me (timing with @btime) as Float64, despite more (and all integer, not floating-point) instructions. But that alone is deceiving: you’re using a quarter of the memory bandwidth, so it should be up to 4x faster when going through memory, your real bottleneck:
julia> @code_native debuginfo=:none syntax=:intel abs(Float16(1.0)) # thanks Elrod! While the debug lines can be helpful, I had no idea you could disable them, very useful to know!
mov rax, qword ptr fs:
and di, 32767
mov qword ptr [rsp - 8], rax
mov ax, di
#Even with this giving 16 instructions including call,
#and without Float16 only 7, I time as fast:
julia> @code_native debuginfo=:none syntax=:intel Float16(2.0) > 1.0
#Longer, and additional call (but not slower...) so keep in mind:
julia> @code_native Float16(2.0) > Float16(1.0)
abs is inherently very simple for floating point: it just clears the top (sign) bit. (It is not as simple for integers, and doesn’t work for typemin.) That includes the storage format Float16, which gets converted to a wider float type, I think, when you read it (except as above), and may need to be converted back, unless you’re careful, as above, to have different types on each side of >. I think you could do something clever.
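To make the "just clearing the top bit" point concrete, here is a minimal sketch that does abs on a Float16 with integer operations only. The name bitabs is made up for illustration and is not part of Base; the sketch assumes the usual IEEE half-precision layout (1 sign bit, 5 exponent bits, 10 fraction bits).

```julia
# Hypothetical helper (not in Base): abs for Float16 using only an integer AND.
# The mask 0x7fff keeps the low 15 bits and clears bit 15, the sign bit.
bitabs(x::Float16) = reinterpret(Float16, reinterpret(UInt16, x) & 0x7fff)

bitabs(Float16(-1.5)) == abs(Float16(-1.5))   # true
bitabs(Float16(-0.0)) === Float16(0.0)        # true: the sign of -0.0 is cleared too

# For integers there is no such trick, and typemin has no positive counterpart:
abs(typemin(Int8)) == typemin(Int8)           # true: the result wraps around
```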
It seems to me you could go through your array in chunks of 4 Float16 numbers and do 4 abs at once by clearing 4 sign bits at a time, one per number (maybe many more with vectorized instructions?). Of that part, at least, I’m sure. You could then eliminate 4 doSomething calls at a time on the fast path; when one or more of the four fails the fast-path test, you need four additional individual checks. This might all be too complex, and not faster, since you’re limited by memory bandwidth. But maybe not: it depends on the size of your array, and it probably only helps for small arrays that fit in cache.
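A rough sketch of the "four abs at a time" part only (the fast-path check around doSomething is left out, since it depends on your code). The function name abs4! is hypothetical, and I have not benchmarked this; it only shows that a single 64-bit AND clears the sign bits of four Float16 values.

```julia
# Hypothetical helper: in-place abs over a Float16 vector, four elements per step.
function abs4!(v::Vector{Float16})
    n4 = length(v) ÷ 4                           # number of complete 4-element chunks
    # View the first 4*n4 elements as 64-bit words; this shares memory with v.
    words = reinterpret(UInt64, view(v, 1:4 * n4))
    for i in eachindex(words)
        words[i] &= 0x7fff7fff7fff7fff           # one AND clears four sign bits
    end
    for i in (4 * n4 + 1):length(v)              # leftover 0 to 3 elements
        v[i] = abs(v[i])
    end
    return v
end

v = Float16[-1.0, 2.0, -3.5, -0.0, -5.0]
abs4!(v)   # every element now has its sign bit cleared
```

The mask is the same 16-bit pattern repeated four times, so byte order does not matter here.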
Even if not, Float16 might help.
With vectorizing, it’s important to know how many floating-point (or integer) units you have, i.e. how many instructions you can run at a time. I’m just not sure how many you have, e.g. for multiply (it depends on the CPU), and it’s probably more for floats than for ints. For (float) abs, though, I’m not sure you can have as many in flight as for multiplies. When you do abs with integer instructions, I’m pretty confident you could do as many; then you just need to make sure you’re not limited by converting back to floats, by not doing that.
What I found in a comment here:
Integer multiply is at least 3c latency on all recent x86 CPUs (and higher on some older CPUs). On many CPUs it’s fully pipelined, so throughput is 1 per clock, but you can only achieve that if you have three independent multiplies in flight. (FP multiply on Haswell is 5c latency, 0.5c throughput, so you need 10 in flight to saturate throughput).
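The numbers in that comment follow from dividing latency by the reciprocal throughput (cycles per instruction). A tiny sketch, with inflight being a made-up name for illustration:

```julia
# Independent operations needed in flight to saturate a pipelined unit:
# latency (cycles) divided by reciprocal throughput (cycles per instruction).
inflight(latency, cycles_per_op) = latency / cycles_per_op

inflight(3, 1)     # integer multiply: 3.0
inflight(5, 0.5)   # FP multiply on Haswell: 10.0
```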