The performance of saturating operations or adding intrinsics

kimikage · October 18, 2020, 6:07am

I (we) will support saturating arithmetic in FixedPointNumbers.jl as an experimental feature until the API design is mature. (The background and API design are discussed elsewhere.)

Since the addition and subtraction of fixed-point numbers is essentially the “same” as the addition and subtraction of integers, we will discuss just integer arithmetic below.

Saturating addition and subtraction can be implemented simply as follows:

using Base.Checked

function saturating_add1(x::T, y::T) where {T <: Integer}
	r, f = add_with_overflow(x, y)
	f ? (y < zero(y) ? typemin(x) : typemax(x)) : r
end

function saturating_sub1(x::T, y::T) where {T <: Integer}
	r, f = sub_with_overflow(x, y)
	f ? (y < zero(y) ? typemax(x) : typemin(x)) : r
end

julia> saturating_add1(typemax(UInt8), oneunit(UInt8))
0xff

julia> saturating_add1(typemin(Int8), -oneunit(Int8))
-128

julia> saturating_sub1(typemin(Int8), oneunit(Int8))
-128

julia> saturating_sub1(zero(Int8), typemin(Int8))
127

However, they are very slow: in the benchmarks below they take roughly 5x to 8x as long as plain broadcast addition and subtraction.

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000);

julia> xi8 = rand(Int8, 1000, 1000); yi8 = rand(Int8, 1000, 1000);

julia> @btime $xu8 .+ $yu8;
  85.300 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add1.($xu8, $yu8);
  434.400 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add1.($xi8, $yi8);
  679.600 μs (2 allocations: 976.70 KiB)

julia> @btime $xi8 .- $yi8;
  85.600 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub1.($xu8, $yu8);
  423.900 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub1.($xi8, $yi8);
  680.200 μs (2 allocations: 976.70 KiB)

For that reason, the implementation in FixedPointNumbers.jl has been changed as follows.
(cf. Add checked, wrapping and saturating arithmetic for add/sub/neg by kimikage · Pull Request #190 · JuliaMath/FixedPointNumbers.jl · GitHub)

function saturating_add2(x::T, y::T) where {T <: Unsigned}
	x + min(~x, y)
end

function saturating_add2(x::T, y::T) where {T <: Signed}
	x + ifelse(x < zero(x), max(y, typemin(x) - x), min(y, typemax(x) - x))
end

function saturating_sub2(x::T, y::T) where {T <: Unsigned}
	x - min(x, y)
end

function saturating_sub2(x::T, y::T) where {T <: Signed}
	x - ifelse(x < zero(x), min(y, x - typemin(x)), max(y, x - typemax(x)))
end

However, the code for Signed is still slow because it does not use the hardware saturating instructions.

julia> @btime saturating_add2.($xu8, $yu8);
  95.500 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add2.($xi8, $yi8);
  185.001 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub2.($xu8, $yu8);
  109.000 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub2.($xi8, $yi8);
  154.700 μs (2 allocations: 976.70 KiB)
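The branch-free versions are less obviously correct than the `*_with_overflow` ones, so an exhaustive check over the 8-bit types may be reassuring. The following is a minimal sketch, not part of FixedPointNumbers.jl: the `saturating_*2` definitions are repeated from above so the block runs on its own, and `reference` and `agrees_everywhere` are hypothetical helper names introduced only for this illustration.

```julia
# Definitions repeated from the post above so that this block is self-contained.
saturating_add2(x::T, y::T) where {T <: Unsigned} = x + min(~x, y)
saturating_add2(x::T, y::T) where {T <: Signed} =
    x + ifelse(x < zero(x), max(y, typemin(x) - x), min(y, typemax(x) - x))
saturating_sub2(x::T, y::T) where {T <: Unsigned} = x - min(x, y)
saturating_sub2(x::T, y::T) where {T <: Signed} =
    x - ifelse(x < zero(x), min(y, x - typemin(x)), max(y, x - typemax(x)))

# Hypothetical reference: compute in Int (wide enough for 8-bit inputs),
# clamp to the range of T, then convert back to T.
reference(op, x::T, y::T) where {T} =
    clamp(op(Int(x), Int(y)), Int(typemin(T)), Int(typemax(T))) % T

# Compare every pair of values of the (small) integer type T.
function agrees_everywhere(::Type{T}) where {T}
    for x in typemin(T):typemax(T), y in typemin(T):typemax(T)
        saturating_add2(x, y) === reference(+, x, y) || return false
        saturating_sub2(x, y) === reference(-, x, y) || return false
    end
    return true
end

agrees_everywhere(UInt8) && agrees_everywhere(Int8)  # true
```

Only 65,536 pairs per type are involved, so the check is cheap. For wider types the same idea would need random sampling instead of full enumeration.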

Could we improve this without using low-level features like direct access to LLVM?

I think it might be worth supporting LLVM’s saturation arithmetic intrinsics in Julia v1.6.

BTW, what has happened on nightly over the past month?

julia> versioninfo()
Julia Version 1.6.0-DEV.1274
Commit 444aa87348 (2020-10-17 22:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

julia> @btime $xu8 .+ $yu8;
  795.400 μs (2 allocations: 976.70 KiB)

kristoffer.carlsson · October 18, 2020, 11:45am

It would probably be good to open an issue about that.

kimikage · October 18, 2020, 2:36pm

I filed an issue. It’s probably related to LLVM11, but I don’t know the specific cause of the problem.

github.com/JuliaLang/julia: Performance regression in broadcasting with CartesianIndices on v1.6.0-DEV
(opened 01:54PM - 18 Oct 20 UTC by kimikage · closed 06:46AM - 21 Jan 21 UTC · label: performance regression)

Although I haven't identified the cause, I've noticed ~10x slowdown in simple broadcasting operations on nightly.

```julia
julia> versioninfo() # just the last version I had cached, not bisected
Julia Version 1.6.0-DEV.1117
Commit 36effbe10a (2020-10-02 17:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

julia> xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000);

julia> @btime $xu8 .+ $yu8;
  85.700 μs (2 allocations: 976.70 KiB)
```

```julia
julia> versioninfo()
Julia Version 1.6.0-DEV.1274
Commit 444aa87348 (2020-10-17 22:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

julia> @btime $xu8 .+ $yu8;
  795.400 μs (2 allocations: 976.70 KiB)

julia> xu8a = rand(UInt8, 1000 * 10, 1000); yu8a = rand(UInt8, 1000 * 10, 1000);

julia> @btime $xu8a .+ $yu8a; # roughly proportional to the number of elements
  11.190 ms (2 allocations: 9.54 MiB)

julia> xu8b = rand(UInt8, 1000 ÷ 10, 1000); yu8b = rand(UInt8, 1000 ÷ 10, 1000);

julia> @btime $xu8b .+ $yu8b;
  86.100 μs (2 allocations: 97.77 KiB)
```

`@code_native` result for "1-D" arrays (It's somewhat misleading. See comments below.)

```
julia> @code_native broadcast(+, UInt8[], UInt8[]) # (**Edit:** FOR 1-D ARRAYS) at least it's SIMD vectorized.
...snip...
; ││││┌ @ simdloop.jl:77 within `macro expansion' @ broadcast.jl:932
; │││││┌ @ broadcast.jl:575 within `getindex'
; ││││││┌ @ broadcast.jl:620 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:644 within `_getindex' @ broadcast.jl:645
; ││││││││┌ @ broadcast.jl:614 within `_broadcast_getindex'
; │││││││││┌ @ array.jl:809 within `getindex'
L1216:
        vmovdqu (%rcx,%rbx), %ymm0
        vmovdqu 32(%rcx,%rbx), %ymm1
        vmovdqu 64(%rcx,%rbx), %ymm2
        vmovdqu 96(%rcx,%rbx), %ymm3
; ││││││└└└└
; ││││││┌ @ broadcast.jl:621 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:648 within `_broadcast_getindex_evalf'
; ││││││││┌ @ int.jl:87 within `+'
        vpaddb  (%r9,%rbx), %ymm0, %ymm0
        vpaddb  32(%r9,%rbx), %ymm1, %ymm1
        vpaddb  64(%r9,%rbx), %ymm2, %ymm2
        vpaddb  96(%r9,%rbx), %ymm3, %ymm3
; │││││└└└└
; │││││┌ @ array.jl:847 within `setindex!'
        vmovdqu %ymm0, (%rdx,%rbx)
        vmovdqu %ymm1, 32(%rdx,%rbx)
        vmovdqu %ymm2, 64(%rdx,%rbx)
        vmovdqu %ymm3, 96(%rdx,%rbx)
; ││││└└
; ││││┌ @ simdloop.jl:78 within `macro expansion'
; │││││┌ @ int.jl:87 within `+'
        subq    $-128, %rbx
        cmpq    %rbx, %rdi
        jne     L1216
; ││││└└
...snip...
```

~The most noticeable difference is the LLVM version (i.e. 10 vs 11), but I have no evidence that the LLVM 11 is the cause at the moment.~

Of course, slowing down on nightly is a separate issue from this main topic.

JeffreySarnoff · October 18, 2020, 5:21pm

If providing support for LLVM’s saturation arithmetic intrinsics helps the performance of saturating integer arithmetic, I see no reason not to add that support. Additionally, saturating arithmetic is useful in machine learning (be sure to cover the smaller integer types).

simeonschaub · October 18, 2020, 11:15pm

Isn’t that already possible using llvmcall?

Elrod · October 18, 2020, 11:33pm

Yes. This requires VectorizationBase’s master branch:

julia> using VectorizationBase, BenchmarkTools

julia> VectorizationBase.saturated_add(0x03, 0x08)
0x0b

julia> VectorizationBase.saturated_add(0x0f, 0xfc)
0xff

julia> @benchmark VectorizationBase.saturated_add($(Ref(0x0f))[], $(Ref(0xfc))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.102 ns (0.00% GC)
  median time:      1.106 ns (0.00% GC)
  mean time:        1.112 ns (0.00% GC)
  maximum time:     10.967 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
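For comparison, the LLVM intrinsics can also be reached without any package through `ccall` with the `llvmcall` calling convention. The following is a minimal sketch for the 8-bit types only, assuming an LLVM version that provides `llvm.sadd.sat`, `llvm.uadd.sat`, `llvm.ssub.sat` and `llvm.usub.sat`. The function names `sat_add_intrinsic` and `sat_sub_intrinsic` are hypothetical, and this is not claimed to be how VectorizationBase implements `saturated_add`.

```julia
# Hypothetical wrappers around LLVM's saturating intrinsics (8-bit only).
# The intrinsic name must be a literal string, so each type gets its own method.
sat_add_intrinsic(x::Int8, y::Int8) =
    ccall("llvm.sadd.sat.i8", llvmcall, Int8, (Int8, Int8), x, y)
sat_add_intrinsic(x::UInt8, y::UInt8) =
    ccall("llvm.uadd.sat.i8", llvmcall, UInt8, (UInt8, UInt8), x, y)
sat_sub_intrinsic(x::Int8, y::Int8) =
    ccall("llvm.ssub.sat.i8", llvmcall, Int8, (Int8, Int8), x, y)
sat_sub_intrinsic(x::UInt8, y::UInt8) =
    ccall("llvm.usub.sat.i8", llvmcall, UInt8, (UInt8, UInt8), x, y)

sat_add_intrinsic(typemax(Int8), oneunit(Int8))  # expected to saturate at 127
sat_sub_intrinsic(0x00, 0x01)                    # expected to saturate at 0x00
```

Whether broadcasting these over arrays vectorizes to the hardware saturating instructions depends on the Julia and LLVM versions, so it is worth checking with `@code_native` before relying on it.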

kimikage · October 19, 2020, 1:25am

Yes, we can use it via llvmcall, as @Elrod’s practical example shows.

However, the reason for opening this topic now is that Julia v1.6 is the next LTS candidate and its feature freeze is coming up. That is the basis for my admittedly selfish restriction.

JeffreySarnoff · October 19, 2020, 2:04am

LLVM offers Saturated Arithmetic Intrinsics for [un]signed (+, -, <<) and does not support saturated *. I noticed there are LLVM Intrinsics for Fixed Point Arithmetic supporting [un]signed (*, /).

Is there a way to use that to obtain Saturated * ?
Is there a way to use the Saturated Intrinsics to obtain Fixed Point +, - ?

kimikage · October 19, 2020, 2:18am

Very nice!

I consider LoopVectorization.jl to be a “semi-standard” library, so I don’t see any problem with that feature being added to VectorizationBase.jl. Saturating arithmetic, however, is a concept orthogonal to vectorization.

I have (or I had) plans to add saturating_* to the CheckedArithmeticCore until Core, Base or stdlib supports saturating arithmetic (and for it to be available in past versions once it is supported).
cf. Add `saturating_*` and `wrapping_*` API definitons to `CheckedArithmeticCore` · Issue #9 · JuliaMath/CheckedArithmetic.jl · GitHub

This is an aside, but FixedPointNumbers#master implements the saturating_* functions instead of saturated_*. The names are taken from Rust. Between checked_* and checking_*, I think the former is obviously more appropriate. (Rust uses overflowing_* for checked arithmetic, and Rust’s experimental API supports *_with_overflow, though.) However, I’m not sure which is better, saturating_* or saturated_*. (I prefer saturating_* because we cannot tell whether the result will be saturated until the operation is evaluated.)

I don’t think the name itself is very important (as we can use aliases), but I think we need to be cautious about name collisions because saturation arithmetic is a very “intrinsic” concept.

kimikage · October 19, 2020, 2:42am

This isn’t a direct answer to your questions, but it is a comment from a practical perspective.

First, the SIMD instruction set of x86 CPUs (and probably many ARM CPUs as well) does not support (so-called, or simple) saturating multiplication. Unless the bit width is wide, it is fast to multiply speculatively in a wider type (widemul) and then clamp the result.
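The widen-then-clamp approach can be sketched as follows. This is an illustration only: `saturating_mul_sketch` is a hypothetical name, and the method is deliberately restricted to types of at most 32 bits so that `widemul` stays within native integer types.

```julia
# Hypothetical sketch: saturating multiplication via a widening multiply.
const NarrowInt = Union{Int8, Int16, Int32, UInt8, UInt16, UInt32}

function saturating_mul_sketch(x::T, y::T) where {T <: NarrowInt}
    wide = widemul(x, y)                          # exact product in the wider type
    clamp(wide, typemin(T), typemax(T)) % T       # clamp to T's range, then narrow
end

saturating_mul_sketch(Int8(100), Int8(2))     # 127
saturating_mul_sketch(Int8(-128), Int8(127))  # -128
saturating_mul_sketch(0x10, 0x10)             # 0xff
```

After clamping, the value always fits in `T`, so the final `% T` conversion never discards significant bits.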

Secondly, as I wrote in the OP, the addition and subtraction of fixed point numbers (with the same scaling) is identical to the addition and subtraction of integers. Fixed-point types are supported on (many) ARM CPUs, but are not natively supported on x86 CPUs.
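To illustrate why fixed-point addition with a shared scaling reduces to integer addition, here is a minimal sketch with a toy type. `Q7Sketch`, `value` and `sat_add_raw` are hypothetical names made up for this example and are not the FixedPointNumbers.jl API. The type stores a raw `Int8` that is interpreted as `raw / 2^7`.

```julia
# Toy signed fixed-point type with 7 fractional bits (hypothetical).
struct Q7Sketch
    raw::Int8
end

# Real value represented by the raw integer.
value(q::Q7Sketch) = q.raw / 128

# Saturating integer addition, written with widening for clarity rather than speed.
sat_add_raw(a::Int8, b::Int8) =
    clamp(widen(a) + widen(b), typemin(Int8), typemax(Int8)) % Int8

# Both operands share the same scaling, so adding the raw integers is enough.
saturating_add(a::Q7Sketch, b::Q7Sketch) = Q7Sketch(sat_add_raw(a.raw, b.raw))

a = Q7Sketch(Int8(96))        # 0.75
b = Q7Sketch(Int8(64))        # 0.5
value(saturating_add(a, b))   # 0.9921875, i.e. 127/128 instead of 1.25
```

Any faster integer implementation could be substituted for `sat_add_raw` without changing the fixed-point layer.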

JeffreySarnoff · October 19, 2020, 4:37am

Thank you, that information is helpful.
