The performance of saturating operations or adding intrinsics

I (we) will support saturating arithmetic in FixedPointNumbers.jl as an experimental feature until the API design is mature. (If you’re interested in the background and the API design, you can also see:

Since the addition and subtraction of fixed-point numbers is essentially the “same” as the addition and subtraction of integers, we will discuss just integer arithmetic below.
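To illustrate that equivalence concretely, here is a small sketch using a made-up Q5.3 format (an `Int8` raw value with a scale of 1/8; the helper names `to_q53`/`from_q53` are mine, not part of any package):

```julia
# Hypothetical Q5.3 fixed-point format: an Int8 raw value, scale = 1/8.
to_q53(x) = round(Int8, x * 8)    # encode a real number as a raw integer
from_q53(r) = r / 8               # decode a raw integer back to a real number

a = to_q53(1.5)    # raw value 12
b = to_q53(2.25)   # raw value 18

# Adding the raw integers *is* the fixed-point addition (same scaling):
from_q53(a + b) == 1.5 + 2.25  # true
```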

Saturating addition and subtraction can be implemented simply as follows:

using Base.Checked

function saturating_add1(x::T, y::T) where {T <: Integer}
	r, f = add_with_overflow(x, y)
	f ? (y < zero(y) ? typemin(x) : typemax(x)) : r
end

function saturating_sub1(x::T, y::T) where {T <: Integer}
	r, f = sub_with_overflow(x, y)
	f ? (y < zero(y) ? typemax(x) : typemin(x)) : r
end
julia> saturating_add1(typemax(UInt8), oneunit(UInt8))
0xff

julia> saturating_add1(typemin(Int8), -oneunit(Int8))
-128

julia> saturating_sub1(typemin(Int8), oneunit(Int8))
-128

julia> saturating_sub1(zero(Int8), typemin(Int8))
127

However, they are incredibly slow. :sob:

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000);

julia> xi8 = rand(Int8, 1000, 1000); yi8 = rand(Int8, 1000, 1000);

julia> @btime $xu8 .+ $yu8;
  85.300 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add1.($xu8, $yu8);
  434.400 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add1.($xi8, $yi8);
  679.600 μs (2 allocations: 976.70 KiB)

julia> @btime $xi8 .- $yi8;
  85.600 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub1.($xu8, $yu8);
  423.900 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub1.($xi8, $yi8);
  680.200 μs (2 allocations: 976.70 KiB)

For that reason, the implementation in FixedPointNumbers.jl has been changed as follows.

function saturating_add2(x::T, y::T) where {T <: Unsigned}
	x + min(~x, y)
end

function saturating_add2(x::T, y::T) where {T <: Signed}
	x + ifelse(x < zero(x), max(y, typemin(x) - x), min(y, typemax(x) - x))
end

function saturating_sub2(x::T, y::T) where {T <: Unsigned}
	x - min(x, y)
end

function saturating_sub2(x::T, y::T) where {T <: Signed}
	x - ifelse(x < zero(x), min(y, x - typemin(x)), max(y, x - typemax(x)))
end
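As a sanity check that the branchless rewrite agrees with the overflow-flag version, an exhaustive comparison over all `Int8` pairs is cheap (the two definitions are repeated from above so the snippet is self-contained):

```julia
using Base.Checked: add_with_overflow

# overflow-flag version (from above)
function saturating_add1(x::T, y::T) where {T <: Integer}
    r, f = add_with_overflow(x, y)
    f ? (y < zero(y) ? typemin(x) : typemax(x)) : r
end

# branchless version for Signed (from above)
function saturating_add2(x::T, y::T) where {T <: Signed}
    x + ifelse(x < zero(x), max(y, typemin(x) - x), min(y, typemax(x) - x))
end

# exhaustive check over all 65536 Int8 pairs
@assert all(saturating_add1(x, y) == saturating_add2(x, y)
            for x in typemin(Int8):typemax(Int8), y in typemin(Int8):typemax(Int8))
```

(Note that `ifelse` evaluates both branches; the discarded branch may wrap around, but since it is never selected this does not affect the result.)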

However, the code for Signed is still slow because it does not use the hardware’s saturating instructions. :confused:

julia> @btime saturating_add2.($xu8, $yu8);
  95.500 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_add2.($xi8, $yi8);
  185.001 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub2.($xu8, $yu8);
  109.000 μs (2 allocations: 976.70 KiB)

julia> @btime saturating_sub2.($xi8, $yi8);
  154.700 μs (2 allocations: 976.70 KiB)

Could we improve this without using low-level features like direct access to LLVM?

I think it might be worth supporting LLVM’s saturation arithmetic intrinsics in Julia v1.6.
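For reference, calling such an intrinsic directly is already possible via `Base.llvmcall`. A minimal sketch for `i8` (the intrinsic `@llvm.sadd.sat.i8` is real LLVM, but the wrapper name `sadd_sat` is made up here):

```julia
# Sketch: wrap LLVM's @llvm.sadd.sat.i8 intrinsic (wrapper name is hypothetical)
sadd_sat(x::Int8, y::Int8) = Base.llvmcall(
    ("declare i8 @llvm.sadd.sat.i8(i8, i8)",
     """
     %r = call i8 @llvm.sadd.sat.i8(i8 %0, i8 %1)
     ret i8 %r
     """),
    Int8, Tuple{Int8, Int8}, x, y)

sadd_sat(Int8(100), Int8(100))  # saturates at 127
```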

BTW, what has happened on nightly over the past month?

julia> versioninfo()
Julia Version 1.6.0-DEV.1274
Commit 444aa87348 (2020-10-17 22:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

julia> @btime $xu8 .+ $yu8;
  795.400 μs (2 allocations: 976.70 KiB)

It would probably be good to open an issue about that.

I filed an issue. It’s probably related to LLVM 11, but I don’t know the specific cause of the problem.

Of course, slowing down on nightly is a separate issue from this main topic.


If providing support for LLVM’s saturation arithmetic intrinsics helps the performance of saturating integer arithmetic, I see no reason not to add that support. Additionally, saturating arithmetic is useful in machine learning (be sure to cover the smaller integer types).


Isn’t that already possible using llvmcall?

Yes. This requires VectorizationBase’s master branch:

julia> using VectorizationBase, BenchmarkTools

julia> VectorizationBase.saturated_add(0x03, 0x08)
0x0b

julia> VectorizationBase.saturated_add(0x0f, 0xfc)
0xff

julia> @benchmark VectorizationBase.saturated_add($(Ref(0x0f))[], $(Ref(0xfc))[])
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     1.102 ns (0.00% GC)
  median time:      1.106 ns (0.00% GC)
  mean time:        1.112 ns (0.00% GC)
  maximum time:     10.967 ns (0.00% GC)
  samples:          10000
  evals/sample:     1000

Yes, we can use it via llvmcall, as @Elrod showed with a practical example.

However, the reason for opening this topic at this time is that Julia v1.6 is the next LTS candidate and its feature freeze is coming up. That’s also the basis for my (admittedly selfish) self-imposed restriction of avoiding low-level features:

LLVM offers Saturation Arithmetic Intrinsics for [un]signed +, -, and <<, but does not support a saturated *. :roll_eyes: I noticed there are LLVM Intrinsics for Fixed Point Arithmetic supporting [un]signed * and /.

  • Is there a way to use that to obtain Saturated * ?
  • Is there a way to use the Saturated Intrinsics to obtain Fixed Point +, - ?
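Regarding the first question: a fixed-point intrinsic with a scale of 0 has no fractional bits, so `@llvm.smul.fix.sat` with scale 0 should behave as a saturating integer *. A hedged `llvmcall` sketch (the wrapper name `smul_sat` is made up, and this assumes an LLVM version that provides the intrinsic):

```julia
# Sketch: saturating signed multiplication via LLVM's fixed-point intrinsic
# with a scale of 0 (no fractional bits). Wrapper name is hypothetical.
smul_sat(x::Int8, y::Int8) = Base.llvmcall(
    ("declare i8 @llvm.smul.fix.sat.i8(i8, i8, i32)",
     """
     %r = call i8 @llvm.smul.fix.sat.i8(i8 %0, i8 %1, i32 0)
     ret i8 %r
     """),
    Int8, Tuple{Int8, Int8}, x, y)

smul_sat(Int8(100), Int8(2))  # saturates at 127
```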

Very nice! :+1:

I consider LoopVectorization.jl to be a “semi-standard” library, so I don’t see any problem with that feature being added to VectorizationBase.jl. Saturating arithmetic, however, is a concept orthogonal to vectorization.

I have (or had) plans to add the saturating_* functions to CheckedArithmeticCore until Core, Base, or the stdlib supports saturating arithmetic (and to keep them available on older Julia versions once it is supported).

This is an aside, but FixedPointNumbers#master implements the saturating_* functions rather than saturated_*. The names are taken from Rust. Between checked_* and checking_*, I think the former is obviously more appropriate. (Rust uses overflowing_* for checked arithmetic, and Rust’s experimental API also supports *_with_overflow, though.) However, I’m not sure which is better, saturating_* or saturated_*. (I prefer saturating_* because we cannot determine whether the result will or will not be saturated until the evaluation.)

I don’t think the name itself is very important (as we can use aliases :smile:), but I think we need to be cautious about the name collision because saturation arithmetic is a very “intrinsic” concept.

This isn’t a direct answer to your questions, but it is a comment from a practical perspective.

First, the SIMD instruction sets of x86 CPUs (and probably many ARM CPUs as well) do not support (simple) saturating multiplication. Unless the bit width is wide, it is fast to do a widening multiplication (widemul) and then clamp the result.
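That widen-then-clamp approach can be sketched as follows (the function name `saturating_mul` is my own, not part of any package):

```julia
# Sketch: saturating multiplication via widening multiply + clamp.
# `widemul` computes the exact product in a wider type (e.g. Int8 -> Int16).
function saturating_mul(x::T, y::T) where {T <: Union{Int8, Int16, Int32, UInt8, UInt16, UInt32}}
    w = widemul(x, y)
    clamp(w, typemin(T), typemax(T)) % T   # clamp, then truncate back to T
end

saturating_mul(Int8(100), Int8(2))    # 127 (saturated)
saturating_mul(Int8(-100), Int8(2))   # -128 (saturated)
saturating_mul(UInt8(20), UInt8(10))  # 0xc8 (exact; 200 fits in UInt8)
```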

Secondly, as I wrote in the OP, the addition and subtraction of fixed-point numbers (with the same scaling) is identical to the addition and subtraction of integers. Fixed-point types are natively supported on (many) ARM CPUs, but not on x86 CPUs.

Thank you, that information is helpful.