Fast division by 2

I noticed that Julia seems not to optimize division by two for floating-point and integer values. I know that in C++ there are a couple of fast functions used in these specific cases, so I wrote the equivalents:

const EXP16 = 10  # significand bits of Float16
const EXP32 = 23  # significand bits of Float32
const EXP64 = 52  # significand bits of Float64

const SHIFT16 = UInt16(1) << EXP16
const SHIFT32 = UInt32(1) << EXP32
const SHIFT64 = UInt64(1) << EXP64

# Division, floored toward negative infinity if the number is odd
function fast_division_by_2(x::Integer)::Integer
    return x >> 1
end

# Division
function fast_division_by_2(x::Float16)::Float16
    return (x != 0) * reinterpret(Float16, reinterpret(Int16, x) - SHIFT16)
end
function fast_division_by_2(x::Float32)::Float32
    return (x != 0) * reinterpret(Float32, reinterpret(Int32, x) - SHIFT32)
end
function fast_division_by_2(x::Float64)::Float64
    return (x != 0) * reinterpret(Float64, reinterpret(Int64, x) - SHIFT64)
end

The difference is significant when used with @fastmath. I compared the results using 1000 evaluations over samples of 10^5 elements.
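The benchmarks were set up roughly like this (a minimal sketch; the exact suite is not reproduced here, and the map! kernel is illustrative):

using BenchmarkTools

suite = BenchmarkGroup()
x = rand(Float64, 10^5)
suite["normal"] = @benchmarkable map!(y -> y / 2, out, $x) setup = (out = similar($x))
suite["fast"]   = @benchmarkable map!(fast_division_by_2, out, $x) setup = (out = similar($x))
results = run(suite; evals = 1000)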

julia> results["divi8"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(132.278 μs)
  "fast" => Trial(10.285 μs)

julia> results["divi16"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(313.996 μs)
  "fast" => Trial(31.273 μs)

julia> results["divi32"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(128.252 μs)
  "fast" => Trial(43.865 μs)

julia> results["divi64"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(210.600 μs)
  "fast" => Trial(165.764 μs)

julia> results["divf16"]

2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(21.816 μs)
  "fast" => Trial(19.050 μs)

julia> results["divf32"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(50.990 μs)
  "fast" => Trial(41.615 μs)

julia> results["divf64"]
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(104.193 μs)
  "fast" => Trial(88.767 μs)

There is some performance to gain by checking the integer representation for zero instead:

xi = reinterpret(Int64, x)
reinterpret(Float64, !iszero(xi) * (xi - SHIFT64))

However, since 0.0 and -0.0 have different bit patterns, -0.0 passes the check and the subtraction turns it into Inf, forcing a check for that elsewhere.
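For example:

julia> reinterpret(Float64, reinterpret(Int64, -0.0) - SHIFT64)
Inf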

1 Like

What led you to believe that this isn’t being optimized? I highly suspect you’re not benchmarking what you think you’re benchmarking.

2 Likes

For the integers I checked the native code:
without fastmath it always converts the integer to a float and applies the division operation. With fastmath… it moves registers back and forth and I got lost in the operations.

For floats without fastmath, it does the operation as expected; with fastmath the code becomes much harder to read, but it does not operate on the exponent.

I also tested both configurations, with and without fastmath, and the results seem consistent:

julia> results["divf64"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(154.385 μs) #slower
  "normal_fastmath" => Trial(153.451 μs)
  "fast" => Trial(140.025 μs) #fastest
  "fast_fastmath" => Trial(140.205 μs)

julia> results["divf32"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(86.868 μs) #slower
  "normal_fastmath" => Trial(65.621 μs)
  "fast" => Trial(62.029 μs) #fastest
  "fast_fastmath" => Trial(63.097 μs)

julia> results["divf16"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(30.033 μs) 
  "normal_fastmath" => Trial(29.204 μs)
  "fast" => Trial(30.256 μs) #slower
  "fast_fastmath" => Trial(28.611 μs) #fastest

The big improvement is obviously on the integers, with the catch that if the number is odd the result is floored, which in some cases can be useful (converting to float and performing a division is obviously more expensive than shifting bits).


julia> results["divi8"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(148.345 μs)
  "normal_fastmath" => Trial(141.941 μs)
  "fast" => Trial(33.139 μs)
  "fast_fastmath" => Trial(14.183 μs)

julia> results["divi16"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(110.312 μs)
  "normal_fastmath" => Trial(129.946 μs)
  "fast" => Trial(27.827 μs)
  "fast_fastmath" => Trial(27.497 μs)

julia> results["divi32"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(151.752 μs)
  "normal_fastmath" => Trial(132.851 μs)
  "fast" => Trial(37.417 μs)
  "fast_fastmath" => Trial(67.089 μs)

julia> results["divi64"]
4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "normal" => Trial(224.674 μs)
  "normal_fastmath" => Trial(220.752 μs)
  "fast" => Trial(190.965 μs)
  "fast_fastmath" => Trial(158.879 μs)

I do not understand why the performance on Int64 is so much worse for this implementation compared to Int8, Int16 and Int32.

1 Like

What is the actual code you’re running? What architecture are you on? What Julia version are you using?

It sounds like your integer division isn’t really doing / division, which, yeah.

2 Likes

n / 2 is defined as a floating-point division in Julia. If you want truncating integer division, use div(n, 2) or equivalently n ÷ 2. However, this is not equivalent to >> 1 either, because they behave differently for signed inputs:

julia> -15 / 2
-7.5

julia> -15 ÷ 2
-7

julia> -15 >> 1
-8

On the other hand, for unsigned inputs, UInt(15) ÷ 2 gets optimized to a shift.
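You can check this in the REPL (output abbreviated; the exact IR may differ across Julia versions and architectures):

julia> g(n) = n ÷ 2;

julia> @code_llvm debuginfo=:none g(UInt64(15))
;  the body reduces to a single logical shift right:
;    %1 = lshr i64 %0, 1
;    ret i64 %1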

10 Likes

Oooh cool, I didn’t check the unsigned one. That is good, it’s much more useful!

Julia 1.11 on x64

I did it mostly for the floating-point division because that was the operation I needed.

But as @stevengj pointed out, div is the one using shifting for integers :slight_smile:

I’d still be quite surprised if you can do meaningfully better than x/2 without completely wrecking the correctness of the operation. Julia compiles that down to a multiplication by 0.5 for floating point numbers — something your processor is quite good at doing.

But I still don’t know what you’re doing, so I can only guess. It’s not hard to minimize this into a reproducible REPL snippet:

julia> using Base: IEEEFloat, inttype, significand_bits

julia> div2(x) = x/2
div2 (generic function with 1 method)

julia> fastdiv2(x::T) where {T<:IEEEFloat} = (x!=0)*reinterpret(T, reinterpret(inttype(T), x) - (one(inttype(T))<<significand_bits(T)))
fastdiv2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime map!(div2, x, x) setup=x=rand(10^5);
  14.042 μs (0 allocations: 0 bytes)

julia> @btime map!(fastdiv2, x, x) setup=x=rand(10^5);
  16.041 μs (0 allocations: 0 bytes)
6 Likes

Could you please tell us what you have actually done? I get multiplication, not division.

f(x) = x / 2
julia> @code_native f(3)
        push    rbp
        movabs  rax, offset .LCPI0_0
        vcvtsi2sd       xmm0, xmm0, rdi
        mov     rbp, rsp
        vmulsd  xmm0, xmm0, qword ptr [rax]
        pop     rbp
        ret
julia> @code_native f(3.2)
        push    rbp
        movabs  rax, offset .LCPI0_0
        mov     rbp, rsp
        vmulsd  xmm0, xmm0, qword ptr [rax]
        pop     rbp
        ret
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver1)
Threads: 16 default, 0 interactive, 8 GC (on 16 virtual cores)
2 Likes

My function simply subtracts one from the exponent field of an IEEE number to divide it by 2 (the constant 2^52 below is the lowest exponent bit of a Float64). If that were all, it would be faster than the division; however, I need to check for Inf and NaN.

   %1 = bitcast double %0 to i64                ; reinterpret as the integer type
   %2 = add i64 %1, -4503599627370496           ; subtract 2^52, i.e. decrement the exponent field
   %3 = and i64 %2, 9218868437227405312         ; mask the exponent bits (0x7FF0000000000000)
   %.not = icmp eq i64 %3, 9218868437227405312  ; is the exponent all ones after the subtraction?
   %4 = select i1 %.not, i64 0, i64 %2          ; select 0 if so
   %5 = bitcast i64 %4 to double                ; reinterpret back to double
   ret double %5                                ; return
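In plain Julia the same branchless logic would look roughly like this (a sketch for Float64 only; the name fastdiv2_checked is illustrative):

function fastdiv2_checked(x::Float64)
    xi   = reinterpret(Int64, x) - (Int64(1) << 52)  # decrement the exponent field
    mask = Int64(0x7FF0000000000000)                 # exponent bits
    bad  = (xi & mask) == mask                       # exponent all ones after the subtraction?
    return reinterpret(Float64, ifelse(bad, zero(Int64), xi))
end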

When I played around with the function div2(x) = x/2, the compiler automatically turns it into a multiplication for both Float32 and Float64:

   %1 = fmul double %0, 5.000000e-01
   ret double %1

For Float16, however, to avoid losing precision, Julia first converts it to Float32, does the multiplication there, and then truncates the result back to Float16:

   %1 = fpext half %0 to float
   %2 = fmul float %1, 5.000000e-01
   %3 = fptrunc float %2 to half
   ret half %3

Due to the presence of the check, the code runs slower than the normal division for Float64.

Mine

BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   9.433 μs … 94.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.300 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.000 μs ±  3.675 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Julia

BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   8.333 μs … 167.133 μs  ┊ GC (min … max): 0.00% … 0.00%   
 Time  (median):     13.233 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.871 μs ±   3.662 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%   

For Float32 (on a 64-bit AMD CPU) they are pretty close, but Julia wins.
Mine
Mine

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.400 μs … 31.137 μs  ┊ GC (min … max): 0.00% … 0.00%     
 Time  (median):     4.237 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.747 μs ±  1.240 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%     

Julia

BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.087 μs … 43.788 μs  ┊ GC (min … max): 0.00% … 0.00%     
 Time  (median):     4.638 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.955 μs ±  1.531 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%     

However, for Float16, converting to Float32 and back makes the built-in operation extremely slow (and probably not as precise as dividing by decrementing the exponent).

Mine

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.611 μs … 28.189 μs  ┊ GC (min … max): 0.00% … 0.00%     
 Time  (median):     2.833 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.362 μs ±  1.263 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%  

Julia

BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.117 μs …  31.983 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.433 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.576 μs ± 855.204 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%    

The main reason I was looking into this was curiosity (when I was testing the inlined function, Julia seemed to use a division by two instead of multiplying by the inverse, which, at least on older machines, was a much slower operation).

It was fun to play around with it and with the LLVM IR, but the more I look into it the more I see the flaws. For example, using Float64 you cannot divide anything smaller than ~2e-308 (the smallest normal number), because the exponent field underflows.
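For example, with the fastdiv2 definition from the earlier snippet:

julia> fastdiv2(floatmin(Float64))   # the exponent field underflows
0.0

julia> floatmin(Float64) / 2         # the correct (subnormal) result
1.1125369292536007e-308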

And where is the source code for this?

2 Likes

Or using LLVM.jl:


using LLVM.Interop
using Base: IEEEFloat, uinttype, significand_bits
using LLVM
using LLVM.API: LLVMIntPredicate
@generated function fastdiv2(x::T) where {T<:IEEEFloat}
    @dispose ctx=Context() begin
        Tshift = uinttype(T)(significand_bits(T))

        eltyp = convert(LLVMType, T)
        I = uinttype(T)
        midtyp = convert(LLVMType, I)

        Oshift = one(I) << Tshift
        Val = reinterpret(I, typemax(T)) >> Tshift
        paramtyps = [eltyp]
        f, ft = create_function(eltyp, paramtyps)

        # Generate IR
        @dispose builder=IRBuilder() begin
            entry = BasicBlock(f, "entry")
            position!(builder, entry)
            # Bitcast the input from eltyp to midtyp
            a = bitcast!(builder, parameters(f)[1], midtyp)
            a = sub!(builder, a, ConstantInt(Oshift))     # decrement the exponent field
            b = bitcast!(builder, a, midtyp)              # keep a copy of the result
            a = lshr!(builder, a, ConstantInt(Tshift))    # move sign + exponent down
            a = and!(builder, a, ConstantInt(Val))        # keep only the exponent bits
            a = icmp!(builder, LLVM.API.LLVMIntULT, a, ConstantInt(Val))  # exponent not all ones?
            a = zext!(builder, a, midtyp)                 # i1 -> 0/1 in the integer type
            b = mul!(builder, b, a)                       # zero the result if the check failed
            b = bitcast!(builder, b, eltyp)               # back to the float type
            ret!(builder, b)
        end
        return call_function(f, T, Tuple{T}, :x)
    end
end

It does the same thing as the Julia code (I checked the code_llvm), but this way I am sure it is doing what I want.
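For instance:

julia> fastdiv2(3.0)
1.5

julia> fastdiv2(Float32(10))
5.0f0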

1 Like

It’s not about losing precision, but because your hardware very likely literally wouldn’t be able to do those operations natively, so those conversions are necessary. If you tried to run the same functions on a CPU with native Float16 support (like AVX512-FP16, Apple silicon, ARM A64FX, ARM Neoverse, or many other CPUs I can’t recall right now), you’d see that there are no intermediate conversions; all operations are done natively on Float16 registers, because those do exist.

Side note: you keep posting while ignoring all the requests to share the Julia code you’re trying to run (I’m assuming that’s not the LLVM one you just posted); it’s pretty hard for anyone to follow what you’re doing and give any meaningful advice.

5 Likes

Oh no, it’s exactly the code I was trying to run, nothing else. I wanted to plug it into another function, but that’s the code I wrote and was testing. I wasn’t trying to dodge the question; I just got curious when, trying some operations in a script, the code wasn’t optimized into a multiplication.

If the question is WHY I was trying to “optimize” a division by 2: it is to implement Po-Shen Loh’s method to compute the roots of a quadratic equation.

Po-Shen Loh’s paper

There was a Computerphile video that came out last year showing an implementation that requires fewer divisions (actually a single one) and a couple of multiplications:

Is this New Quadratic Formula Actually Faster than the OG Quadratic Formula? (For a Computer)

This is the full context. I was also developing some code for a problem that eventually reduces to a quadratic equation, so performing this operation fast was something I was interested in.
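For reference, the heart of the method for a monic quadratic x^2 + b*x + c = 0 is a single division plus a square root (a minimal sketch; the function name is illustrative and real roots are assumed):

function quadratic_roots_loh(b, c)
    m = -b / 2            # midpoint of the two roots: the single division
    u = sqrt(m^2 - c)     # half the distance between the roots
    return (m - u, m + u)
end

quadratic_roots_loh(-5.0, 6.0)  # -> (2.0, 3.0)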