In Binary Floating Point, you can multiply without multiplying

Palli · October 15, 2024, 11:28am

This is very intriguing (though I think the benefit of the methods doesn’t translate to 4-bit floating point, it may not matter as much, since likely useless for training neural networks; while useful for inference, though even then 2/3-bit integers taking over).

The proposed ℒ-Mul method will lead to a significantly reduced energy consumption for both model training and inference. […] multiplying two 32-bit floating point numbers (fp32) costs four times energy as adding two fp32 numbers, and 37 times higher cost than adding two 32-bit integers (int32). The rough energy costs for various operations are shown in Table 1. In PyTorch (Paszke et al., 2019), the default precision for accumulating tensor multiplication results is set to fp32. While I/O and control operations are not considered, approximating fp32 multiplications with int32 additions consumes only 1/37≈2.7% of the energy.

Actually their 4-bit mantissa is more accurate and faster than:

we find that ℒ-Mul is more accurate than fp8_e5m2 with evenly distributed operands. However, the weight distribution is often biased in pretrained LLMs. Based on the combined weight distribution of five popular LLMs, we find that ℒ-Mul can achieve higher precision beyond fp8_e4m3 with 5-bit mantissa operands in practice. We support both claims with estimated errors detailed in Appendix A.

2.3.2 Gate Complexity Estimation

[…]

In conclusion, the total amount of gate-level computation needed by fp8 Mul can be estimated as

Nfp16×≈584, Nfp8-e4m3×≈325, Nfp8-e5m2×≈296 (6)

ℒ-Mul consumes 1 XOR for sign prediction, 1 half adder, and k−2 full adders. The total gate count needed by 16-bit and 8-bit ℒ-Mul can be estimated as follows,

[…] Nfp16ℒ⁢-mul≈256,Nfp8ℒ⁢-mul≈157 (7)

ℒ-Mul with fp8_e4m3 and fp8_e5m2 operands have similar complexity since exponent offsets are typically implemented by 8-bit unsigned integer adders. As estimated, fp16 ℒ-Mul requires less gates than fp8 multiplications, and fp8 ℒ-Mul is significantly more efficient.

To summarize the error and complexity analysis, ℒ-Mul is both more efficient and more accurate than fp8 multiplication.

Note:

Related Work

[…]
Rounding and quantization. Standard neural network weights are stored as 32-bit or 16-bit FP tensors. However, the full-sized weights takes a considerable amount of GPU memory. To improve the storage efficiency, both weights storage and computation can be conducted in a lower precision, for example, using 16-bit, 8-bit, or 4-bit FP and Int (fp16, bf16 (Kalamkar et al., 2019), fp8-e4m3, fp8-e5m2 (Micikevicius et al., 2023), int8 (Dettmers et al., 2022), fp4, and int4 (Dettmers et al., 2024)) tensors to represent model weights.

They only compared to fp8, and with fp4 (or int4) being so small, wouldn’t a 32-byte lookup table be faster for any operation?

StevenSiew · October 17, 2024, 6:58am

My new Float64BitTools.jl module which uses the package

BitOperations.jl

Here is the source for “/Users/ssiew/juliascript/describebits/Float64BitTools.jl”

module Float64BitTools

# export debugging tools
export generate_correct_arr

# export module normal functions
export NewDescribeFloat64

export cardinal_bget, ordinal_bget
export cardinal_bset, ordinal_bset
export cardinal_bflip, ordinal_bflip
export GetFloat64asInt64Array
export GetFloat64asUInt64
export GetFloat64asBitString
export ConvertInt64Array2BitString
export ConvertBitString2Int64Array
export GetFloat64ExponentasInt64Array
export GetFloat64ExponentasBitString
export ConvertInt64Array2UInt64
export ConvertUInt64toInt64Array
export GetFloat64ExponentasRawExponent
export GetFloat64ExponentasNormalExponent
export SetFloat64ExponentfromRawExponent
export SetFloat64ExponentfromNormalExponent
export AdjustFloat64Exponent
export SetFloat64MantissafromMantissa
export GetFloat64MantissaasInt64Array
export GetFloat64MantissaasBitString

export Test_Float64BitTools

using BitOperations: bget as cardinal_bget,
                     bset as cardinal_bset,
                     bflip as cardinal_bflip,
                     BitOperations

###############################################################
#  Float64 Bit details in ordinal nums                        #
#                                                             #
#  signbit      ordinal pos 64                                #
#  exponent     ordinal pos 53 - 63   length = 11             #
#  mantissa     ordinal pos 1 - 52    length = 52             #
###############################################################

signbit_ordinal_pos = 64
exponent_ordinal_range = 53:63
mantissa_ordinal_range =  1:52
Float64ExponentBias = 1023

# Utility to convert integer to UInt64
ui(n::Integer) = convert(UInt64,n)
# Utility to convert integer to string of UInt64
function uis(n::Integer)
    arr = [ '0' for k = 1:16 ]
    str = reverse(string(n,base=16))
    for k = 1:length(str)
        arr[k] = str[k]
    end
    return "0x" * String(reverse(arr))
end

function generate_correct_arr(arr::Array{Int64,1})
    print("correct_x = Int64[")
    print(arr[1])
    for k in 2:length(arr) 
        print(",",arr[k])
    end
    println("]")
end

# an UIint64 Array of Binsry Bit Value
BinaryBitValueArray = UInt64[ 0x0000000000000002 ^ (k - 0x0000000000000001)  for k = 0x0000000000000001:0x0000000000000040]

function ordinal_bget(x::T, bit::Integer)::Bool where {T<:Integer}
    return cardinal_bget(x,bit - 1)
end

function ordinal_bset(x::T, bit::Integer, y::Bool)::T where {T<:Integer}
    return cardinal_bset(x,bit - 1,y)
end

function ordinal_bset(x::T, bit::Integer)::T where {T<:Integer}
    return cardinal_bset(x,bit - 1)
end

function ordinal_bset(x::T, bits::UnitRange{<:Integer}, y::Integer)::T where {T<:Integer}
    newrange = (bits.start-1):(bits.stop-1)
    return cardinal_bset(x,newrange)
end

function ordinal_bflip(x::T, bit::Integer)::T where {T<:Integer}
    newrange = (bits.start-1):(bits.stop-1)
    return cardinal_bflip(x,bit - 1)
end

function GetFloat64asInt64Array(x::Float64)
    local xu = reinterpret(UInt64, x)
    local result = zeros(Int64,64)
    for k = 1:64
        result[k] = ordinal_bget(xu,k)
    end
    return result
end

function GetFloat64asUInt64(x::Float64)
    local xu = reinterpret(UInt64, x)
    return xu
end

function GetFloat64asBitString(x::Float64)
    return String(    Char[ k == 0 ? '0' : '1' for k in reverse(GetFloat64asInt64Array(x))]    )
end

function ConvertInt64Array2BitString(arr::Array{Int64,1})
    return String(    Char[ k == 0 ? '0' : '1' for k in reverse(arr)]    )
end

function ConvertBitString2Int64Array(bs::String)
    return [ k == '0' ? 0 : 1  for k in reverse(bs) ]
end

function GetFloat64ExponentasInt64Array(x::Float64)
    local xu = reinterpret(UInt64, x)
    local result = zeros(Int64,length(exponent_ordinal_range))
    for k = exponent_ordinal_range
        result[k - (exponent_ordinal_range.start - 1)] = ordinal_bget(xu,k)
    end
    return result
end

function GetFloat64ExponentasBitString(x::Float64)
    return ConvertInt64Array2BitString(GetFloat64ExponentasInt64Array(x))
end

function ConvertInt64Array2UInt64(arr::Array{Int64,1})
    global BinaryBitValueArray
    result = 0x0000000000000000
    for k = 1:length(arr)
        result += arr[k] == 1 ? BinaryBitValueArray[k] : 0x0000000000000000
    end
    return result
end

function ConvertUInt64toInt64Array(u::UInt64)
    local num = u
    local arr = zeros(Int64,64)
    for k = 1:64
        arr[k] = convert(Int64,num % 2)
        num = div(num,2)
    end
    return arr
end

function GetFloat64ExponentasRawExponent(x::Float64)
    return convert(Int64,    ConvertInt64Array2UInt64(GetFloat64ExponentasInt64Array(x))    )
end

function GetFloat64ExponentasNormalExponent(x::Float64)
    return convert(Int64,GetFloat64ExponentasRawExponent(x)) - Float64ExponentBias
end

function SetFloat64ExponentfromRawExponent(x::Float64,u::UInt64)
    xu = reinterpret(UInt64, x)
    local arr = ConvertUInt64toInt64Array(u)
    for k = 1:length(exponent_ordinal_range)
        xu = ordinal_bset(xu,k+(exponent_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

function SetFloat64ExponentfromRawExponent(x::Float64,i::Int64)
    u = convert(UInt64,i)
    return SetFloat64ExponentfromRawExponent(x,u)
end

function SetFloat64ExponentfromNormalExponent(x::Float64,i::Int64)
    xu = reinterpret(UInt64, x)
    rawexpo_value = i + Float64ExponentBias
    # Now check for out of range
    if rawexpo_value < 0 
        rawexpo_value = 0
    end
    if rawexpo_value > 2047
        rawexpo_value = 2047
    end
    u = convert(UInt64,rawexpo_value)
    local arr = ConvertUInt64toInt64Array(u)
    for k = 1:length(exponent_ordinal_range)
        xu = ordinal_bset(xu,k+(exponent_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

function AdjustFloat64Exponent(x::Float64,adj::Int64)
    if isnan(x) || isinf(x)
        return x
    end
    NormalExponent = GetFloat64ExponentasNormalExponent(x)
    if NormalExponent == 1024 ||  NormalExponent == -1023
        "The number is INF or subnormal and we cannot adjust the number"
    else
        # we can adjust the number
        NormalExponent += adj 
        # Now check for subnormal 
        if NormalExponent < -1022 || NormalExponent > 1023
            NormalExponent = -1023
        end
    end
    return SetFloat64ExponentfromNormalExponent(x,NormalExponent)
end

function SetFloat64MantissafromMantissa(x::Float64,u::UInt64)
    xu = reinterpret(UInt64, x)
    local arr = ConvertUInt64toInt64Array(u)
    for k = 1:length(mantissa_ordinal_range)
        xu = ordinal_bset(xu,k+(mantissa_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

function SetFloat64MantissafromMantissa(x::Float64,i::Int64)
    u = convert(UInt64,i)
    return SetFloat64MantissafromMantissa(x,u)
end

function GetFloat64MantissaasInt64Array(x::Float64)
    local xu = reinterpret(UInt64, x)
    local result = zeros(Int64,length(mantissa_ordinal_range))
    for k = mantissa_ordinal_range
        result[k - (mantissa_ordinal_range.start - 1)] = ordinal_bget(xu,k)
    end
    return result
end

function GetFloat64MantissaasBitString(x::Float64)
    return ConvertInt64Array2BitString(GetFloat64MantissaasInt64Array(x))
end

function NewDescribeFloat64(x::Float64;verbose=false)

    # Local function vprintln
    function vprintln(args...)
        if verbose == true
            println(args...)
        end
    end


    # Local variables used in this function #
    local specialflags = false # Boolean flag
    local bs = GetFloat64asBitString(x)  # Character bit string
    local signbit = bs[1] == '1' ? 1 : 0 # Sign Bit Int64
    local expoStr = String(@view bs[2:12]) # a View of bit stream relevant to exponent
    # exponent Array{Int64,1}
    local expoArray = ConvertBitString2Int64Array(expoStr)
    local expo  = GetFloat64ExponentasRawExponent(x)
    local mantissaStr = String(@view bs[13:64]) # a View of bit stream relevant to mantissa
    # mantissa Array{Int64,1}
    local mantissaArray = ConvertBitString2Int64Array(mantissaStr)
    local mysignificand = string(significand(x)) # significand Float64
    # Strip the negative sign if second character is a number digit
    if mysignificand[1] == '-' &&  mysignificand[2] == '1' || mysignificand[2] == '0'
        mysignificand = lstrip(mysignificand,'-')
    end

    # List of special Boolean flags
    local INF_flag = false # Boolean flag
    local NaN_flag = false # Boolean flag
    local subnormal_flag = false
    # End of local variables

    println("------------------------------------------------------------")
    println("Float64 value : ",x)
    println("Raw Bitstring : ",bs,"₂")    # \_2<tab> becomes ₂
    println("======================")
    println("Sign Bit : ",bs[1],"₂")
    println("Sign Value : ",signbit)

    println("======================")
    vprintln("Expo Array is ",expoArray)
    println("Exponent Bits : ",expoStr,"₂")  
    println("Exponent raw value : ",expo)
    println("Exponent normalised value : ",expo," - $(Float64ExponentBias) = ",expo - Float64ExponentBias)

    println("======================")
    vprintln("Typeof mantissa is ",typeof(mantissaArray))
    vprintln("mantissa Array is ",mantissaArray)
    println("Mantissa : I.",bs[13:64],"₂")
    println("Significand : ",mysignificand)
 
    println("======================")
    ##############################################
    # Now check ALL conditions for special flags #
    ##############################################

    # Step 1: Checking for INF
    #     If all bits in mantissa are ZERO 
    #        and all bits in exponent are ONE
    #     then INF_flag is true
    if sum(mantissaArray) == 0 && prod(expoArray) == 1
        specialflags = true
        INF_flag = true
    end

    # Step 2: Checking for NaN
    #     If some bits in mantissa are NOT ZERO 
    #        and all bits in exponent are ONE
    #     then NaN_flag is true
    if sum(mantissaArray) > 0 && prod(expoArray) == 1
        specialflags = true
        NaN_flag = true
    end

    # Step 2: Checking for subnormal
    #     If some bits in mantissa are NOT ZERO 
    #        and all bits in exponent are ZERO
    #     then subnormal_flag is true
    if sum(mantissaArray) > 0 && sum(expoArray) == 0
        specialflags = true
        subnormal_flag = true
    end

    print("Value is ")

    if specialflags == false
      println("(-1)^",signbit," * ",mysignificand," * 2^",expo - 1023)
    else # specialflags == true
        if INF_flag
            print(signbit == 1 ? "Negative " : "Positive ")
            println("INF")
        end
        if NaN_flag
            println("NaN")
        end
        if subnormal_flag
            println("subnormal")
        end
    end

    println("Value is ",x)
    println("------------------------------------------------------------")
end

function Test_Float64BitTools()
    println("Testing Float64BitTools")
    correct_a = Int64[1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0]
    a = GetFloat64asInt64Array(1.2)
    correct_b = "0011111111110011001100110011001100110011001100110011001100110011"
    b = GetFloat64asBitString(1.2)
    println("a = GetFloat64asInt64Array(1.2)")
    println("b = GetFloat64asBitString(1.2)")
    println("a == correct_a : ",a == correct_a)
    println("b == correct_b : ",b == correct_b)
    bb = ConvertInt64Array2BitString(a)
    println("bb = ConvertInt64Array2BitString(a)")
    println("b == bb : ",b == bb)
    aa = ConvertBitString2Int64Array(bb)
    println("aa = ConvertBitString2Int64Array(bb)")
    println("a == aa : ",a == aa)
    c = GetFloat64ExponentasInt64Array(1.2)
    println("c = GetFloat64ExponentasInt64Array(1.2)")
    println(c)
    correct_c = Int64[1,1,1,1,1,1,1,1,1,1,0]
    println("c == correct_c : ",c == correct_c)
    d = GetFloat64ExponentasBitString(1.2)
    println("d = GetFloat64ExponentasBitString(1.2)")
    println(d)
    correct_d = "01111111111"
    println("d == correct_d : ",d == correct_d)
    e = ConvertInt64Array2UInt64(c)
    correct_e = 0x00000000000003ff
    println("e = ConvertInt64Array2UInt64(c)")
    println(e)
    println("e == correct_e : ",e == correct_e)
    ee = ConvertUInt64toInt64Array(e)
    println("ee = ConvertUInt64toInt64Array(e)")
    println("ee == c : ",ee == c)
    f = GetFloat64ExponentasRawExponent(1.2)
    println("f = GetFloat64ExponentasRawExponent(1.2)")
    println(f)
    correct_f = 0x00000000000003ff
    println("f == correct_f : ",f == correct_f)
    g = GetFloat64ExponentasNormalExponent(1.2)
    println("g = GetFloat64ExponentasNormalExponent(1.2)")
    println(g)
    correct_g = 0
    println("g == correct_g : ",g == correct_g)
    h = 1.2
    i = SetFloat64ExponentfromRawExponent(h,f+1)
    println("h=1.2")
    println("i=SetFloat64ExponentfromRawExponent(h,f+1)")
    println(i)
    j = SetFloat64ExponentfromNormalExponent(h,2)
    println("j = SetFloat64ExponentfromNormalExponent(h,2)")
    println(j)
    k = AdjustFloat64Exponent(h,-2)
    println("k = AdjustFloat64Exponent(h,-2)")
    println(k)
    l = SetFloat64MantissafromMantissa(h,0x000000000000000)
    println("l = SetFloat64MantissafromMantissa(h,0x000000000000000)")
    println(l)
    m = SetFloat64MantissafromMantissa(h,0)
    println("m = SetFloat64MantissafromMantissa(h,0)")
    println(m)
    n = GetFloat64MantissaasInt64Array(h)
    println("n = GetFloat64MantissaasInt64Array(h)")
    println(n)
    correct_n = Int64[1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0]
    println("n == correct_n : ",n == correct_n)
    o = GetFloat64MantissaasBitString(h)
    println("o = GetFloat64MantissaasBitString(h)")
    correct_o = "0011001100110011001100110011001100110011001100110011"
    println("o == correct_o : ",o == correct_o)
end

end # module Float64BitTools

Here is the new improved source code to prove that you can multiply without mutliplying, using Float64BitTools

using Pkg
Pkg.activate("/Users/ssiew/juliascript/describebits")

include("/Users/ssiew/juliascript/describebits/Float64BitTools.jl")

using .Float64BitTools

function square(x::Float64)
    # The square of any real number is positive
    # so we might as well treat the input as a positive number
    local remainder = abs(x)
    local result = 0.0
    while remainder > 0.0
        local n = GetFloat64ExponentasNormalExponent(remainder)
        two_to_the_n = SetFloat64MantissafromMantissa(remainder,0)
        two_to_the_2n = AdjustFloat64Exponent(two_to_the_n,n)
        result += two_to_the_2n
        # calc new remainder from old remainder
        remainder = remainder - two_to_the_n
        if remainder > 0.0
            # result = result + 2 * 2^n * remainder
            result += AdjustFloat64Exponent(remainder,n+1)
        end
    end
    return result
end

function divbyfour(x)
    return AdjustFloat64Exponent(x,-2)
end

function mymul(A,B)
    return divbyfour(square(A+B) - square(A-B))    
end

println("Now test mymul")
println("sqrt(2.0) * Float64(π)      = ",sqrt(2.0) * Float64(π))
println("mymul(sqrt(2.0),Float64(π)) = ",mymul(sqrt(2.0),Float64(π)))
println()
println()

And here is the output

  Activating project at `~/juliascript/describebits`
Now test mymul
sqrt(2.0) * Float64(π)      = 4.442882938158366
mymul(sqrt(2.0),Float64(π)) = 4.442882938158365

StevenSiew · October 25, 2024, 11:47pm

Here is my current sourcecode which uses bit manipulations to perform

squaring of a Float64 number
multiplication of two Float64 numbers

I have made some changes to improve performance but fundamentally

The algorithmn for squaring is 97% similar to the algorithmn for multiplying two Float64 numbers.

To the point that there is no reason to have a CPU that can square a float number but cannot multiple two floats together. The saving of die space is negligible.

My new Float64BitTools.jl module which uses the package

BitOperations.jl

Here is the source for “/Users/ssiew/juliascript/describebits/Float64BitTools.jl”

module Float64BitTools

# export debugging tools
export generate_correct_arr

# New DescribeFloat64 functions
export NewDescribeFloat64

# ordinal functions of BitOperations.jl
export cardinal_bget, ordinal_bget
export cardinal_bset, ordinal_bset
export cardinal_bflip, ordinal_bflip

# Get the content of an entire Float64 number
export GetFloat64asInt64Array
export GetFloat64asUInt64
export GetFloat64asBitString

# Conversion functions
export ConvertInt64ArraytoBitString
export ConvertBitStringtoInt64Array
export ConvertInt64ArraytoUInt64
export ConvertUInt64toInt64Array

# SignBit functions
export GetFloat64SignBit
export SetFloat64SignBit

# Exponent content functions
export GetFloat64ExponentasInt64Array
export GetFloat64ExponentasBitString
export GetFloat64ExponentasRawExponent
export GetFloat64ExponentasNormalExponent
export SetFloat64ExponentfromRawExponent
export SetFloat64ExponentfromNormalExponent
export AdjustFloat64Exponent

# Mantissa content functions
export SetFloat64MantissafromMantissa
export GetFloat64MantissaasInt64Array
export GetFloat64MantissaasBitString

# Unit testing
export Test_Float64BitTools

using BitOperations: bget as cardinal_bget,
                     bset as cardinal_bset,
                     bflip as cardinal_bflip,
                     BitOperations

###############################################################
#  Float64 Bit details in ordinal nums                        #
#                                                             #
#  signbit      ordinal pos 64                                #
#  exponent     ordinal pos 53 - 63   length = 11             #
#  mantissa     ordinal pos 1 - 52    length = 52             #
###############################################################

signbit_ordinal_pos = 64
exponent_ordinal_range = 53:63
mantissa_ordinal_range =  1:52
Float64ExponentBias = 1023

# Utility to convert integer to UInt64
ui(n::Integer) = convert(UInt64,n)
# Utility to convert integer to string of UInt64
function uis(n::Integer)
    arr = [ '0' for k = 1:16 ]
    str = reverse(string(n,base=16))
    for k = 1:length(str)
        arr[k] = str[k]
    end
    return "0x" * String(reverse(arr))
end

function generate_correct_arr(arr::Array{Int64,1})
    print("correct_x = Int64[")
    print(arr[1])
    for k in 2:length(arr) 
        print(",",arr[k])
    end
    println("]")
end

# an UIint64 Array of Binsry Bit Value
BinaryBitValueArray = UInt64[ 0x0000000000000002 ^ (k - 0x0000000000000001)  for k = 0x0000000000000001:0x0000000000000040]

#####################################################
# ordinal functions of BitOperations.jl             #
#####################################################

function ordinal_bget(x::T, bit::Integer)::Bool where {T<:Integer}
    return cardinal_bget(x,bit - 1)
end

function ordinal_bset(x::T, bit::Integer, y::Bool)::T where {T<:Integer}
    return cardinal_bset(x,bit - 1,y)
end

function ordinal_bset(x::T, bit::Integer)::T where {T<:Integer}
    return cardinal_bset(x,bit - 1)
end

function ordinal_bset(x::T, bits::UnitRange{<:Integer}, y::Integer)::T where {T<:Integer}
    newrange = (bits.start-1):(bits.stop-1)
    return cardinal_bset(x,newrange,y)
end

function ordinal_bflip(x::T, bit::Integer)::T where {T<:Integer}
    return cardinal_bflip(x,bit - 1)
end

function ordinal_bflip(x::T, bits::UnitRange{<:Integer})::T where {T<:Integer}
    newrange = (bits.start-1):(bits.stop-1)
    return cardinal_bflip(x,newrange)
end

#####################################################
# Get the content of an entire Float64 number       #
#####################################################

function GetFloat64asInt64Array(x::Float64)
    xu = reinterpret(UInt64, x)
    result = zeros(Int64,64)
    for k = 1:64
        result[k] = ordinal_bget(xu,k)
    end
    return result
end

function GetFloat64asUInt64(x::Float64)
    return reinterpret(UInt64, x)
end

function GetFloat64asBitString(x::Float64)
    return String(    Char[ k == 0 ? '0' : '1' for k in reverse(GetFloat64asInt64Array(x))]    )
end


#####################################################
# Conversion functions                              #
#####################################################


function ConvertInt64ArraytoBitString(arr::Array{Int64,1})
    return String(    Char[ k == 0 ? '0' : '1' for k in reverse(arr)]    )
end

function ConvertBitStringtoInt64Array(bs::String)
    return Int64[ k == '0' ? 0 : 1  for k in reverse(bs) ]
end

function ConvertInt64ArraytoUInt64(arr::Array{Int64,1})
    global BinaryBitValueArray
    result = 0x0000000000000000
    for k = 1:length(arr)
        result += arr[k] == 1 ? BinaryBitValueArray[k] : 0x0000000000000000
    end
    return result
end

function ConvertUInt64toInt64Array(u::UInt64)
    num = u
    arr = zeros(Int64,64)
    for k = 1:64
        arr[k] = convert(Int64,num % 2)
        num = div(num,2)
    end
    return arr
end

function GetFloat64ExponentasInt64Array(x::Float64)
    xu = reinterpret(UInt64, x)
    result = zeros(Int64,length(exponent_ordinal_range))
    for k = exponent_ordinal_range
        result[k - (exponent_ordinal_range.start - 1)] = ordinal_bget(xu,k)
    end
    return result
end


#####################################################
# SignBit functions                                 #
#####################################################


function GetFloat64SignBit(x::Float64)
    xu = reinterpret(UInt64, x)
    return ordinal_bget(xu,signbit_ordinal_pos)
end

function SetFloat64SignBit(x::Float64,b::Bool)
    xu = reinterpret(UInt64, x)
    xu = ordinal_bset(xu,signbit_ordinal_pos,b)
    xf = reinterpret(Float64,xu)
    return xf
end


#####################################################
# Exponent content functions                        #
#####################################################



function GetFloat64ExponentasBitString(x::Float64)
    return ConvertInt64ArraytoBitString(GetFloat64ExponentasInt64Array(x))
end


function GetFloat64ExponentasRawExponent(x::Float64)
    return convert(Int64,    ConvertInt64ArraytoUInt64(GetFloat64ExponentasInt64Array(x))    )
end

function GetFloat64ExponentasNormalExponent(x::Float64)
    return convert(Int64,GetFloat64ExponentasRawExponent(x)) - Float64ExponentBias
end

function SetFloat64ExponentfromRawExponent(x::Float64,u::UInt64)
    xu = reinterpret(UInt64, x)
    arr = ConvertUInt64toInt64Array(u)
    for k = 1:length(exponent_ordinal_range)
        xu = ordinal_bset(xu,k+(exponent_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

function SetFloat64ExponentfromRawExponent(x::Float64,i::Int64)
    u = convert(UInt64,i)
    return SetFloat64ExponentfromRawExponent(x,u)
end

function SetFloat64ExponentfromNormalExponent(x::Float64,i::Int64)
    xu = reinterpret(UInt64, x)
    rawexpo_value = i + Float64ExponentBias
    # Now check for out of range
    if rawexpo_value < 0 
        rawexpo_value = 0
    end
    if rawexpo_value > 2047
        rawexpo_value = 2047
    end
    u = convert(UInt64,rawexpo_value)
    arr = ConvertUInt64toInt64Array(u)
    for k = 1:length(exponent_ordinal_range)
        xu = ordinal_bset(xu,k+(exponent_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

function AdjustFloat64Exponent(x::Float64,adj::Int64)
    if isnan(x) || isinf(x)
        return x
    end
    NormalExponent = GetFloat64ExponentasNormalExponent(x)
    if NormalExponent == 1024 ||  NormalExponent == -1023
        "The number is INF or subnormal and we cannot adjust the number"
    else
        # we can adjust the number
        NormalExponent += adj 
        # Now check for subnormal 
        if NormalExponent < -1022 || NormalExponent > 1023
            NormalExponent = -1023
        end
    end
    return SetFloat64ExponentfromNormalExponent(x,NormalExponent)
end


#####################################################
# Mantissa content functions                        #
#####################################################


function SetFloat64MantissafromMantissa(x::Float64,arr::Array{Int64,1})
    xu = reinterpret(UInt64, x)
    for k = 1:length(mantissa_ordinal_range)
        xu = ordinal_bset(xu,k+(mantissa_ordinal_range.start - 1),arr[k]==1)
    end
    xf = reinterpret(Float64,xu)
    return xf
end

@inline function SetFloat64MantissafromMantissa(x::Float64,u::UInt64)
    arr = ConvertUInt64toInt64Array(u)
    return SetFloat64MantissafromMantissa(x,arr)
end

@inline function SetFloat64MantissafromMantissa(x::Float64,i::Int64)
    u = convert(UInt64,i)
    return SetFloat64MantissafromMantissa(x,u)
end

function GetFloat64MantissaasInt64Array(x::Float64)
    xu = reinterpret(UInt64, x)
    result = zeros(Int64,length(mantissa_ordinal_range))
    for k = mantissa_ordinal_range
        result[k - (mantissa_ordinal_range.start - 1)] = ordinal_bget(xu,k)
    end
    return result
end

function GetFloat64MantissaasBitString(x::Float64)
    return ConvertInt64ArraytoBitString(GetFloat64MantissaasInt64Array(x))
end

function NewDescribeFloat64(x::Float64;verbose=false)

    # Local function vprintln
    function vprintln(args...)
        if verbose == true
            println(args...)
        end
    end


    # Local variables used in this function #
    specialflags = false # Boolean flag
    bs = GetFloat64asBitString(x)  # Character bit string
    signbit = bs[1] == '1' ? 1 : 0 # Sign Bit Int64
    expoStr = String(@view bs[2:12]) # a View of bit stream relevant to exponent
    # exponent Array{Int64,1}
    expoArray = ConvertBitStringtoInt64Array(expoStr)
    expo  = GetFloat64ExponentasRawExponent(x)
    mantissaStr = String(@view bs[13:64]) # a View of bit stream relevant to mantissa
    # mantissa Array{Int64,1}
    mantissaArray = ConvertBitStringtoInt64Array(mantissaStr)
    mysignificand = string(significand(x)) # significand Float64
    # Strip the negative sign if second character is a number digit
    if mysignificand[1] == '-' &&  mysignificand[2] == '1' || mysignificand[2] == '0'
        mysignificand = lstrip(mysignificand,'-')
    end

    # List of special Boolean flags
    INF_flag = false # Boolean flag
    NaN_flag = false # Boolean flag
    subnormal_flag = false
    # End of local variables

    println("------------------------------------------------------------")
    println("Float64 value : ",x)
    println("Raw Bitstring : ",bs,"₂")    # \_2<tab> becomes ₂
    println("======================")
    println("Sign Bit : ",bs[1],"₂")
    println("Sign Value : ",signbit)

    println("======================")
    vprintln("Expo Array is ",expoArray)
    println("Exponent Bits : ",expoStr,"₂")  
    println("Exponent raw value : ",expo)
    println("Exponent normalised value : ",expo," - $(Float64ExponentBias) = ",expo - Float64ExponentBias)

    println("======================")
    vprintln("Typeof mantissa is ",typeof(mantissaArray))
    vprintln("mantissa Array is ",mantissaArray)
    println("Mantissa : I.",bs[13:64],"₂")
    println("Significand : ",mysignificand)
 
    println("======================")
    ##############################################
    # Now check ALL conditions for special flags #
    ##############################################

    # Step 1: Checking for INF
    #     If all bits in mantissa are ZERO 
    #        and all bits in exponent are ONE
    #     then INF_flag is true
    if sum(mantissaArray) == 0 && prod(expoArray) == 1
        specialflags = true
        INF_flag = true
    end

    # Step 2: Checking for NaN
    #     If some bits in mantissa are NOT ZERO 
    #        and all bits in exponent are ONE
    #     then NaN_flag is true
    if sum(mantissaArray) > 0 && prod(expoArray) == 1
        specialflags = true
        NaN_flag = true
    end

    # Step 2: Checking for subnormal
    #     If some bits in mantissa are NOT ZERO 
    #        and all bits in exponent are ZERO
    #     then subnormal_flag is true
    if sum(mantissaArray) > 0 && sum(expoArray) == 0
        specialflags = true
        subnormal_flag = true
    end

    print("Value is ")

    if specialflags == false
      println("(-1)^",signbit," * ",mysignificand," * 2^",expo - 1023)
    else # specialflags == true
        if INF_flag
            print(signbit == 1 ? "Negative " : "Positive ")
            println("INF")
        end
        if NaN_flag
            println("NaN")
        end
        if subnormal_flag
            println("subnormal")
        end
    end

    println("Value is ",x)
    println("------------------------------------------------------------")
end

function Test_Float64BitTools()
    println("Testing Float64BitTools")
    correct_a = Int64[1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0]
    a = GetFloat64asInt64Array(1.2)
    correct_b = "0011111111110011001100110011001100110011001100110011001100110011"
    b = GetFloat64asBitString(1.2)
    println("a = GetFloat64asInt64Array(1.2)")
    println("b = GetFloat64asBitString(1.2)")
    println("a == correct_a : ",a == correct_a)
    println("b == correct_b : ",b == correct_b)
    bb = ConvertInt64ArraytoBitString(a)
    println("bb = ConvertInt64ArraytoBitString(a)")
    println("b == bb : ",b == bb)
    aa = ConvertBitStringtoInt64Array(bb)
    println("aa = ConvertBitStringtoInt64Array(bb)")
    println("a == aa : ",a == aa)
    c = GetFloat64ExponentasInt64Array(1.2)
    println("c = GetFloat64ExponentasInt64Array(1.2)")
    println(c)
    correct_c = Int64[1,1,1,1,1,1,1,1,1,1,0]
    println("c == correct_c : ",c == correct_c)
    d = GetFloat64ExponentasBitString(1.2)
    println("d = GetFloat64ExponentasBitString(1.2)")
    println(d)
    correct_d = "01111111111"
    println("d == correct_d : ",d == correct_d)
    e = ConvertInt64ArraytoUInt64(c)
    correct_e = 0x00000000000003ff
    println("e = ConvertInt64ArraytoUInt64(c)")
    println(e)
    println("e == correct_e : ",e == correct_e)
    ee = ConvertUInt64toInt64Array(e)
    println("ee = ConvertUInt64toInt64Array(e)")
    println("ee == c : ",ee == c)
    f = GetFloat64ExponentasRawExponent(1.2)
    println("f = GetFloat64ExponentasRawExponent(1.2)")
    println(f)
    correct_f = 0x00000000000003ff
    println("f == correct_f : ",f == correct_f)
    g = GetFloat64ExponentasNormalExponent(1.2)
    println("g = GetFloat64ExponentasNormalExponent(1.2)")
    println(g)
    correct_g = 0
    println("g == correct_g : ",g == correct_g)
    h = 1.2
    i = SetFloat64ExponentfromRawExponent(h,f+1)
    println("h=1.2")
    println("i=SetFloat64ExponentfromRawExponent(h,f+1)")
    println(i)
    j = SetFloat64ExponentfromNormalExponent(h,2)
    println("j = SetFloat64ExponentfromNormalExponent(h,2)")
    println(j)
    k = AdjustFloat64Exponent(h,-2)
    println("k = AdjustFloat64Exponent(h,-2)")
    println(k)
    l = SetFloat64MantissafromMantissa(h,0x000000000000000)
    println("l = SetFloat64MantissafromMantissa(h,0x000000000000000)")
    println(l)
    m = SetFloat64MantissafromMantissa(h,0)
    println("m = SetFloat64MantissafromMantissa(h,0)")
    println(m)
    n = GetFloat64MantissaasInt64Array(h)
    println("n = GetFloat64MantissaasInt64Array(h)")
    println(n)
    correct_n = Int64[1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0]
    println("n == correct_n : ",n == correct_n)
    o = GetFloat64MantissaasBitString(h)
    println("o = GetFloat64MantissaasBitString(h)")
    correct_o = "0011001100110011001100110011001100110011001100110011"
    println("o == correct_o : ",o == correct_o)
end

end # module Float64BitTools

Here is the new improved source code to prove that you can multiply without mutliplying, using Float64BitTools

using Pkg
Pkg.activate("/Users/ssiew/juliascript/describebits")

include("/Users/ssiew/juliascript/describebits/Float64BitTools.jl")

using .Float64BitTools
using OffsetArrays

# Use the same algorithmn as square_alternative to perform
# multiplication of two floating point number x and y
function mul_alt(x::Float64,y::Float64)
    # The square of any real number is positive
    # so we might as well treat the input as a positive number
    abs_x = abs(x)

    X_Sign_Bit = GetFloat64SignBit(x)  # Boolean
    X_Normal_Exponent = GetFloat64ExponentasNormalExponent(x)
    X_Mantissa_Reverse = reverse(GetFloat64MantissaasInt64Array(x))

    Y_Sign_Bit = GetFloat64SignBit(y)  # Boolean
    Y_Normal_Exponent = GetFloat64ExponentasNormalExponent(y)
    Y_Mantissa_Reverse = reverse(GetFloat64MantissaasInt64Array(y))

    # Calculate the resultant Sign Bit
    R_Sign_Bit = xor(X_Sign_Bit,Y_Sign_Bit)

    mantissa_range = 1:52

    answer_array_range = (-1):(2*length(mantissa_range))

    guidance_template = zeros(Int64,1+length(mantissa_range))
    guidance = OffsetArray(guidance_template,0:length(mantissa_range))
    subject_template = zeros(Int64,1+length(mantissa_range))
    subject = OffsetArray(subject_template,0:length(mantissa_range))
    result = OffsetArray(subject_template,0:length(mantissa_range))

    answer_array = OffsetArray(  zeros(Int64,length(answer_array_range)),answer_array_range  )

    # Now put in the values
    guidance[0] = 1
    subject[0] = 1

    for k = mantissa_range
      guidance[k] = X_Mantissa_Reverse[k]
      subject[k]  = Y_Mantissa_Reverse[k]
    end

    # start with answer_pos = end
    answer_pos = answer_array_range.stop
    while answer_pos > answer_array_range.start

        # start with v = end
        v = mantissa_range.stop
        carry = 0
        while v >= 0
            # step 1 check if we need to check this line
            # check if we are in subject range
            if guidance[v] == 1 && answer_pos in (0+v):(mantissa_range.stop+v) && subject[answer_pos-v] == 1
                carry += 1
            end
            v -= 1
        end # while v

        k = answer_pos
        # Distribute the carry up the answer_array
        while carry > 0
            # Now we add carry to answer_array at k
            total = carry + answer_array[k]

            # Put the correct bit at k
            correct_bit = total % 2
            answer_array[k] = correct_bit

            # calculate new carry
            carry = div(total,2)
            k -= 1
        end

        answer_pos -= 1
    end # while answer_pos

    # Find position of the first 1 in answer_array
    firstone = answer_array_range.start
    while answer_array[firstone] != 1
        firstone += 1
    end

    # Code to do final rounding up here
    last_valid_pos = firstone + length(mantissa_range)
    if answer_array[last_valid_pos + 1] == 1
        # now we round up last_valid_pos
        local k = last_valid_pos
        local carry = 1

        # Distribute the carry up the answer_array
        while carry > 0
            # Now we add carry to answer_array at k
            total = carry + answer_array[k]

            # Put the correct bit at k
            correct_bit = total % 2
            answer_array[k] = correct_bit

            # calculate new carry
            carry = div(total,2)
            k -= 1
        end

    end 

    # Now we need to rediscover the first 1 in answer_array again
    firstone = answer_array_range.start
    while answer_array[firstone] != 1
        firstone += 1
    end

    # Now calculate the exponent adjustment needed
    expo_adj = 0 - firstone

    # copy to result
    for k in 0:length(mantissa_range)
        result[k] = answer_array[firstone + k]
    end

    result_reverse = zeros(Int64,length(mantissa_range))
    for k = mantissa_range
        result_reverse[k] = result[(length(mantissa_range) + 1) - k]
    end

    # And now put value back in new Float64 number
    result_Float64 = SetFloat64SignBit(abs_x,R_Sign_Bit)
    result_Float64 = SetFloat64MantissafromMantissa(result_Float64,result_reverse)
    return SetFloat64ExponentfromNormalExponent(result_Float64,X_Normal_Exponent + Y_Normal_Exponent + expo_adj)

end

# algorithmn to perform squaring a floating point number
# by manipulating the bits within it
function square_alternative(x::Float64)
    # The square of any real number is positive
    # so we might as well treat the input as a positive number
    abs_x = abs(x)
    Normal_Exponent = GetFloat64ExponentasNormalExponent(abs_x)
    Mantissa_Reverse = reverse(GetFloat64MantissaasInt64Array(abs_x))

    mantissa_range = 1:52

    answer_array_range = (-1):(2*length(mantissa_range))

    guidance_template = zeros(Int64,1+length(mantissa_range))
    guidance = OffsetArray(guidance_template,0:length(mantissa_range))
    subject_template = zeros(Int64,1+length(mantissa_range))
    subject = OffsetArray(subject_template,0:length(mantissa_range))
    result = OffsetArray(subject_template,0:length(mantissa_range))

    answer_array = OffsetArray(  zeros(Int64,length(answer_array_range)),answer_array_range  )

    # Now put in the values
    guidance[0] = 1
    subject[0] = 1

    for k = mantissa_range
      guidance[k] = Mantissa_Reverse[k]
      subject[k]  = Mantissa_Reverse[k]
    end

    # start with answer_pos = end
    answer_pos = answer_array_range.stop
    while answer_pos > answer_array_range.start

        # start with v = end
        v = mantissa_range.stop
        carry = 0
        while v >= 0
            # step 1 check if we need to check this line
            # check if we are in subject range
            if guidance[v] == 1 && answer_pos in (0+v):(mantissa_range.stop+v) && subject[answer_pos-v] == 1
                carry += 1
            end
            v -= 1
        end # while v

        k = answer_pos
        # Distribute the carry up the answer_array
        while carry > 0
            # Now we add carry to answer_array at k
            total = carry + answer_array[k]

            # Put the correct bit at k
            correct_bit = total % 2
            answer_array[k] = correct_bit

            # calculate new carry
            carry = div(total,2)
            k -= 1
        end

        answer_pos -= 1
    end # while answer_pos

    # Find position of the first 1 in answer_array
    firstone = answer_array_range.start
    while answer_array[firstone] != 1
        firstone += 1
    end

    # Code to do final rounding up here
    last_valid_pos = firstone + length(mantissa_range)
    if answer_array[last_valid_pos + 1] == 1
        # now we round up last_valid_pos
        local k = last_valid_pos
        # now we sum up working_array
        local carry = 1

        # Distribute the carry up the answer_array
        while carry > 0
            # Now we add carry to answer_array at k
            total = carry + answer_array[k]

            # Put the correct bit at k
            correct_bit = total % 2
            answer_array[k] = correct_bit

            # calculate new carry
            carry = div(total,2)
            k -= 1
        end

    end 

    # Now we need to rediscover the first 1 in answer_array again
    firstone = answer_array_range.start
    while answer_array[firstone] != 1
        firstone += 1
    end

    # Now calculate the exponent adjustment needed
    expo_adj = 0 - firstone

    # copy to result
    for k in 0:length(mantissa_range)
        result[k] = answer_array[firstone + k]
    end

    result_reverse = zeros(Int64,length(mantissa_range))
    for k = mantissa_range
        result_reverse[k] = result[(length(mantissa_range) + 1) - k]
    end

    # And now put value back in new Float64 number
    result_Float64 = SetFloat64MantissafromMantissa(abs_x,result_reverse)
    return SetFloat64ExponentfromNormalExponent(result_Float64,2*Normal_Exponent + expo_adj)

end

function square(x::Float64)
    # The square of any real number is positive
    # so we might as well treat the input as a positive number
    remainder = abs(x)
    queue = Float64[]
    while remainder > 0.0
        local n = GetFloat64ExponentasNormalExponent(remainder)
        two_to_the_n = SetFloat64MantissafromMantissa(remainder,0)
        two_to_the_2n = AdjustFloat64Exponent(two_to_the_n,n)
        pushfirst!(queue,two_to_the_2n)
        # calc new remainder from old remainder
        remainder = remainder - two_to_the_n
        if remainder > 0.0
            # result = result + 2 * 2^n * remainder
            pushfirst!(queue,AdjustFloat64Exponent(remainder,n+1))
        end
    end
    # println(queue)
    return foldl(+,queue;init=0.0)
end

function divbyfour(x)
    return AdjustFloat64Exponent(x,-2)
end

function mymul(A,B)
    return divbyfour(square_alternative(A+B) - square_alternative(A-B))    
end

println("Now test mymul")
println("sqrt(2.0) * Float64(π)        = ",sqrt(2.0) * Float64(π))
println("mymul(sqrt(2.0),Float64(π))   = ",mymul(sqrt(2.0),Float64(π)))
println("mymul(Float64(π),sqrt(2.0))   = ",mymul(Float64(π),sqrt(2.0)))
println("mul_alt(sqrt(2.0),Float64(π)) = ",mul_alt(sqrt(2.0),Float64(π)))
println("mul_alt(Float64(π),sqrt(2.0)) = ",mul_alt(Float64(π),sqrt(2.0)))
println()
println()

And here is the output

  Activating project at `~/juliascript/describebits`
Now test mymul
sqrt(2.0) * Float64(π)        = 4.442882938158366
mymul(sqrt(2.0),Float64(π))   = 4.442882938158366
mymul(Float64(π),sqrt(2.0))   = 4.442882938158366
mul_alt(sqrt(2.0),Float64(π)) = 4.442882938158366
mul_alt(Float64(π),sqrt(2.0)) = 4.442882938158366

Topic		Replies	Views
How to convert the binary form of a float to an integer, and vice versa? General Usage question , fast-math	17	1591	January 28, 2022
DoubleFloats.jl -> Double128 \| Quad64? Numerics	54	4623	December 14, 2020
Fast division by 2 Numerics	13	477	November 3, 2024
Power function with Integer ( x^4 much slower than xxx*x ) Performance	12	1467	May 18, 2018
Exponentiation of floats Performance question	25	919	July 26, 2024

In Binary Floating Point, you can multiply without multiplying

2.3.2 Gate Complexity Estimation

Related Work

Related topics