Help defining a custom IBMFloat32 primitive type

I need to consume binary data from an old file format that uses 4-byte IBM floating point as opposed to IEEE floating point (i.e., Float32). The 4-byte IBM floating point type is also known as IBM HFP (hexadecimal floating point):

By reading the Julia documentation, I tried to define the following primitive type with a conversion method to Float32:

"""
    IBMFloat32 <: AbstractFloat

32-bit (4-byte) IBM floating point number type (a.k.a. IBM HFP).
See https://en.wikipedia.org/wiki/IBM_hexadecimal_floating-point.

The 4 bytes stored in the type are organized as follows:

[S | EEEEEEE | FFFFFFFF | FFFFFFFF | FFFFFFFF]

S is the sign bit, EEEEEEE is the 7-bit exponent (base 16, excess-64),
and FFFFFFFF (x3) are the 24 bits (3 bytes) of the fraction.

If S = 1, the number is negative; if S = 0, the number is positive.

The value represented is given by

    (-1)^S Ɨ (FFFFFFFF|FFFFFFFF|FFFFFFFF) Ɨ 16^(EEEEEEE - 64)

If the fraction is zero, the value is zero regardless of the exponent or sign.
"""
primitive type IBMFloat32 <: AbstractFloat 32 end

function convert(::Type{Float32}, x::IBMFloat32)
  # extract 4 bytes from IBMFloat32
  b1, b2, b3, b4 = reinterpret(NTuple{4,UInt8}, x)

  # compute fraction from last 3 bytes
  F = reinterpret(UInt32, (zero(UInt8), b2, b3, b4))

  # if fraction is zero, value is zero
  iszero(F) && return zero(Float32)

  # compute sign from first bit
  S = b1 >> 7 # shift first bit all the way to the right

  # compute exponent from remaining 7 bits
  E = b1 & 0b01111111 # knock out first bit in byte

  Float32((-1)^S * F * 16^(E - 64))
end

I am not sure if the implementation is correct yet. In order to test the implementation I need to read bytes from an IO stream as IBMFloat32. I tried the following:

io = IOBuffer("abcdefghijklmnopqrstuvwxyz")

ibmfloat = read(io, IBMFloat32)

but the method failed with

ERROR: The IO stream does not support reading objects of type IBMFloat32.

What do I need to implement to make this work?

You just need to implement the read method:

Base.read(io::IO, ::Type{IBMFloat32}) = reinterpret(IBMFloat32, read(io, UInt32))
1 Like

I don’t think this conversion is feasible, the base in IBM hexadecimal floating point is 16 instead of 2 (the intermediate powers of 2 are presumably represented by the significand with more leading zeros, effectively less precision), so it has a much wider range (10^-79 to 10^75) compared to Float32 (10^-45 to 10^38). Maybe Float64? The link does say

in order to have similar exponent range in binary, 9 exponent bits would be required.

2 Likes

Should I define a method to convert to Float64 instead? Do you have other suggestions to improve the implementation @Benny ?

You cannot raise an integer to a negative power. If you replace 16 by 16.0, it still gives wrong results.

Here is a (not very polished) implementation that works directly with the bit representations. I’m using UInt32 as input.

function f(x::UInt32)
    f0 = x & 0x00ffffff
    iszero(f0) && return Float64(0)
    e0 = x & 0x7f000000
    s0 = x & 0x80000000
    k = leading_zeros(f0)-8
    f1 = (Int64(f0) << (28 + k + 1)) & 0x000fffffffffffff
    e1 = (Int64(e0) >> 22 - 4*64 - k - 1 + 1023) << 52
    s1 = Int64(s0) << 32
    reinterpret(Float64, s1 | e1 | f1)
end

Examples from the Wikipedia page:

julia> x = 0b1_1000010_011101101010000000000000; f(x)
-118.625

julia> x = 0b0_1111111_111111111111111111111111; f(x)
7.2370051459731155e75

julia> x = 0b0_0000000_000100000000000000000000; f(x)
5.397605346934028e-79
3 Likes

Thank you @matthias314. Could you please comment on the implementation? How do you come up with those hexadecimal literals?

s0, e0, f0 are the bit fields of the sign, the exponent and the fraction in the input. s1, e1 and f1 are those for the output. k is the number of leading zeros in the 24-bit number f0. You need it to adjust the exponent and the fraction. For example, if k == 2, then the (binary) fraction starts with 0.001, but you want it to start with 1., which is a shift by k+1 positions. Int64(e0) >> 22 is 4 times the exponent. (We have to change from base 16 to base 2.) You have to subtract 4*64 and k+1 and add 1023 to get the right bias. Int64(f0) << (28 + k + 1) moves the fraction to the correct position in Float64 (changing from width 24 to 52). However, now the most significant bit set in f0 is at position 2^52. Since it is not stored in Float64, you have to clear it, which I do with & 0x000fffffffffffff.

Does this help?

2 Likes

It certainly helps! I will study it carefully. Thank you.

Unless there’s a smaller binary format with 9 exponent bits and 24 explicit significand bits, then probably. Bonus performance from Float64 operations being implemented in hardware.

I’d also agree with starting from matthias314’s f(x::UInt32). There’s no need to make a new type IBMFloat32 just to read bits to begin with; starting at UInt32 is enough, though you wouldn’t use the function or method convert(::T, x::UInt32) to reinterpret it. It also makes more sense to manipulate bits and reinterpret directly to the output type, rather than implementing the formula (-1)^S * F * 16^(E - 64) with numeric types that don’t need to be involved. I would also point out that the link’s formula is (āˆ’1)^{sign} Ɨ 0.<significand> Ɨ 16^{exponentāˆ’64}, meaning the significand’s bits represent a binary fraction <1, not F::UInt32.

Implementing the IBMFloat32 type, its read, and its convert would be more worth it if you also implement its particular operations, and so far it doesn’t appear you intend to emulate IBM hexadecimal computation and its notably wobbling precision.

1 Like

I agree with @Benny. Apart from the missing factor 2^{-24}, I think the bit order is wrong in the OP code. Here is a modification that should work:

function juliohm(x::UInt32)
  # extract 4 bytes from IBMFloat32
  b1, b2, b3, b4 = reinterpret(NTuple{4,UInt8}, x)

  # compute fraction from first 3 bytes
  F = reinterpret(UInt32, (b1, b2, b3, zero(UInt8)))

  # if fraction is zero, value is zero
  iszero(F) && return zero(Float64)

  # compute sign from highest bit
  S = b4 >> 7 # shift highest bit all the way to the right

  # compute exponent from remaining 7 bits
  E = b4 & 0b01111111 # knock out first bit in byte

  Float64((-1)^S * F * 2.0^(-24) * 16.0^(E - 64))
end
1 Like

Waded through some OS software and strange char representations to this NumPy ufunc library for taking IBM hexadecimals read as UInt32/UInt64 and converting to either Float32 or Float64, along with the aforementioned rounding errors. Might help to translate or wrap the 4 C functions before the NumPy wrapping: ibm2ieee/ibm2ieee/_ibm2ieee.c at main Ā· enthought/ibm2ieee Ā· GitHub. It’s worth checking out the README for usage, especially considering endianness, and the license.

1 Like

@matthias314 could you please explain this line?

I understand that f0 << k+1 would move the radix point to the left as explained in Wikipedia. Could we avoid the 28 here if we wrapped the result in Int64 after the shift?

Int64(f0 << k + 1)

I am trying to simplify the code as much as possible without these offsets related to changing values from 32 to 64 words.

Here is my updated implementation. It is not complete yet because I need some help to make sure I am understanding the bit manipulations correctly. Appreciate any comments:

function Base.convert(::Type{Float64}, x::IBMFloat32)
  b = reinterpret(UInt32, x)
  s = b & 0x80000000
  e = b & 0x7f000000
  f = b & 0x00ffffff

  # if fraction is zero, value is zero
  iszero(f) && return zero(Float64)

  # shift sign bit to the rightmost position
  sieee = Int64(s >> 31)

  # shift exponent bits to the right (3 bytes = 24 bits)
  # and multiply by 4 to convert from base 16 to base 2
  # hence, the resulting shift is 24 - 2 = 22 bits
  eieee = Int64(e >> 22)

  # normalize fraction until the leading digit is equal to 1,
  # and adjust the exponent accordingly (opposite shift)
  d = f & 0x00f00000
  while iszero(d)
    f <<= 4
    eieee -= 4
    d = f & 0x00f00000
  end
  fieee = UInt64(f) << 32 # pad with zeros to the right

  # what is the correct way to combine the parts into Float64?
  sieee, eieee, fieee
end

Do you agree this implementation makes sense up to this point?

The next step is to deal with ā€œbiasesā€ in these floating point formats, which is a concept that I am not familiar with: ibm2ieee/ibm2ieee/_ibm2ieee.c at 164f4d4aee55560a0f4075f1f98d707496ffa669 Ā· enthought/ibm2ieee Ā· GitHub

The highest set bit in f0 (at position 23-k if counting from 0) will be dropped in Float64. The following bit (at position 22-k) has to be moved to position 51. Hence you have to shift it left by 28+k+1 positions. The highest set bit now is at position 52, and the & clears it. So you cannot avoid the 28.

Also, if you want to simplify the code, then instead of computing (in my notation) s0 and s1, you could say

y = reinterpret(Float64, e1 | f1)
return flipsign(y, x % Int32)

The book ā€œNumbers and Computersā€ includes this C function to convert 32-bit IBM S/360 floating-point numbers to IEEE double-precision numbers.

1 Like

Your initial d = f & 0x00f00000 is always non-zero (for non-zero x). Depending on where it has the highest set bit, you have to adjust fieee and eieee.

The next step is to deal with ā€œbiasesā€

For IBM floating point, an exponent of 64 means 0. In IEEE floats, an exponent of 1023 means 0.

what is the correct way to combine the parts into Float64?

After shifting the fields to the correct positions, you use reinterpret.

It might help to look at the example conversions on the Wikipedia pages for these formats and to test your code against them.

@matthias314 just to make sure we are on the same page, could you please suggest minimum modifications to the last version I shared? I thought I had a reasonable solution.

Regarding this comment, I thought I got the issue. Even though I am knocking out the bits, the iszero(d) check cannot be applied directly, I should have shifted the bits to the right before calling iszero, right?

I like this solution @xiaoxi , it is very similar to what I’ve been trying to do. That is, compute s, e and f (or m) and produce the final Float64.

Did you check the license of the code before publishing it? (I did not, I don’t have the book)

The book’s author has a Git repository with the book’s code.

Since I couldn’t find this function there, I’ve decided to edit my post. I apologize for the inconvenience; I thought I could publish it under fair use. I’ll be more careful next time.