I need to consume binary data from an old file format that uses 4-byte IBM floating point as opposed to IEEE floating point (i.e., Float32). The 4-byte IBM floating point type is also known as IBM HFP (hexadecimal floating point):
By reading the Julia documentation, I tried to define the following primitive type with a conversion method to Float32:
"""
IBMFloat32 <: AbstractFloat
32-bit (4-byte) IBM floating point number type (a.k.a. IBM HFP).
See https://en.wikipedia.org/wiki/IBM_hexadecimal_floating-point.
The 4 bytes stored in the type are organized as follows:
[S | EEEEEEE | FFFFFFFF | FFFFFFFF | FFFFFFFF]
S is the sign bit, EEEEEEE is the 7-bit exponent (base 16, excess-64),
and FFFFFFFF (x3) are the 24 bits (3 bytes) of the fraction.
If S = 1, the number is negative; if S = 0, the number is positive.
The value represented is given by
(-1)^S × (FFFFFFFF|FFFFFFFF|FFFFFFFF) × 16^(EEEEEEE - 64)
If the fraction is zero, the value is zero regardless of the exponent or sign.
"""
primitive type IBMFloat32 <: AbstractFloat 32 end
function convert(::Type{Float32}, x::IBMFloat32)
# extract 4 bytes from IBMFloat32
b1, b2, b3, b4 = reinterpret(NTuple{4,UInt8}, x)
# compute fraction from last 3 bytes
F = reinterpret(UInt32, (zero(UInt8), b2, b3, b4))
# if fraction is zero, value is zero
iszero(F) && return zero(Float32)
# compute sign from first bit
S = b1 >> 7 # shift first bit all the way to the right
# compute exponent from remaining 7 bits
E = b1 & 0b01111111 # knock out first bit in byte
Float32((-1)^S * F * 16^(E - 64))
end
I am not sure if the implementation is correct yet. In order to test the implementation I need to read bytes from an IO stream as IBMFloat32. I tried the following:
I don’t think this conversion is feasible, the base in IBM hexadecimal floating point is 16 instead of 2 (the intermediate powers of 2 are presumably represented by the significand with more leading zeros, effectively less precision), so it has a much wider range (10^-79 to 10^75) compared to Float32 (10^-45 to 10^38). Maybe Float64? The link does say
in order to have similar exponent range in binary, 9 exponent bits would be required.
s0, e0, f0 are the bit fields of the sign, the exponent and the fraction in the input. s1, e1 and f1 are those for the output. k is the number of leading zeros in the 24-bit number f0. You need it to adjust the exponent and the fraction. For example, if k == 2, then the (binary) fraction starts with 0.001, but you want it to start with 1., which is a shift by k+1 positions. Int64(e0) >> 22 is 4 times the exponent. (We have to change from base 16 to base 2.) You have to subtract 4*64 and k+1 and add 1023 to get the right bias. Int64(f0) << (28 + k + 1) moves the fraction to the correct position in Float64 (changing from width 24 to 52). However, now the most significant bit set in f0 is at position 2^52. Since it is not stored in Float64, you have to clear it, which I do with & 0x000fffffffffffff.
Unless there’s a smaller binary format with 9 exponent bits and 24 explicit significand bits, then probably. Bonus performance from Float64 operations being implemented in hardware.
I’d also agree with starting from matthias314’s f(x::UInt32). There’s no need to make a new type IBMFloat32 just to read bits to begin with; starting at UInt32 is enough, though you wouldn’t use the function or method convert(::T, x::UInt32) to reinterpret it. It also makes more sense to manipulate bits and reinterpret directly to the output type, rather than implementing the formula (-1)^S * F * 16^(E - 64) with numeric types that don’t need to be involved. I would also point out that the link’s formula is (−1)^{sign} × 0.<significand> × 16^{exponent−64}, meaning the significand’s bits represent a binary fraction <1, not F::UInt32.
Implementing the IBMFloat32 type, its read, and its convert would be more worth it if you also implement its particular operations, and so far it doesn’t appear you intend to emulate IBM hexadecimal computation and its notably wobbling precision.
I agree with @Benny. Apart from the missing factor 2^{-24}, I think the bit order is wrong in the OP code. Here is a modification that should work:
function juliohm(x::UInt32)
# extract 4 bytes from IBMFloat32
b1, b2, b3, b4 = reinterpret(NTuple{4,UInt8}, x)
# compute fraction from first 3 bytes
F = reinterpret(UInt32, (b1, b2, b3, zero(UInt8)))
# if fraction is zero, value is zero
iszero(F) && return zero(Float32)
# compute sign from highest bit
S = b4 >> 7 # shift highest bit all the way to the right
# compute exponent from remaining 7 bits
E = b4 & 0b01111111 # knock out first bit in byte
Float64((-1)^S * F * 2.0^(-24) * 16.0^(E - 64))
end