Quick probably basic question on Floats

I was wondering if someone could explain what is going in with the below code to me:

function f3264()
  v32 = rand(Float32, 3)
  @info v32
  v64 = Float64.(v32)
  @info v64
  Δv = v64 .- v32
  @info Δv
  @assert sum(Δv)==0
end


f3264()

Output:

[ Info: Float32[0.674268, 0.45975626, 0.8845744]
[ Info: [0.6742680072784424, 0.4597562551498413, 0.8845744132995605]
[ Info: [0.0, 0.0, 0.0]

I’m not sure what to make of the extra digits present in the Float64 presentation (e.g. why isn’t the first Float64 value 0.6742680000000000.) The subtraction gives zero, but I’m not sure why- is this undefined behavior?

(edit: because we’re not showing all digits)

julia> a = parse(Float32,"0.674268")
0.674268f0

julia> @sprintf("%.30f", a)
"0.674268007278442382812500000000"

in this particular case, 0.674268 can be represented by Float64 exactly so this behavior does look puzzling. But I can understand that it’s hard to decide what is the ‘closest Float64 to a Float32’ since it’s not like integer where you just add a bunch of 0 to the bit representation.

Regarding why they subtract to zero, this is because when you do subtraction, the first thing Julia will be doing is type promotion:

julia> @enter Float32(3) - 3
In -(x, y) at promotion.jl:313
>313  -(x::Number, y::Number) = -(promote(x,y)...)

About to run: (promote)(3.0f0, 3)

and it looks like the following can be guaranteed:

julia> ary = Float32[0.674268, 0.45975626, 0.8845744]
3-element Array{Float32,1}:
 0.674268
 0.45975626
 0.8845744

julia> ary .|> Float64
3-element Array{Float64,1}:
 0.6742680072784424
 0.4597562551498413
 0.8845744132995605

julia> ary .|> Float64 .|> Float32
3-element Array{Float32,1}:
 0.674268
 0.45975626
 0.8845744
3 Likes

Huh- I understand what you are saying, and I find it helpful, but now I’m really confused about how the conversion is working.

I’m not sure how the closest Float64 to these particular Float32s could be anything other than the exact value. I understand this doesn’t break the circular conversion guarantee, but it seems like Julia (or the standard it is basing the Float conversions on) is giving the wrong Float64 values when the correct values are entirely within the capabilities of the Float64 format:

function f3264()
  v32 = rand(Float32, 3)
  @info "v32: $v32"
  v64 = Float64.(v32)
  @info "v64: $v64"
  Δv = v64 .- v32
  @info "Δv: $Δv"

  v64parsed = (x->parse(Float64, x)).((string).(v32))
  @info "v64parsed: $v64parsed"
  Δv64parsed = v64parsed .- v32
  @info "Δv64parsed: $Δv64parsed"
end


f3264()

Output:

[ Info: v32: Float32[0.15743017, 0.3225857, 0.904451]
[ Info: v64: [0.15743017196655273, 0.32258570194244385, 0.9044510126113892]
[ Info: Δv: [0.0, 0.0, 0.0]
[ Info: v64parsed: [0.15743017, 0.3225857, 0.904451]
[ Info: Δv64parsed: [-1.9665527262180404e-9, -1.942443872415822e-9, -1.2611389155203767e-8]

Am I missing something here? I’m hesitant to call it a bug when it might just be an artifact of the relevant IEEE standard, but it sure seems bizarre.

yeah, if you look at what Julia does when casting these two types, you hit intrinsic:

nvm, this number for example, 0.674268 can’t be represented exactly, and the Float32 is actually:

julia> a = parse(Float32,"0.674268")
0.674268f0

julia> @sprintf("%.30f", a)
"0.674268007278442382812500000000"
0.6742680072784423828125

#include <iostream>
#include <cmath>
#include <bits/stdc++.h>
using namespace std;

int main(){
    assert(CHAR_BIT * sizeof (float) == 32);
    assert(CHAR_BIT * sizeof (double) == 64);
    cout.precision(30);
    cout << fixed;
    float  a   = 0.674268;
    double a64 = 0.674268;
    cout << a << endl;
    cout << a64 << endl;
    return 0;
}
// output
0.674268007278442382812500000000
0.674267999999999978477660533827


bonus, for C++, you can’t convert float to double naively)

2 Likes

All of the Float16s, Float32s, and Float64s that are finite (not Inf, not NaN) are really rational numbers:
numerator // denominator, and every denominator is some positive power of 2. When the floating point values are neither tiny nor huge, we can see them easily.

For example:

# what you see is the base 10 representation
#   of a binary fraction

julia>  sqrt2_16bits = sqrt(Float16(2.0))
Float16(1.414)

julia> sqrt2_32bits = sqrt(Float32(2.0))
1.4142135f0

julia> sqrt2_64bits = sqrt(Float64(2.0))
1.4142135623730951

# here are the binary fractions that are used
#    to paint the base 10 numbers

julia> convert(Rational{Int64}, sqrt2_16bits)
181//128

julia> ispow2(denominator(ans))
true

julia> convert(Rational{Int64}, sqrt2_32bits)
11863283//8388608

julia> ispow2(denominator(ans))
true

julia> convert(Rational{Int64}, sqrt2_64bits)
6369051672525773//4503599627370496

julia> ispow2(denominator(ans))
true

Let’s test a bunch of random floats to see if their denominators are powers of 2.

xs16 = rand(Float16, 10_000);
xs32 = rand(Float32, 10_000);
xs64 = rand(Float64, 10_000);
# convert them to rational numbers
rat16 = map(x->convert(Rational{Int64},x), xs16);
rat32 = map(x->convert(Rational{Int64},x), xs32);
rat64 = map(x->convert(Rational{Int64},x), xs64);
# define our test
isdenom_powerof2(x) = ispow2(denominator(x));
# test them
julia> all(map(isdenom_powerof2, rat16)) &&
       all(map(isdenom_powerof2, rat32)) &&
       all(map(isdenom_powerof2, rat64))
true
6 Likes