When converting a Rational type to floating point, I expect the result to be accurate to machine precision. In other words, Float64(a//b) should be the closest floating-point number to a/b.
I found some instances that do not satisfy this requirement, for example:
julia> r = 18940981//3
18940981//3
julia> Float32(r) == Float32(Float64(r))
false
here I’m using Float64 as a proxy for a higher-precision calculation. This happens because the numerator is not exactly representable as a Float32. I think there exist examples for Float64 as well, but I don’t have one at the moment (I mined this example)
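One way to check whether a given conversion is correctly rounded is to compare it against a single rounding from a much higher precision reference. This is just a sketch; I’m assuming BigFloat’s default 256-bit precision is more than enough for Int64 operands:

```julia
r = 18940981 // 3

# reference: both operands convert exactly to BigFloat, the division is
# correctly rounded at 256 bits, then we round a single time to Float32
ref = Float32(BigFloat(numerator(r)) / BigFloat(denominator(r)))

Float32(r) == ref           # false: the direct conversion rounds twice
Float32(Float64(r)) == ref  # true: here the Float64 step rounds only once
```

The same check works for mining further counterexamples: any r with `Float32(r) != ref` exhibits the double-rounding problem.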
First of all, is my logic correct? And should we care about improving the accuracy of such conversions? I’m using rational numbers to compute high-order finite difference stencils, and I convert them back to floating point when assembling a matrix, so it makes sense to have the most accurate representation possible.
Here is the Julia implementation of the conversion:
AbstractFloat(x::Rational) = (float(x.num)/float(x.den))::AbstractFloat
function (::Type{T})(x::Rational{S}) where T<:AbstractFloat where S
    P = promote_type(T, S)
    convert(T, convert(P, x.num)/convert(P, x.den))::T
end
I have also checked the Boost rational implementation, and it uses the same logic: first convert to the floating-point type, then perform the division.
The inequality arises because converting the Rational{Int64} 18940981//3 (exactly 6313660 + 1/3) to Float32 directly differs from converting it first to Float64 (higher precision) and then to Float32.
Right, the algorithm of first converting the numerator and denominator to the floating-point type, and then doing the division, rounds twice whenever the numerator or denominator is not exactly representable, so it is fundamentally not guaranteed to give you exact rounding.
How does parsing of float literals work? All literals are rational. Granted, literals can only express rationals whose denominators are powers of 10, but powers of 10 aren’t particularly nice in binary, so I’m curious how hard it would be to generalize the algorithm.
Float parsing is simpler because the limited set of denominators allows the use of precomputed tables, but my understanding is that it may still use higher precision at intermediate steps.
Suppose that the literal has fewer than 19 digits, so that the exact numerator is representable as a UInt64, but it is bigger than 2^{53} so it is not exactly representable as a Float64. In that case, for example, Lemire (2021) “relies on a precomputed table of 128-bit values T[q] for decimal exponents q ∈ [−342, 308]”, and even then it must occasionally “fall back on a higher-precision approach in uncommon instances” for q ∉ [−27, 55].
Can we first separate the rational number into its integer and fractional parts? This way:
if the integer part is already not exactly representable, the fractional part won’t affect the result;
otherwise, the rounding affects only the fractional part.
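The split can be written with divrem, which computes the integral part exactly. This is only a sketch of the idea (the name is mine), and it still rounds more than once in general:

```julia
# hypothetical split-based conversion: x == q + r/den exactly
function split_convert(::Type{T}, x::Rational) where {T<:AbstractFloat}
    q, r = divrem(numerator(x), denominator(x))
    # q is handled by a single rounding; the fractional part still
    # involves extra roundings when r or den is not representable in T,
    # and the final sum rounds once more
    T(q) + T(r) / T(denominator(x))
end
```

For the running example, `split_convert(Float32, 18940981//3)` gives 6.3136605f6, the correctly rounded value, because here r = 1 and den = 3 are both exact in Float32.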
Regarding the comments about performance: the accurate conversion is made once, at the end of a long rational calculation. If we “cash out” only at the end, it makes sense to have the highest precision possible.
It’s the opposite. The first rounding is exact in Float64 and inexact in Float32, so the Float64 version performs only one rounding, while the Float32 version performs two.
Right, and in particular you can see that the Float32(num // den) version is farther from the exact rational than Float32(Float64(num // den)), so the former is not the “correctly rounded” (closest) Float32 value:
In particular, note that Float32(r) differs from r by exactly -\frac{1}{3}, while Float32(Float64(r)) differs from r by exactly \frac{1}{6}, and \frac{1}{3} > \frac{1}{6} so the former is not correctly rounded.
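These differences can be checked exactly with rational arithmetic, since a finite float converts exactly to a Rational:

```julia
r = 18940981 // 3

# exact differences between each converted value and the true rational
Rational{Int}(Float32(r)) - r           # -1//3
Rational{Int}(Float32(Float64(r))) - r  # 1//6
```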
I’m not sure why that would ensure correct rounding? You still have multiple rounding steps in general.
I agree that performance may not be crucial for this operation. Certainly it seems worthwhile to use Float64 even if the final result is Float32. Not sure if it is worth it to use BigFloat?
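For reference, a BigFloat route might look like the following sketch. The choice of 128 bits for the intermediate precision is an assumption of mine for Rational{Int64} operands, not something I have proven:

```julia
# assumed-sufficient wide intermediate: both operands convert exactly to
# BigFloat, the division rounds once at 128 bits, and the final conversion
# rounds once more -- the extra bits are meant to make a damaging double
# rounding unlikely (assumption, not a proof)
function via_bigfloat(::Type{T}, x::Rational{Int64}) where {T<:AbstractFloat}
    setprecision(BigFloat, 128) do
        T(BigFloat(numerator(x)) / BigFloat(denominator(x)))
    end
end
```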
(Personally, I often use rational to express literal constants in precision-independent code, in which case the conversion is done at compile-time thanks to constant-propagation: julia#32024.)
The advantage of splitting is that you can use an integer div to compute the integral part of the answer exactly, reducing the size of the problem a good deal.
I don’t see how it reduces the problem much in general, since if the denominator is too big to be represented then you will still incur multiple rounding steps in dealing with the fractional part.
And, of course, it doesn’t help at all with rational numbers where the integral part is zero. For example, with:
r = 603869355627654//2358152114619205 # ≈ 0.256…
you have the same problem: Float32(r) is not correctly rounded to the closest Float32 value.
Another thing to keep in mind is that if your rational numbers have huge denominators and you are doing a big rational calculation, they are very likely to overflow Int. And if you are using BigInt you really don’t care about performance for the final Float32 or Float64 conversion.
Whereas for small rational constants like 1//3 there is no problem, and we can quickly check for this fast path by comparing to maxintfloat.
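The fast-path test could look roughly like this sketch (the function name is mine); maxintfloat(T) is the threshold up to which every integer is exactly representable in T:

```julia
# sketch: if both parts convert exactly, the IEEE division rounds just
# once, so a single T divide already gives the correctly rounded result
function has_fast_path(::Type{T}, x::Rational) where {T<:AbstractFloat}
    m = maxintfloat(T)
    abs(numerator(x)) <= m && denominator(x) <= m
end
```

For example, 1//3 takes the fast path for Float32 (both 1 and 3 are below maxintfloat(Float32) = 2^24), while 18940981//3 does not.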
Right, I forgot about this type. For Rational{Int64} this should work, I think? We’d need to implement a better constructor from Int64 (this seems like a bug, in fact).
In case we want to limit the use of higher-precision computation, I was able to figure out one of the branches in the decision tree. Pardon the pseudocode:
if denominator is representable
    if numerator is representable
        trivial case
    else
        produce the integral and fractional parts
        (now the fractional part is composed of representable quantities)
        divide them using the target precision
        sum the result to the integral part
When the denominator is not representable, things get a little more complicated… I still have to figure it out
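A direct Julia transcription of that branch might look like this sketch (identifiers are mine, and I am not claiming correct rounding, only the structure of the decision tree):

```julia
# hypothetical transcription of the branch above; falls back to the
# current behavior when the denominator is not representable
function branch_convert(::Type{T}, x::Rational) where {T<:AbstractFloat}
    num, den = numerator(x), denominator(x)
    m = maxintfloat(T)
    if den <= m
        if abs(num) <= m
            return T(num) / T(den)    # trivial case: a single rounding
        else
            q, r = divrem(num, den)   # integral and fractional parts
            # r < den <= m, so r and den are representable; q may still
            # round, and the final sum rounds again
            return T(q) + T(r) / T(den)
        end
    end
    # denominator not representable: unresolved, use the existing algorithm
    return T(num) / T(den)
end
```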
I have an update on the proposed conversion routine:
even the happy case, where the denominator is representable but the numerator is not, may give an error greater than the representation error, due to double rounding…
The general solution would be to compute the mantissa bits manually via shifts and integer division. I’m currently doing a small performance study in which I document the problem and the design decisions; at the end there will be a speed comparison between a manual routine and conversion through BigFloat.
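To make the idea concrete, here is a sketch of such a routine for Float32 with positive Int64 inputs (the name and structure are mine; it ignores signs, zero, subnormals, and overflow, and I have only spot-checked it):

```julia
# Correctly rounded num/den -> Float32 via shifts and one integer divide.
# Sketch only: assumes 0 < num, den < 2^63; no subnormal/overflow handling.
function rat_to_f32(num::Int64, den::Int64)
    # bit lengths give num/den ∈ [2^(e-1), 2^(e+1))
    e = (64 - leading_zeros(num)) - (64 - leading_zeros(den))
    # scale so the integer quotient has 25 or 26 bits: 24 significand bits
    # for Float32 plus a round bit, with the remainder acting as sticky
    s = 25 - e
    n, d = Int128(num), Int128(den)
    s >= 0 ? (n <<= s) : (d <<= -s)
    q, r = divrem(n, d)            # exact value is (q + r/d) * 2^(-s)
    sticky = r != 0
    if q >= Int128(1) << 25        # 26 bits: drop one, fold it into sticky
        sticky |= isodd(q)
        q >>= 1
        s -= 1
    end
    round_bit = isodd(q)           # q now has exactly 25 bits
    q >>= 1
    s -= 1
    if round_bit && (sticky || isodd(q))
        q += 1                     # round to nearest, ties to even
    end
    ldexp(Float32(q), -s)          # q ≤ 2^24, so Float32(q) is exact
end
```

Generalizing to Float64 would mean 53 + 1 quotient bits and correspondingly wider intermediates; for Rational{Int64} the shifted operands above stay below 2^89, so Int128 suffices.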
The problem is quite interesting and there are only a few online resources. Once the analysis is complete, I would like to open a PR with the implementation and some docs.