I am trying to understand LinearAlgebra.__normalize!, quoted in full here:
@inline function __normalize!(a::AbstractArray, nrm)
    # The largest positive floating point number whose inverse is less than infinity
    δ = inv(prevfloat(typemax(nrm)))
    if nrm ≥ δ # Safe to multiply with inverse
        invnrm = inv(nrm)
        rmul!(a, invnrm)
    else # scale elements to avoid overflow
        εδ = eps(one(nrm))/δ
        rmul!(a, εδ)
        rmul!(a, inv(nrm*εδ))
    end
    return a
end
Specifically, what is the two-step branch protecting against?
An example of how it fails without it would be useful too.
A simple example is [nextfloat(0.0)]. In short, there are floating point numbers greater than zero whose inverse is Inf. Multiplying by εδ/εδ (that is, scaling the array by εδ and then multiplying by inv(nrm*εδ), which is mathematically just dividing by nrm) avoids it.
julia> a = [nextfloat(0.0)]
1-element Vector{Float64}:
5.0e-324
julia> nrm = norm(a)
5.0e-324
julia> a*inv(nrm)
1-element Vector{Float64}:
Inf
julia> δ = inv(prevfloat(typemax(nrm)))
5.562684646268003e-309
julia> εδ = eps(one(nrm))/δ
3.99168061906944e292
julia> a*εδ
1-element Vector{Float64}:
1.9721522630525295e-31
julia> ans * inv(nrm*εδ)
1-element Vector{Float64}:
1.0
Any idea why specifically that factor (which is basically eps(floatmax(nrm)))? Seems like floatmin(nrm) / nextfloat(zero(nrm)) is a more intuitive and less extreme value that would do the trick. The point, as I understand it, is just to lift all subnormals into the normals.
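For what it's worth, here is a quick self-contained check (my own sketch; alt is just my name for the suggested alternative factor). Either factor lifts the worst-case norm, the smallest positive subnormal, back into invertible territory; the stdlib's choice just overshoots by far more:

nrm = nextfloat(0.0)                        # worst case: smallest positive subnormal
δ   = inv(prevfloat(typemax(nrm)))          # same threshold as in __normalize!

εδ  = eps(one(nrm)) / δ                     # the stdlib's factor, ≈ 4.0e292
alt = floatmin(nrm) / nextfloat(zero(nrm))  # the suggested alternative, 2^52 ≈ 4.5e15

# Either factor turns nrm into a number with a finite inverse,
# which is all the two-step rescaling needs.
@assert isfinite(inv(nrm * εδ))   # nrm * εδ ≈ 2.0e-31
@assert isfinite(inv(nrm * alt))  # nrm * alt == floatmin(Float64), the smallest normal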
Thanks for the detailed explanation, now I get it.
On a related note: my understanding was that performing float division as multiplication by the inverse is faster, but less accurate. I was curious how much less accurate it is, so I tested it with random values like
using LinearAlgebra, Statistics
function collect_discrepancy(::Type{T}, N, K) where T
    function _f(x)
        z1 = normalize(x, 2)        # __normalize! under the hood
        z2 = x ./ norm(x, 2)        # plain vanilla rdiv
        xb = BigFloat.(x)           # compare to "exact" result
        zb = T.(xb ./ norm(xb, 2))
        e1 = maximum(abs, z1 .- zb)
        e2 = maximum(abs, z2 .- zb)
        (; x, e1, e2)
    end
    [_f(randn(T, K)) for _ in 1:N]
end
and the results are that yes, it is a bit less accurate, but not significantly, eg
julia> results = collect_discrepancy(Float64, 1000, 16);
julia> mean(r.e1 ≥ r.e2 for r in results)
0.816
julia> mean(r.e1 for r in results)
6.058890368587643e-17
julia> mean(r.e2 for r in results)
5.729100462266434e-17
The same pattern emerges if I look at the outliers, or any other pertinent statistic. So it seems that this optimization is pretty innocuous.
Well, mostly…
julia> LinearAlgebra.normalize([49e0]) # pad vector with as many zeros as you want
1-element Vector{Float64}:
0.9999999999999999
julia> complex(49e0) / complex(49e0) # easter egg: Julia's complex division also uses invert-and-multiply
0.9999999999999999 + 0.0im
If you care that x / x == 1 in a particular situation then it’s best to avoid x * inv(x) (but generally this isn’t a concern). Also, computing the inverse isn’t free (except at compile time) and SIMD division is a thing, so if you’re doing just a few divisions by a particular number then division tends to be faster.
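To make that concrete, a quick REPL check of my own using the 49.0 example from above:

julia> x = 49.0;

julia> x / x            # division is correctly rounded, so x / x == 1 for any finite nonzero x
1.0

julia> x * inv(x)       # inv(49.0) is already rounded, and that error survives the multiply
0.9999999999999999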
I wonder if it would make sense to branch when the length is small. Using the (possibly naive) benchmark
using BenchmarkTools, LinearAlgebra
function _time(N)
    x = randn(N)
    z = randn()
    a = @belapsed ($x ./ $z)
    b = @belapsed LinearAlgebra.__normalize!(y, $z) setup=(y = copy($x))
    N => a / b
end
I get
so there is a tiny saving for small values.
A branch on length isn’t free either and would eat somewhat into this already-small gain. It would also be a tad annoying if zero-padding an array changed the result on the nonzeros (though like I argued above, in most situations one doesn’t care either way). In total, I don’t think the effort and extra code is worth the small gain.
Maybe for a small SArray one could make a case, but otherwise there isn’t a lot of precedent for aggressively optimizing array operations for small sizes. I was mostly pointing this out to remark that invert-and-multiply has artifacts and isn’t always a performance win versus a limited number of divisions.
That said, the second branch in __normalize! is implemented strangely. Maybe rmul! is faster than .*= (although it’s not easy to imagine how it could differ significantly), but doing it twice (as in the else branch) is definitely going to be slower than fusing those loops (while still preserving the order of operations). That could be an okay PR (I would say “good”, but it’s a pretty rare branch).
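For concreteness, a fused else branch could look something like the sketch below. The name __normalize_fused! is hypothetical; this is just my reading of “fuse the loops while preserving the order of operations”, not an actual PR:

using LinearAlgebra: rmul!

# Hypothetical sketch: one broadcast pass over `a` in the else branch,
# applying the same per-element operations (scale by εδ, then by inv(nrm*εδ)).
@inline function __normalize_fused!(a::AbstractArray, nrm)
    δ = inv(prevfloat(typemax(nrm)))
    if nrm ≥ δ
        rmul!(a, inv(nrm))
    else
        εδ = eps(one(nrm)) / δ
        invnrmεδ = inv(nrm * εδ)
        a .= (a .* εδ) .* invnrmεδ   # single fused loop instead of two rmul! passes
    end
    return a
end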
The place where I first noticed this invert-and-multiply behavior is in the complex division case I hinted at above. Our current implementation uses an invert-and-multiply which, in my testing, was marginally slower than divides. I might have made that PR long ago, but I have slightly bigger ambitions for complex division (fixing the x/x != 1 cases, which is tough without measurably reducing the performance) that I might get back to some day.
Or messing with AD, I guess.
You are right about optimizing the small case, I don’t think it is worth it.