In some statistical applications (and probably many others) we sometimes have to evaluate sum(log.(x)) for some large vector x. The obvious improvement to this is sum(log, x), but this still requires computing log for each element of an array, which can be expensive.
Playing around with this a bit, I came up with this:
const logtwo = log(2.0)
function sumlog(x::AbstractArray{T}) where {T}
sig = one(T)
ex = zero(Int64)
for xj in x
if sig > one(T)
sig *= 0.5 * significand(xj)
ex += exponent(xj) + 1 # make up the missing factor of 2 here
else
sig *= significand(xj)
ex += exponent(xj)
end
end
sig + logtwo * ex
end
This comes from two ideas:
x == ldexp(significand(x), exponent(x)), and
Keeping the product balanced around one, to avoid underflow and overflow.
But I often see this off-by-one error, and Iām not sure where it comes from.
Also, Iām guessing the sig > one(T) in the loop might be slowing things down, so maybe this could be faster still? We do need to have something to offset the significand, since itās always in [1,2) and will otherwise overflow with a long enough vector.
This isnāt a critical thing, but speeding this up could be really nice.
Oh, thatās good! And we even have a lower bound on the number of iterations before we really need to check. So we can let it spin for a while with no branching, then check once and make a bigger correction. Yeah, I like that.
Of course, I thought that might be something obvious.
Thanks to you both! Iāll make some adjustments and post the result here soon.
julia> function sumlog(x::AbstractArray{T}) where {T}
sig = one(T)
ex = zero(exponent(one(T)))
bound = floatmax(T) / 2
for xj in x
sig *= significand(xj)
ex += exponent(xj)
if sig > bound
(a, b) = (significand(sig), exponent(sig))
sig = a
ex += b
end
end
log(sig) + logtwo * ex
end
sumlog
julia> x = 10 * rand(100);
julia> @btime sumlog($x)
127.330 ns (0 allocations: 0 bytes)
129.854
julia> @btime sum(log, $x)
680.462 ns (0 allocations: 0 bytes)
129.854
julia> sumlog(x) - sum(log ā big, x)
2.83327e-15
Oh, but 100 iterations isnāt ever enough to overflow, so we should check for larger arrays:
This looks great. Are there downsides to making this the default? Minimally just sum(::typeof(log), x::Array{Union{Float32, Float64}}), although perhaps some mapreduce(f, log, x) could too.
One potential concern is if some hardware might have very fast log implementation, but could be very slow for branches. As I type that, GPUs come to mind, though I havenāt done enough GPU work to know whether a new method for sum(::typeof(log), ::AbstractArray) would end up being unintentionally used in GPU kernels.
Yes, I presume this wonāt work with GPU, although perhaps similar tricks might. I would worry a little bit about introducing dispatch ambiguities, e.g. against sum(f, A::CuArray), but perhaps thereās a nice place in the mapreduce chain to attach this cleanly. Maybe sum(log, A; dims) deserves some thought too.
Seems accurate to me on quick examples. But possibly thinking up awful cases would be worthwhile. For speed as well, since youāll branch more or less often, and IIRC the speed log also varies with the size of the number.
I think these still arenāt taking the branch often, as itās much slower with ifelse.
A function could go in a package, but defining Base.sum(::typeof(log), x::Array{Float64}) would be piracy. Since log itself is defined in Base, it doesnāt seem crazy for something like this to be added too.
Agreed, but (correct me if Iām wrong Tyler) I read @tbeason as suggesting logsum could itself go into LogExpFunctions.jl. That seems like a good first step. If it makes it into Base as a specialized sum method thatās even better. LogExpFunctions.logsum could be deprecated at that time, or just redefined to call log(sum, x).
I think with some more cleanup, this is a good candidate to go in Base. If we make it logabsprod it can be a big help to speed up LinearAlgebra.logabsdet:
julia> x = LowerTriangular(randn(1000,1000));
julia> using LinearAlgebra
julia> using BenchmarkTools
julia> @btime logabsdet($x)
8.687 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
julia> d = diag(x);
julia> @btime logabsprod($d)
1.202 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
Currently we require an array, and I donāt know of a way to lazily get a diagonal in Base. But I think itās clear there are ways to get things connected properly.
I suspect that in most performance-sensitive uses of logabsdet there will be a matrix factorization (or other operation) that will dominate the cost. Otherwise, why construct a matrix at all?
Yeah I guess thatās expected, each diagonal element lives very far away from the previous, so unless your cache lines are very long I guess youāll miss every time. If you include the timing of extracting d = diag(x) Iād guess the lazy alternative would still be faster.