In some statistical applications (and probably many others) we sometimes have to evaluate sum(log.(x)) for some large vector x. The obvious improvement to this is sum(log, x), but this still requires computing log for each element of an array, which can be expensive.
Playing around with this a bit, I came up with this:
const logtwo = log(2.0)

function sumlog(x::AbstractArray{T}) where {T}
    sig = one(T)
    ex = zero(Int64)
    for xj in x
        if sig > one(T)
            sig *= 0.5 * significand(xj)
            ex += exponent(xj) + 1 # make up the missing factor of 2 here
        else
            sig *= significand(xj)
            ex += exponent(xj)
        end
    end
    sig + logtwo * ex
end
This comes from two ideas:
1. x == ldexp(significand(x), exponent(x)), and
2. keeping the product balanced around one, to avoid underflow and overflow.
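Spelling the first idea out (my own quick check, not part of the original post): taking logs of that identity gives log(x) == log(significand(x)) + exponent(x) * logtwo, which is exactly what the loop accumulates in pieces:

julia> x = 12.345;

julia> x == ldexp(significand(x), exponent(x))
true

julia> log(x) ≈ log(significand(x)) + exponent(x) * log(2.0)
true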
But the result often has an off-by-one error, and I'm not sure where it comes from.
Also, I'm guessing the sig > one(T) check in the loop might be slowing things down, so maybe this could be faster still. We do need something to offset the significands, though: each one is in [1,2), so the running product would otherwise overflow for a long enough vector.
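(To make the overflow point concrete with a deterministic example of my own: a product of factors in [1,2) blows past floatmax(Float64) ≈ 1.8e308 after a couple thousand elements.)

julia> prod(fill(1.5, 3000))   # 1.5^3000 ≈ 10^528, far past floatmax(Float64)
Inf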
This isn’t a critical thing, but speeding this up could be really nice.
Oh, that’s good! And we even have a lower bound on the number of iterations before we really need to check. So we can let it spin for a while with no branching, then check once and make a bigger correction. Yeah, I like that.
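Roughly this, maybe (an untested sketch of that idea; sumlog_blocked and the block size are my own scratch choices): each significand is below 2, so over B steps sig grows by less than a factor of 2^B, and with B = 1000 an accumulator that starts below 2 cannot reach floatmax(Float64) ≈ 2^1024 between checks.

# rough sketch (hypothetical): renormalize once per block instead of per element
function sumlog_blocked(x::AbstractVector{Float64})
    sig = 1.0
    ex = 0
    B = 1000                      # 2 * 2^1000 < floatmax(Float64) ≈ 2^1024
    i = firstindex(x)
    while i <= lastindex(x)
        stop = min(i + B - 1, lastindex(x))
        for j in i:stop           # branch-free inner loop
            sig *= significand(x[j])
            ex += exponent(x[j])
        end
        ex += exponent(sig)       # one bigger correction per block
        sig = significand(sig)
        i = stop + 1
    end
    log(sig) + logtwo * ex
end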
Of course, I thought it might be something obvious: the last line should be log(sig) + logtwo * ex, not sig + logtwo * ex. And since sig always lands in [1,2), dropping the log shifts the result by sig - log(sig), which lies between 1 and about 1.31. That explains the "off-by-one" I was seeing.
Thanks to you both! I’ll make some adjustments and post the result here soon.
julia> function sumlog(x::AbstractArray{T}) where {T}
           sig = one(T)
           ex = zero(exponent(one(T)))
           bound = floatmax(T) / 2
           for xj in x
               sig *= significand(xj)
               ex += exponent(xj)
               if sig > bound
                   (a, b) = (significand(sig), exponent(sig))
                   sig = a
                   ex += b
               end
           end
           log(sig) + logtwo * ex
       end
sumlog (generic function with 1 method)
julia> using BenchmarkTools

julia> x = 10 * rand(100);
julia> @btime sumlog($x)
127.330 ns (0 allocations: 0 bytes)
129.854
julia> @btime sum(log, $x)
680.462 ns (0 allocations: 0 bytes)
129.854
julia> sumlog(x) - sum(log ∘ big, x)
2.83327e-15
Oh, but 100 elements is never enough to overflow, so we should also check larger arrays:
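Something like this, say (an illustrative check of mine, with an arbitrary size and no timings; the running significand product needs rescaling roughly every thousand elements, so this exercises the branch many times):

julia> x = 10 * rand(1_000_000);

julia> @assert sumlog(x) ≈ sum(log, x)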
This looks great. Are there downsides to making this the default? Minimally just sum(::typeof(log), x::Array{<:Union{Float32, Float64}}), although perhaps some mapreduce(log, +, x) methods could benefit too.
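(For concreteness, the minimal version would be something like the line below, as a sketch rather than a final design; note the Union has to go in the type parameter bound, or the method only matches arrays whose element type is itself a Union:)

# hypothetical method; as discussed further down, defining this outside Base would be piracy
Base.sum(::typeof(log), x::Array{<:Union{Float32, Float64}}) = sumlog(x)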
One potential concern is hardware with a very fast log implementation but very slow branches. As I type that, GPUs come to mind, though I haven't done enough GPU work to know whether a new method for sum(::typeof(log), ::AbstractArray) would end up being unintentionally used in GPU kernels.
Yes, I presume this won't work on a GPU, although perhaps similar tricks might. I would worry a little about introducing dispatch ambiguities, e.g. against sum(f, A::CuArray), but perhaps there's a nice place in the mapreduce chain to attach this cleanly. Maybe sum(log, A; dims) deserves some thought too.
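(One cheap way to prototype the dims question without touching the mapreduce machinery at all, using my own scratch name, and surely not how Base would actually wire it:)

# hypothetical wrapper: apply the scalar reduction slice-by-slice;
# mapslices keeps the reduced dimensions as size 1, matching sum(...; dims)
sumlog_dims(x::AbstractArray; dims=:) =
    dims === (:) ? sumlog(x) : mapslices(sumlog, x; dims=dims)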
Seems accurate to me on quick examples. But it could be worthwhile to think up awful cases, for speed as well as accuracy, since they change how often you take the branch, and IIRC the speed of log itself varies with the size of its argument.
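(A couple of awful cases that come to mind, my examples rather than anything tested above: exponent and significand throw DomainError on zeros and non-finite values, where log is perfectly well defined:)

sum(log, [1.0, 0.0])    # -Inf
sumlog([1.0, 0.0])      # throws DomainError: exponent(0.0) is undefined
sum(log, [1.0, Inf])    # Inf
sumlog([1.0, Inf])      # throws DomainError: exponent(Inf) is undefined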
I think these examples still aren't taking the branch often; replacing it with a branch-free ifelse makes things much slower.
A function could go in a package, but defining Base.sum(::typeof(log), x::Array{Float64}) would be piracy. Since log itself is defined in Base, it doesn’t seem crazy for something like this to be added too.
Agreed, but (correct me if I'm wrong, Tyler) I read @tbeason as suggesting logsum could itself go into LogExpFunctions.jl. That seems like a good first step. If it makes it into Base as a specialized sum method, that's even better; LogExpFunctions.logsum could be deprecated at that point, or just redefined to call sum(log, x).
I think with some more cleanup, this is a good candidate to go in Base. If we make it logabsprod it can be a big help to speed up LinearAlgebra.logabsdet:
julia> using LinearAlgebra

julia> using BenchmarkTools

julia> x = LowerTriangular(randn(1000,1000));
julia> @btime logabsdet($x)
8.687 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
julia> d = diag(x);
julia> @btime logabsprod($d)
1.202 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
Currently we require an array, and I don’t know of a way to lazily get a diagonal in Base. But I think it’s clear there are ways to get things connected properly.
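(For reference, the logabsprod I'm benchmarking is essentially sumlog with sign tracking, along these lines; a reconstruction, and like sumlog it assumes nonzero, finite entries:)

function logabsprod(x::AbstractArray{T}) where {T<:AbstractFloat}
    sig = one(T)
    ex = zero(exponent(one(T)))
    sgn = one(T)
    bound = floatmax(T) / 2
    for xj in x
        sgn = flipsign(sgn, xj)          # accumulate the sign of the product
        sig *= abs(significand(xj))      # significand carries xj's sign
        ex += exponent(xj)
        if sig > bound
            ex += exponent(sig)
            sig = significand(sig)
        end
    end
    (log(sig) + logtwo * ex, sgn)
end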
I suspect that in most performance-sensitive uses of logabsdet there will be a matrix factorization (or other operation) that will dominate the cost. Otherwise, why construct a matrix at all?
Yeah, I guess that's expected: each diagonal element lives very far from the previous one, so unless your cache lines are very long you'll miss every time. If you include the time to extract d = diag(x), I'd guess the lazy alternative would still be faster.
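(For what it's worth, LinearAlgebra does have diagind, so a lazy diagonal is just a view; an untested sketch:)

using LinearAlgebra

# lazy diagonal: a strided view along diagind(x), avoiding the copy diag makes;
# the stride is n + 1 elements for an n×n matrix, hence the cache misses
d_lazy = view(x, diagind(x))
logabsprod(d_lazy)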