Sum of logs

In some statistical applications (and probably many others) we sometimes have to evaluate sum(log.(x)) for some large vector x. The obvious improvement to this is sum(log, x), but this still requires computing log for each element of an array, which can be expensive.

Playing around with this a bit, I came up with this:

const logtwo = log(2.0)

function sumlog(x::AbstractArray{T}) where {T}
    sig = one(T)
    ex = zero(Int64)
    for xj in x
        if sig > one(T)
            sig *= 0.5 * significand(xj) 
            ex += exponent(xj) + 1       # make up the missing factor of 2 here
        else
            sig *= significand(xj) 
            ex += exponent(xj) 
        end
    end
    sig + logtwo * ex
end

This comes from two ideas:

  1. x == ldexp(significand(x), exponent(x)), and
  2. Keeping the product balanced around one, to avoid underflow and overflow.

It seems to do ok:

julia> x = rand(100);

julia> @btime sumlog($x)
  221.247 ns (0 allocations: 0 bytes)
-106.42

julia> @btime sum(log, $x)
  616.896 ns (0 allocations: 0 bytes)
-107.42

But I often see this off-by-one error, and I’m not sure where it comes from.

Also, I’m guessing the sig > one(T) in the loop might be slowing things down, so maybe this could be faster still? We do need to have something to offset the significand, since it’s always in [1,2) and will otherwise overflow with a long enough vector.

This isn’t a critical thing, but speeding this up could be really nice.

4 Likes

You can make it faster with if sig > floatmax(T)/2 which will branch predict much better (and re-normalize by a bigger power).

3 Likes

I think you mean log(sig) + logtwo * ex.

2 Likes

Oh, that’s good! And we even have a lower bound on the number of iterations before we really need to check. So we can let it spin for a while with no branching, then check once and make a bigger correction. Yeah, I like that.

:man_facepalming: Of course, I thought that might be something obvious.

Thanks to you both! I’ll make some adjustments and post the result here soon.

1 Like

Ok now we’re talkin’

julia> function sumlog(x::AbstractArray{T}) where {T}
           sig = one(T)
           ex = zero(exponent(one(T)))
           bound = floatmax(T) / 2
           for xj in x
               sig *= significand(xj) 
               ex += exponent(xj) 
               if sig > bound
                   (a, b) = (significand(sig), exponent(sig))
                   sig = a
                   ex += b
               end
           end
           log(sig) + logtwo * ex
       end
sumlog

julia> x = 10 * rand(100);

julia> @btime sumlog($x)
  127.330 ns (0 allocations: 0 bytes)
129.854

julia> @btime sum(log, $x)
  680.462 ns (0 allocations: 0 bytes)
129.854

julia> sumlog(x) - sum(log ∘ big, x)
2.83327e-15

Oh, but 100 iterations isn’t ever enough to overflow, so we should check for larger arrays:

julia> x = 10 .* rand(100000000);

julia> @time sumlog(x)
  0.128695 seconds (1 allocation: 16 bytes)
1.30265e8

julia> @time sum(log, x)
  0.697396 seconds (1 allocation: 16 bytes)
1.30265e8
14 Likes

This looks great. Are there downsides to making this the default? Minimally just sum(::typeof(log), x::Array{Union{Float32, Float64}}), although perhaps some mapreduce(f, log, x) could too.

2 Likes

I was just wondering this too. Here’s an accuracy comparison:

julia> x = 10 .* rand(1000000);

julia> exact = sum(log ∘ big, x)
1.302404677482756428200856004490688849646850626114695987885127324259071038329214e+06

julia> Float64(sumlog(x) - exact)
-1.078137309892319e-10

julia> Float64(sum(log, x) - exact)
1.2501691266463773e-10

One potential concern is if some hardware might have very fast log implementation, but could be very slow for branches. As I type that, GPUs come to mind, though I haven’t done enough GPU work to know whether a new method for sum(::typeof(log), ::AbstractArray) would end up being unintentionally used in GPU kernels.

Yes, I presume this won’t work with GPU, although perhaps similar tricks might. I would worry a little bit about introducing dispatch ambiguities, e.g. against sum(f, A::CuArray), but perhaps there’s a nice place in the mapreduce chain to attach this cleanly. Maybe sum(log, A; dims) deserves some thought too.

Seems accurate to me on quick examples. But possibly thinking up awful cases would be worthwhile. For speed as well, since you’ll branch more or less often, and IIRC the speed log also varies with the size of the number.

2 Likes

Maybe it could go here?

1 Like

log is mostly constant asked. it’s technically slower for numbers close to 1, but only slightly.

Ok. Strangely then, numbers near 1 seem slightly faster for me, but not by very much:

julia> begin
       x = rand(1000)./100;      @btime sum(log, $x); @btime sumlog($x);
       x = 1 .+ rand(1000)./100; @btime sum(log, $x); @btime sumlog($x);
       x = 100 .+ rand(1000);    @btime sum(log, $x); @btime sumlog($x);
       end;
  6.650 μs (0 allocations: 0 bytes)  # <<1
  2.042 μs (0 allocations: 0 bytes)
  4.863 μs (0 allocations: 0 bytes)  # about 1
  2.042 μs (0 allocations: 0 bytes)
  6.650 μs (0 allocations: 0 bytes)  # >>1
  2.042 μs (0 allocations: 0 bytes)

I think these still aren’t taking the branch often, as it’s much slower with ifelse.

A function could go in a package, but defining Base.sum(::typeof(log), x::Array{Float64}) would be piracy. Since log itself is defined in Base, it doesn’t seem crazy for something like this to be added too.

Agreed, but (correct me if I’m wrong Tyler) I read @tbeason as suggesting logsum could itself go into LogExpFunctions.jl. That seems like a good first step. If it makes it into Base as a specialized sum method that’s even better. LogExpFunctions.logsum could be deprecated at that time, or just redefined to call log(sum, x).

2 Likes

Yes that’s what I meant.

1 Like
3 Likes

I think with some more cleanup, this is a good candidate to go in Base. If we make it logabsprod it can be a big help to speed up LinearAlgebra.logabsdet:

julia> x = LowerTriangular(randn(1000,1000));

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> @btime logabsdet($x)
  8.687 μs (0 allocations: 0 bytes)
(-631.836, -1.0)

julia> d = diag(x);

julia> @btime logabsprod($d)
  1.202 μs (0 allocations: 0 bytes)
(-631.836, -1.0)

Currently we require an array, and I don’t know of a way to lazily get a diagonal in Base. But I think it’s clear there are ways to get things connected properly.

EDIT: I had the sign bit reversed, fixed now

1 Like

I suspect that in most performance-sensitive uses of logabsdet there will be a matrix factorization (or other operation) that will dominate the cost. Otherwise, why construct a matrix at all?

Good point @stevengj . The benefit will mostly be for triangular matrices (shown above) and small static matrices, like this:

julia> x = @SArray randn(3,3);

julia> @btime lu($x); # called by logabsdet
  37.268 ns (0 allocations: 0 bytes)

julia> @btime logabsdet($x);
  58.769 ns (0 allocations: 0 bytes)

But I would think it’s always at least as fast as the current implementation, even if the relative improvement is small in many cases.

EDIT: For small collections, log(prod(x)) will still be faster. So maybe large-ish triangular matrices are the main benefit.

diagind together with view?

That adds a surprising amount of overhead:

julia> x = LowerTriangular(randn(1000,1000));

julia> d = diag(x);

julia> @btime logabsprod($d);
  1.202 μs (0 allocations: 0 bytes)

julia> @btime logabsprod(view($x, diagind($x)));
  5.718 μs (0 allocations: 0 bytes)

Could be due to locality, not sure yet

Yeah I guess that’s expected, each diagonal element lives very far away from the previous, so unless your cache lines are very long I guess you’ll miss every time. If you include the timing of extracting d = diag(x) I’d guess the lazy alternative would still be faster.

julia> diagind(x)
1:1001:1000000
1 Like