In some statistical applications (and probably many others) we sometimes have to evaluate sum(log.(x)) for some large vector x. The obvious improvement to this is sum(log, x), but this still requires computing log for each element of an array, which can be expensive.
Playing around with this a bit, I came up with this:
const logtwo = log(2.0)

function sumlog(x::AbstractArray{T}) where {T}
    sig = one(T)
    ex = zero(Int64)
    for xj in x
        if sig > one(T)
            sig *= 0.5 * significand(xj)
            ex += exponent(xj) + 1 # make up the missing factor of 2 here
        else
            sig *= significand(xj)
            ex += exponent(xj)
        end
    end
    sig + logtwo * ex
end
This comes from two ideas:
1. x == ldexp(significand(x), exponent(x)), and
2. keeping the product balanced around one, to avoid underflow and overflow.
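Spelling the first idea out (my own quick check, not part of the original post): taking logs of that identity gives log(x) == log(significand(x)) + exponent(x) * logtwo, which is exactly what the loop accumulates in pieces:

julia> x = 12.345;

julia> x == ldexp(significand(x), exponent(x))
true

julia> log(x) ≈ log(significand(x)) + exponent(x) * log(2.0)
true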
But the result often has an off-by-one error, and I'm not sure where it comes from.
Also, I'm guessing the sig > one(T) check in the loop might be slowing things down, so maybe this could be faster still. We do need something to offset the significands, though: each one is in [1,2), so the running product would otherwise overflow for a long enough vector.
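(To make the overflow point concrete with a deterministic example of my own: a product of factors in [1,2) blows past floatmax(Float64) ≈ 1.8e308 after a couple thousand elements.)

julia> prod(fill(1.5, 3000))   # 1.5^3000 ≈ 10^528, far past floatmax(Float64)
Inf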
This isn’t a critical thing, but speeding this up could be really nice.
Oh, that’s good! And we even have a lower bound on the number of iterations before we really need to check. So we can let it spin for a while with no branching, then check once and make a bigger correction. Yeah, I like that.
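Roughly this, maybe (an untested sketch of that idea; sumlog_blocked and the block size are my own scratch choices): each significand is below 2, so over B steps sig grows by less than a factor of 2^B, and with B = 1000 an accumulator that starts below 2 cannot reach floatmax(Float64) ≈ 2^1024 between checks.

# rough sketch (hypothetical): renormalize once per block instead of per element
function sumlog_blocked(x::AbstractVector{Float64})
    sig = 1.0
    ex = 0
    B = 1000                      # 2 * 2^1000 < floatmax(Float64) ≈ 2^1024
    i = firstindex(x)
    while i <= lastindex(x)
        stop = min(i + B - 1, lastindex(x))
        for j in i:stop           # branch-free inner loop
            sig *= significand(x[j])
            ex += exponent(x[j])
        end
        ex += exponent(sig)       # one bigger correction per block
        sig = significand(sig)
        i = stop + 1
    end
    log(sig) + logtwo * ex
end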
Of course, I thought it might be something obvious: the last line should be log(sig) + logtwo * ex, not sig + logtwo * ex. And since sig always lands in [1,2), dropping the log shifts the result by sig - log(sig), which lies between 1 and about 1.31. That explains the "off-by-one" I was seeing.
Thanks to you both! I’ll make some adjustments and post the result here soon.
julia> function sumlog(x::AbstractArray{T}) where {T}
           sig = one(T)
           ex = zero(exponent(one(T)))
           bound = floatmax(T) / 2
           for xj in x
               sig *= significand(xj)
               ex += exponent(xj)
               if sig > bound
                   (a, b) = (significand(sig), exponent(sig))
                   sig = a
                   ex += b
               end
           end
           log(sig) + logtwo * ex
       end
sumlog (generic function with 1 method)
julia> using BenchmarkTools

julia> x = 10 * rand(100);
julia> @btime sumlog($x)
127.330 ns (0 allocations: 0 bytes)
129.854
julia> @btime sum(log, $x)
680.462 ns (0 allocations: 0 bytes)
129.854
julia> sumlog(x) - sum(log ∘ big, x)
2.83327e-15
Oh, but 100 elements is never enough to overflow, so we should also check larger arrays:
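Something like this, say (an illustrative check of mine, with an arbitrary size and no timings; the running significand product needs rescaling roughly every thousand elements, so this exercises the branch many times):

julia> x = 10 * rand(1_000_000);

julia> @assert sumlog(x) ≈ sum(log, x)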
This looks great. Are there downsides to making this the default? Minimally just sum(::typeof(log), x::Array{<:Union{Float32, Float64}}), although perhaps some mapreduce(log, +, x) methods could benefit too.
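(For concreteness, the minimal version would be something like the line below, as a sketch rather than a final design; note the Union has to go in the type parameter bound, or the method only matches arrays whose element type is itself a Union:)

# hypothetical method; as discussed further down, defining this outside Base would be piracy
Base.sum(::typeof(log), x::Array{<:Union{Float32, Float64}}) = sumlog(x)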
One potential concern is hardware with a very fast log implementation but very slow branches. As I type that, GPUs come to mind, though I haven't done enough GPU work to know whether a new method for sum(::typeof(log), ::AbstractArray) would end up being unintentionally used in GPU kernels.
Yes, I presume this won't work on a GPU, although perhaps similar tricks might. I would worry a little about introducing dispatch ambiguities, e.g. against sum(f, A::CuArray), but perhaps there's a nice place in the mapreduce chain to attach this cleanly. Maybe sum(log, A; dims) deserves some thought too.
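(One cheap way to prototype the dims question without touching the mapreduce machinery at all, using my own scratch name, and surely not how Base would actually wire it:)

# hypothetical wrapper: apply the scalar reduction slice-by-slice;
# mapslices keeps the reduced dimensions as size 1, matching sum(...; dims)
sumlog_dims(x::AbstractArray; dims=:) =
    dims === (:) ? sumlog(x) : mapslices(sumlog, x; dims=dims)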
Seems accurate to me on quick examples. But it could be worthwhile to think up awful cases, for speed as well as accuracy, since they change how often you take the branch, and IIRC the speed of log itself varies with the size of its argument.
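(A couple of awful cases that come to mind, my examples rather than anything tested above: exponent and significand throw DomainError on zeros and non-finite values, where log is perfectly well defined:)

sum(log, [1.0, 0.0])    # -Inf
sumlog([1.0, 0.0])      # throws DomainError: exponent(0.0) is undefined
sum(log, [1.0, Inf])    # Inf
sumlog([1.0, Inf])      # throws DomainError: exponent(Inf) is undefined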
I think these examples still aren't taking the branch often; replacing it with a branch-free ifelse makes things much slower.
A function could go in a package, but defining Base.sum(::typeof(log), x::Array{Float64}) would be piracy. Since log itself is defined in Base, it doesn’t seem crazy for something like this to be added too.
Agreed, but (correct me if I'm wrong, Tyler) I read @tbeason as suggesting logsum could itself go into LogExpFunctions.jl. That seems like a good first step. If it makes it into Base as a specialized sum method, that's even better; LogExpFunctions.logsum could be deprecated at that point, or just redefined to call sum(log, x).
I think with some more cleanup, this is a good candidate to go in Base. If we make it logabsprod it can be a big help to speed up LinearAlgebra.logabsdet:
julia> using LinearAlgebra

julia> using BenchmarkTools

julia> x = LowerTriangular(randn(1000,1000));
julia> @btime logabsdet($x)
8.687 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
julia> d = diag(x);
julia> @btime logabsprod($d)
1.202 μs (0 allocations: 0 bytes)
(-631.836, -1.0)
Currently we require an array, and I don’t know of a way to lazily get a diagonal in Base. But I think it’s clear there are ways to get things connected properly.
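(For reference, the logabsprod I'm benchmarking is essentially sumlog with sign tracking, along these lines; a reconstruction, and like sumlog it assumes nonzero, finite entries:)

function logabsprod(x::AbstractArray{T}) where {T<:AbstractFloat}
    sig = one(T)
    ex = zero(exponent(one(T)))
    sgn = one(T)
    bound = floatmax(T) / 2
    for xj in x
        sgn = flipsign(sgn, xj)          # accumulate the sign of the product
        sig *= abs(significand(xj))      # significand carries xj's sign
        ex += exponent(xj)
        if sig > bound
            ex += exponent(sig)
            sig = significand(sig)
        end
    end
    (log(sig) + logtwo * ex, sgn)
end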
I suspect that in most performance-sensitive uses of logabsdet there will be a matrix factorization (or other operation) that will dominate the cost. Otherwise, why construct a matrix at all?
Yeah, I guess that's expected: each diagonal element lives very far from the previous one, so unless your cache lines are very long you'll miss every time. If you include the time to extract d = diag(x), I'd guess the lazy alternative would still be faster.
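(For what it's worth, LinearAlgebra does have diagind, so a lazy diagonal is just a view; an untested sketch:)

using LinearAlgebra

# lazy diagonal: a strided view along diagind(x), avoiding the copy diag makes;
# the stride is n + 1 elements for an n×n matrix, hence the cache misses
d_lazy = view(x, diagind(x))
logabsprod(d_lazy)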