Fast LogSumExp over 4th dimension

jroon · July 6, 2021, 11:21pm

Hi all,
I’m rewriting a function from R to Julia that reduces an array across the 4th dimension via LogSumExp, and I’m currently trying to optimise the speed. I’ve made a bit of progress through the Tullio package, but I’m sure there is more speed to be got.

Here are my attempts so far with toy data:

using BenchmarkTools
using Tullio

function logsumexp4D1(Arr4d::Array{Float64, 4})
   max_ = maximum(Arr4d, dims = (4))
   lse = max_ + log.( sum( exp.(Arr4d .- max_), dims = (4)) )
   dropdims(lse, dims = 4)
end

function logsumexp4D2(Arr4d::Array{Float64, 4})
   max_ = dropdims( maximum(Arr4d, dims = (4)), dims=4)
   @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
   max_ + log.(sum_exp)
end

function logsumexp4D3(Arr4d::Array{Float64, 4})
   @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
   @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
   max_ + log.(sum_exp)
end

function logsumexp4D4(Arr4d::Array{Float64, 4})
   @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
   @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
   @tullio lse[i,j,k] := max_[i,j,k] + log(sum_exp[i,j,k])
end


A = rand(20, 1000, 80, 5)

julia> @btime logsumexp4D1($A);
 90.804 ms (50 allocations: 109.86 MiB)

julia> @btime logsumexp4D2($A);
 37.490 ms (152 allocations: 48.83 MiB)

julia> @btime logsumexp4D3($A);
 24.381 ms (237 allocations: 48.84 MiB)

julia> @btime logsumexp4D4($A);
 13.715 ms (336 allocations: 36.64 MiB)

julia> logsumexp4D1(A) == logsumexp4D2(A) == logsumexp4D3(A) == logsumexp4D4(A)
false

julia> logsumexp4D1(A) == logsumexp4D2(A) == logsumexp4D3(A)
true

julia> isapprox(logsumexp4D1(A), logsumexp4D4(A))
true

Ok so that is some progress, but I’m a novice and I’m sure folks who know more can beat those times. One thing of note is that for function4 the results are not identical, but are approximately equal (which is good enough). Strangely though, the better performance of function 4 does not translate to R:

Any comments or suggestions welcome
Thanks everyone!

ericphanson · July 6, 2021, 11:31pm

There is an implementation in https://github.com/JuliaStats/LogExpFunctions.jl as well, by the way

dlakelan · July 7, 2021, 4:33am

The 4th version using tullio for everything seems like it’s quite good. fast and succinct. I’m not sure you’re going to get much better.

jroon · July 7, 2021, 6:01am

Hi @ericphanson - thanks for letting me know about that package I was unaware.

So I just tested it too - relatively slow I’m afraid:

julia> @btime logsumexp($A, dims = 4);
  87.023 ms (12 allocations: 36.62 MiB)

jroon · July 7, 2021, 12:57pm

Ah that is surprising its basically my second time writing a Julia function (I’m familiar with numpy.einsum in Python though so that helps).

Hmm… I think then I’ll have to stick with my R version. Unless there is more improvement to be found here, the overhead of moving data from R to Julia which isn’t captured in my R benchmarks above that erases small benefits (Fast 4D argmax - #23 by Non-Contradiction)

Oscar_Smith · July 7, 2021, 1:21pm

You could very easily get a factor of 2 to 3 by using Float32, but that may not be accurate enough for some use cases.

jroon · July 7, 2021, 1:48pm

It might be. I’ve unit testing set up on the R side so if it causes issues I should detect it straight away. How do I do that is it as simply as changing the array declaration code to Float32 ?

dlakelan · July 7, 2021, 2:00pm

Yes, but you clearly are getting the hang of Julia. Perhaps you can move somewhat more of your computation out of R.

Ive been using R since before version 1, back in 1999 or so. And while it’s an impressive piece of software, it doesn’t hold a candle to Julia’s elegance and power and comprehensibility. I think Hadley has particularly borked R comprehensibility through vast amounts of nonstandard evaluation.

I’ve been jumping ship everywhere I can. I’m at that stage where I have to have the docs open all the time, but hey it’s worth it.

stevengj · July 7, 2021, 2:05pm

In general, you won’t get much benefit from Julia from moving a single vectorized function to Julia. See also this post: Comparing Numba and Julia for a complex matrix computation - #3 by stevengj

jroon · July 7, 2021, 4:31pm

No no don’t be fooled… I just happen to know from before about einstein summation notation, and as Tullio uses it that makes Tullio a bit easier to pick up for me… otherwise I have no clue what I’m doing
So I think I’ll link to myself in another thread here to answer as to my motivation with Julia: Fast 4D argmax - #26 by jroon Short version - little time for wholesale conversions but I’m willing to dabble and Rcpp is beyond me. I think if you want to convert users to Julia from R, you aren’t competing with the tidyerse, you are competing with Rcpp, because speeding up particularly slow chunks of code is usually why R users turn to another language.

My R version is not a single vectorised function though, its a short 7 line function. I had alot of success replacing another short R function with Julia before with much help from kind folks in this thread: Fast 4D argmax - #22 by jroon

dlakelan · July 7, 2021, 4:34pm

then perhaps you can post the R code here that you’re trying to replace? Also maybe the context in which this function is called? Perhaps what’s needed is to replace the next higher up level of the computation?

jroon · July 7, 2021, 5:12pm

Sure I can post the R code for the function - although it may not be very readable because its written for speed. But the context is this sits in nested loops. I don’t want to convert any higher up stuff at present time, because its complicated. Besides this function is the slow part and marginal gains here make big difference overall.

library(matrixStats)

logSumExp4D <- function (array4d) {
    dims <- dim(array4d)[1:3]
    dropdim <- dim(array4d)[4]
    len_res <- prod( dims )   # length of results object
    dim( array4d ) <- c( len_res, dropdim ) # reshape array4d from 4D to 2D
    rmaxs <- rowMaxs(array4d) # get max of each row
    res <- rmaxs + log(rowSums( exp( (array4d) - rmaxs ) ) ) # perform logsumexp calc
    res <- array(res, dim = dims) #reshape for returning to calling function
    return(res)
}

Edit: I forgot that the matrixStats package is needed.

Elrod · July 7, 2021, 6:05pm

Try adding using LoopVectorization before defining your functions.
Without:

julia> A = rand(20, 100, 40, 5);

julia> @btime logsumexp4D1($A);
  4.307 ms (39 allocations: 5.49 MiB)

julia> @btime logsumexp4D2($A);
  2.129 ms (270 allocations: 2.46 MiB)

julia> @btime logsumexp4D3($A);
  995.302 μs (268 allocations: 2.46 MiB)

julia> @btime logsumexp4D4($A);
  573.942 μs (310 allocations: 1.85 MiB)

With:

julia> A = rand(20, 100, 40, 5);

julia> @btime logsumexp4D1($A);
  4.335 ms (38 allocations: 5.49 MiB)

julia> @btime logsumexp4D2($A);
  2.035 ms (269 allocations: 2.46 MiB)

julia> @btime logsumexp4D3($A);
  842.333 μs (269 allocations: 2.46 MiB)

julia> @btime logsumexp4D4($A);
  177.153 μs (310 allocations: 1.85 MiB)

For full sized A, there wasn’t a difference because the problem is dominated by memory bandwidth.
For CPUs with smaller cache, you’d likely have to make A even smaller before seeing a difference.

jroon · July 7, 2021, 7:15pm

Thanks for the suggestion @Elrod. I dabbled with it and got some minor speed ups. I also tried the Float32 idea @Oscar_Smith which also made only minor differences as the type conversion has a time cost too (maybe there are faster ways to type convert I dont’ know about?).

My latest function variations are:

using BenchmarkTools
using Tullio
using LogExpFunctions
#using LoopVectorization

function logsumexp4D1(Arr4d::Array{Float64, 4})
    max_ = dropdims( maximum(Arr4d, dims = (4)), dims =4)
    max_ + log.( sum( exp.(Arr4d .- max_), dims = (4)) )
end

function logsumexp4D2(Arr4d::Array{Float64, 4})
    max_ = dropdims( maximum(Arr4d, dims = (4)), dims=4)
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    max_ + log.(sum_exp)
end

function logsumexp4D3(Arr4d::Array{Float64, 4})
    @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    max_ + log.(sum_exp)
end

function logsumexp4D3b(Arr4d::Array{Float32, 4})
    @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    max_ + log.(sum_exp)
end

function logsumexp4D4(Arr4d::Array{Float64, 4})
    @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    @tullio lse[i,j,k] := max_[i,j,k] + log(sum_exp[i,j,k])
end

function logsumexp4D4b(Arr4d::Array{Float32, 4})
    @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    @tullio lse[i,j,k] := max_[i,j,k] + log(sum_exp[i,j,k])
end

Without loopvectorisation:

A = rand(20, 1000, 80, 5)

julia> @btime logsumexp4D1($A);
  98.591 ms (50 allocations: 109.86 MiB)

julia> @btime logsumexp4D2($A);
  40.131 ms (152 allocations: 48.83 MiB)

julia> @btime logsumexp4D3($A);
  25.950 ms (237 allocations: 48.84 MiB)

julia> @btime logsumexp4D3b(convert(Array{Float32}, $A));
  27.879 ms (239 allocations: 54.94 MiB)

julia> @btime logsumexp4D4($A);
  15.049 ms (336 allocations: 36.64 MiB)

julia> @btime logsumexp4D4b(convert(Array{Float32}, $A));
  16.407 ms (339 allocations: 48.84 MiB)

With loopvectorisation:

A = rand(20, 1000, 80, 5)

julia> @btime logsumexp4D1($A);
  89.592 ms (50 allocations: 109.86 MiB)

julia> @btime logsumexp4D2($A);
  36.349 ms (152 allocations: 48.83 MiB)

julia> @btime logsumexp4D3($A);
  26.731 ms (2000237 allocations: 128.18 MiB)

julia> @btime logsumexp4D3b(convert(Array{Float32}, $A));
  28.140 ms (1280240 allocations: 103.77 MiB)

julia> @btime logsumexp4D4($A);
  14.678 ms (2000338 allocations: 115.98 MiB)

julia> @btime logsumexp4D4b(convert(Array{Float32}, $A));
  13.992 ms (1280338 allocations: 97.67 MiB)

The type conversion itself takes:

julia> @btime convert(Array{Float32}, $A);
  5.990 ms (3 allocations: 30.52 MiB)

dlakelan · July 7, 2021, 7:28pm

My guess is if you’re putting this inside nested loops that you’re creating a bunch of these tensors and then contracting them, but inside the loop you’re transferring them to julia and then doing this fast calculation on them. What you most likely need is to juliaify the nested loops, so you’re not transferring data between R and julia inside the nested loop.

for i ...
  for j ... 
   for k ...
     call julia

would become

call julia to loop *and* contract

jroon · July 9, 2021, 12:00pm

@Oscar_Smith So as regards the float32 approach. Alas when I bring the data back into R its failing my unit tests, which use a tolerance of 1.490116e-08.

If I compare in Julia using isapprox, using Float64 or Float32 are ok:

julia> isapprox(logsumexp4D4(A), logsumexp4D4b(Float32.(A)))
true

But if I compare in R its not ok:

> res4 <- julia_eval("logsumexp4D4(A)", "R")
> res5 <- julia_eval("logsumexp4D4b(convert(Array{Float32}, A))", "R")
> all.equal(res4, res5)
[1] "Mean relative difference: 3.293736e-08"

Does anyone know what tolerance Julia’s isapprox() uses for Float32 comparisons ? I’ve not been able to find out via seach.

ericphanson · July 9, 2021, 12:06pm

It uses

julia> sqrt(eps(Float32))
0.00034526698f0

The docstring for isapprox describes it as √eps but it wasn’t entirely clear to me if that was eps of the types or the values, so I checked by doing

julia> @edit isapprox(1f0, 10f0)

and taking a quick look at the code.

jroon · July 9, 2021, 12:22pm

Oh wow. I guess that idea won’t work
Thanks!

jroon · July 29, 2021, 12:19am

Got distracted from this for a while but looking at it again today. I feel as though the tullio could be speeded up with a function but I’m not sure quite how to implement it.

My two best versions of the code are:

function logsumexp4D5(Arr4d::Array{Float64, 4})
    @tullio (max) max_[i,j,k] := @inbounds Arr4d[i,j,k,l]
    @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
    @tullio lse[i,j,k] := max_[i,j,k] + log(sum_exp[i,j,k])
end

function logsumexp4D6(Arr4d::Array{Float64, 4})
    @tullio (max) max_[i,j,k] := @inbounds Arr4d[i,j,k,l]
    @tullio lsum_exp[i,j,k] := log <| exp(Arr4d[i,j,k,l] - max_[i,j,k])
    @tullio lse[i,j,k] := max_[i,j,k] + lsum_exp[i,j,k]
end

…with timings…:

A = rand(20, 1000, 80, 5);

julia> @btime logsumexp4D5($A);
  13.720 ms (336 allocations: 36.64 MiB)

julia> @btime logsumexp4D6($A);
  17.930 ms (296 allocations: 36.63 MiB)

@mcabbott is there a way to combine the second and third calls to tullio'? The third call uses i,j,k in each term so I thought there might be a way to define a function to substitute for log` that could do the log and add back the max_ in a single step, but I cannot quite work out how to do this.

Many thanks!

mcabbott · July 29, 2021, 12:50am

Yes, this ought to be possible, with |> (or <|, they are equivalent) you can use an underscore to make an anonymous function. However, this is done in a messy way internally, which makes it unnecessarily hard for LoopVectorization to understand, so I get an error unless I disable that (in logsumexp4D8).

A smaller thing you can do is to re-use the array max_, which helps a little:

julia> @btime logsumexp4D5($A);
  7.480 ms (151 allocations: 36.63 MiB)

julia> @btime logsumexp4D6($A);
  8.637 ms (151 allocations: 36.63 MiB)

julia> function logsumexp4D7(Arr4d::Array{Float64, 4})
           @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]  # @inbounds does nothing
           @tullio sum_exp[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k])
           @tullio max_[i,j,k] = max_[i,j,k] + log(sum_exp[i,j,k])  # can re-use max_
       end;

julia> @btime logsumexp4D7($A);
  6.305 ms (149 allocations: 24.42 MiB)

julia> function logsumexp4D8(Arr4d::Array{Float64, 4})
           @tullio (max) max_[i,j,k] := Arr4d[i,j,k,l]
           @tullio _[i,j,k] := exp(Arr4d[i,j,k,l] - max_[i,j,k]) |> log(_) + max_[i,j,k]  avx=false
       end
logsumexp4D8 (generic function with 1 method)

julia> @btime logsumexp4D8($A);
  10.622 ms (104 allocations: 24.42 MiB)

Topic		Replies	Views
Speeding up my logsumexp function Performance performance , array , tullio , loopvectorization	35	5062	October 21, 2022
Tullio seems two times slower than basic LoopVectorization Performance question , tullio , loopvectorization	3	1046	April 10, 2022
Fast 4D argmax Performance tullio	26	2099	April 6, 2021
Speed comparison matrix multiplication in Julia Performance question , linearalgebra , optimization , tullio	45	3192	August 19, 2021
Fast logsumexp Performance benchmark	14	7560	June 26, 2019

Fast LogSumExp over 4th dimension

Related topics