I have the following code, which tries to evaluate the logpdf of multiple categorical variables. It works, but it is apparently one of the bottlenecks in my code. Really, all the code does is: for each datapoint xs[i], select the ith vector from a Vector{Vector{Float64}}, index into it using xs[i], and add the value to a running sum.
function logpdf(cat::Categorical{T}, xs::Vector{Int}) where T
    length(xs) != length(cat.logp) && throw(DomainError("lengths do not match"))
    p = zero(T)
    @inbounds for i in axes(xs, 1)
        logp = cat.logp[i]  # cat.logp isa Vector{Vector{Float64}}
        x = xs[i]
        p += logp[x]
    end
    return p
end
Right now on my M1 I’m getting roughly
5.708 ns (0 allocations: 0 bytes)
I tried LoopVectorization by tacking on a @turbo, but I'm running into this error:
UndefVarError: `logp` not defined in `GenSPN`
Is it possible to make the code faster? Note I cannot use a Matrix.
I think adding @simd here might allow the compiler to perform more optimizations, at the cost of a slightly different result since floating point addition is not associative (Performance Tips · The Julia Language).
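A minimal sketch of that suggestion, with a hypothetical stand-in for the poster's Categorical struct (only the logp field is assumed from the question), and the function renamed logpdf_simd to avoid clashing with the original:

# Hypothetical stand-in for the struct in the question.
struct Categorical{T}
    logp::Vector{Vector{T}}
end

function logpdf_simd(cat::Categorical{T}, xs::Vector{Int}) where T
    length(xs) != length(cat.logp) && throw(DimensionMismatch("lengths do not match"))
    p = zero(T)
    @inbounds @simd for i in eachindex(xs)
        p += cat.logp[i][xs[i]]  # @simd allows the sum to be reassociated
    end
    return p
end

Whether the compiler can actually vectorize the gather here is up to LLVM; @simd only grants it permission.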
Then I’d say your pure Julia code is already close to optimal (although I’d also remove @inbounds to favor safety over speed), especially given your timings in the ns range.
Well, if you replace axes(xs, 1) with eachindex(xs, cat.logp), then at least two of the indexing operations are safe (use axes for indexing along dimensions of higher-dimensional arrays, not vectors). But this index is not generally safe: logp[x]. Is it guaranteed to be safe somehow?
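For concreteness, a sketch of that variant (reusing the hypothetical Categorical stand-in from above); eachindex(xs, cat.logp) throws a DimensionMismatch when the two containers disagree, which also replaces the manual length check:

function logpdf_each(cat::Categorical{T}, xs::Vector{Int}) where T
    p = zero(T)
    for i in eachindex(xs, cat.logp)  # safe i for both containers
        logp = cat.logp[i]
        p += logp[xs[i]]  # still bounds-checked: xs[i] may exceed length(logp)
    end
    return p
end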
The fact that logp is a new vector on each iteration makes this hard (or impossible) to vectorize: you want to load consecutive, or at least closely spaced, samples, but the different vectors are probably not close in memory. If each logp has identical size, maybe something can be done.
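A sketch of that last idea, assuming every logp has identical length k so the vectors can be concatenated into one contiguous buffer (logpdf_flat and the flat layout are illustrations, not from the thread):

function logpdf_flat(flat::Vector{Float64}, k::Int, xs::Vector{Int})
    p = 0.0
    @inbounds @simd for i in eachindex(xs)
        p += flat[(i - 1) * k + xs[i]]  # observation i lives in flat[(i-1)k+1 : i*k]
    end
    return p
end

flat = reduce(vcat, cat.logp)  # only valid if all inner vectors have length k

This keeps every lookup in one contiguous array, which is the layout SIMD-friendly code wants.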
Why can’t you use a Matrix?
From what I understand, your Categorical is not the one from Distributions but some struct holding a Vector{Vector{Float64}} giving the log-probabilities to be applied for each observation. You could bring these to a common length by padding with -Inf wherever the index is outside the support of the distribution:
nmax = maximum(length.(cat.logp))  # longest support among the observations
mat = [get(cat.logp[i], j, -Inf) for i in eachindex(cat.logp), j in 1:nmax]
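A sketch of how the padded matrix might then be used (logpdf_mat is a hypothetical name; note that with this layout mat[i, xs[i]] jumps between columns, so building the transposed matrix with categories along the first dimension would be friendlier to Julia's column-major storage):

function logpdf_mat(mat::Matrix{Float64}, xs::Vector{Int})
    p = 0.0
    @inbounds @simd for i in eachindex(xs)
        p += mat[i, xs[i]]  # assumes 1 <= xs[i] <= nmax; padding yields -Inf beyond the original support
    end
    return p
end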
Taking a step back: the function is pretty short and simple, and it currently takes 5 ns with zero allocations. Do you have reason to believe that is unreasonably slow for what it does? E.g., do you have an implementation in another language that is materially faster?