Speed up loop for logpdf

I have the following code which tries to evaluate the logpdf of multiple categorical variables. It is ok, but apparently it is one of the bottlenecks in my code. Really, all the code is doing is for each datapoint xs[i], select the ith vector from a Vector{Vector{Float64}}, index into it using xs[i], and add the value to a sum.

function logpdf(cat::Categorical{T}, xs::Vector{Int}) where T
  length(xs) != length(cat.logp) && throw(DomainError("lengths do not match"))
  p = zero(T)
  @inbounds for i in axes(xs,1)
    logp = cat.logp[i] # cat.logp isa Vector{Vector{Float64}}
    x = xs[i]
    p += logp[x]
  end
  return p
end

Right now on my M1 I’m getting roughly

  5.708 ns (0 allocations: 0 bytes)

I tried LoopVectorization.jl by tacking on @turbo, but I'm running into this error:

UndefVarError: `logp` not defined in `GenSPN`

Is it possible to make the code faster? Note I cannot use a Matrix.
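For anyone wanting to reproduce this: the snippet doesn't show the struct itself, so here is a minimal self-contained version (the `Categorical` definition is assumed from the field access above; the real one lives in the poster's package):

```julia
# Assumed minimal definition; only the `logp` field appears in the question.
struct Categorical{T}
    logp::Vector{Vector{T}}  # one log-probability vector per datapoint
end

function logpdf(cat::Categorical{T}, xs::Vector{Int}) where T
    length(xs) != length(cat.logp) && throw(DomainError(xs, "lengths do not match"))
    p = zero(T)
    @inbounds for i in axes(xs, 1)
        logp = cat.logp[i]  # select the ith log-probability vector
        x = xs[i]           # the observed category for datapoint i
        p += logp[x]
    end
    return p
end

cat = Categorical([log.([0.2, 0.8]), log.([0.5, 0.3, 0.2])])
xs = [2, 1]
logpdf(cat, xs)  # log(0.8) + log(0.5)
```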

I’m confused. If xs is a Vector{Float64}, then x is a Float64, and then it can’t be used as an index. So this code looks like it shouldn’t work.

Also, this comment cannot be correct:

You are right, and it's fixed now. I should have double-checked before making impromptu edits.

I think adding @simd here might allow the compiler to perform more optimizations, at the cost of a slightly different result since floating point addition is not associative (Performance Tips · The Julia Language).
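A sketch of what that would look like (untested; a NamedTuple stands in here for whatever struct actually holds the `logp` field):

```julia
# Same reduction with @simd, which permits the compiler to reorder
# the floating-point additions.
function logpdf_simd(cat, xs::Vector{Int})
    p = 0.0
    @inbounds @simd for i in eachindex(xs)
        p += cat.logp[i][xs[i]]
    end
    return p
end

cat = (logp = [log.([0.2, 0.8]), log.([0.6, 0.4])],)
logpdf_simd(cat, [2, 1])  # log(0.8) + log(0.6)
```

Note that the double indirection `cat.logp[i][xs[i]]` may well prevent any actual vectorization, so identical timings would not be surprising.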

One problem I see is that only the second indexing expression is guaranteed to be inbounds.

I would remove the @inbounds. That probably won’t help your performance though.

LoopVectorization.jl probably doesn't like the jumping around in memory; vectorization is probably hard to achieve here.


I did try adding @simd in hopes of a speedup, but it seems I get identical timings.


Then I’d say your pure Julia code is already close to optimal (although I’d also remove @inbounds to favor safety over speed), especially given your timings in the ns range.


I can’t remember exactly, but I think @inbounds got me from 7 ns to 5 ns. :sweat_smile:

Well, if you replace axes(xs, 1) with eachindex(xs, cat.logp), then at least two indices are safe (use axes for indexing along dimensions of higher-dimensional arrays, not vectors). But this index is not generally safe: logp[x]. Is it guaranteed to be safe somehow?

The fact that logp is a new vector for each iteration makes it hard (or impossible) to vectorize: you want to extract consecutive, or at least closely spaced, samples, but different vectors are probably not close in memory. If each logp has identical size, maybe something can be done.
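To illustrate the "identical size" idea: if every inner vector had the same length `k`, the data could be flattened into one contiguous vector and indexed with a stride, which is much friendlier to vectorization. A sketch, with the flat layout and function name made up here:

```julia
# flat[(i - 1) * k + x] holds what cat.logp[i][x] held before,
# assuming every inner vector has length k.
function logpdf_flat(flat::Vector{Float64}, k::Int, xs::Vector{Int})
    p = 0.0
    @inbounds @simd for i in eachindex(xs)
        p += flat[(i - 1) * k + xs[i]]
    end
    return p
end

flat = vcat(log.([0.2, 0.8]), log.([0.6, 0.4]))  # two distributions, k = 2
logpdf_flat(flat, 2, [2, 1])  # log(0.8) + log(0.6)
```

The gather is still non-contiguous, but at least all reads come from one array.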


Other than assertion checks outside of the function, no. I think I’ve been convinced to remove @inbounds since it doesn’t seem worth it.

If you use eachindex as suggested, you can try @inbounds on the first two indexing expressions, though it may make no difference:

for i in eachindex(xs, cat.logp)
    @inbounds logp = cat.logp[i]
    @inbounds x = xs[i]
    p += logp[x]
end

What is the typical length of cat.logp and of each of its elements?

Why can’t you use a Matrix?
From what I understand, your Categorical is not from Distributions.jl but some struct holding a Vector{Vector{Float64}} giving the log-probabilities to be applied for each observation. You could just bring these to a common length by padding with -Inf when the index is outside the support of the distribution:

nmax = maximum(length.(cat.logp))
mat = [get(cat.logp[i], j, -Inf) for i in eachindex(cat.logp), j in 1:nmax]
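With that padded matrix the sum becomes a gather along rows. A self-contained sketch (the log-probability data is written out directly here, and `logpdf_mat` is a name made up for illustration):

```julia
# `cat.logp` from the thread, spelled out as plain data:
logps = [log.([0.2, 0.8]), log.([0.5, 0.3, 0.2])]

nmax = maximum(length.(logps))
mat = [get(logps[i], j, -Inf) for i in eachindex(logps), j in 1:nmax]

# mat[i, xs[i]] now holds what logps[i][xs[i]] held before;
# a padded -Inf is only hit if xs[i] falls outside distribution i's support.
logpdf_mat(mat, xs) = sum(mat[i, xs[i]] for i in axes(mat, 1))

logpdf_mat(mat, [2, 1])  # log(0.8) + log(0.5)
```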

Taking a step back: the function is pretty short and simple, and currently takes 5 ns with zero allocations. Do you have reason to believe that is unreasonably slow for what it does? E.g., do you have an implementation in another language that is materially faster?
