I have the following code which tries to evaluate the logpdf of multiple categorical variables. It is ok, but apparently it is one of the bottlenecks in my code. Really, all the code is doing is for each datapoint xs[i], select the ith vector from a Vector{Vector{Float64}}, index into it using xs[i], and add the value to a sum.

function logpdf(cat::Categorical{T}, xs::Vector{Int}) where T
length(xs) != length(cat.logp) && throw(DomainError("lengths do not match"))
p = zero(T)
@inbounds for i in axes(xs,1)
logp = cat.logp[i] # cat.logp isa Vector{Vector{Float64}}
x = xs[i]
p += logp[x]
end
return p
end

Right now on my M1 I’m getting roughly

5.708 ns (0 allocations: 0 bytes)

I tried LoopVectorization by tacking a @turbo but am running into this error:

UndefVarError: `logp` not defined in `GenSPN`

Is it possible to make the code faster? Note I cannot use a Matrix.

I think adding @simd here might allow the compiler to perform more optimizations, at the cost of a slightly different result since floating point addition is not associative (Performance Tips · The Julia Language).

Then I’d say your pure Julia code is already close to optimal (although I’d also remove @inbounds to favor safety over speed), especially given your timings in the ns range.

Well if you replace axes(xs, 1) with eachindex(xs, cat.logp), then at least two indices are safe (use axes for indexing along dimensions of higher-dimensional arrays, not vectors). But this index is not generally safe: logp[x] . Is it guaranteed to be safe somehow?

The fact that logp is a new vector for each iteration makes it hard (or impossible) to vectorize, you want to extract consecutive, or at least closely spaced samles, but different vectors are probably not close in memory. If each logp has identical size, maybe something can be done.

Why can’t you use a Matrix?
From what I understand, your Categorical is not from Distributions but some struct holding a Vector{Vector{Float64}} giving the log-probabilities to be applied for each observation. You could just bring these to a common length by including -Inf when the index is out of the support of the distribution:

nmax = maximum(length.(cat.logp))
mat = [get(cat.logp[i], j, -Inf) for i in eachindex(cat.logp), j in 1:nmax]

Taking a step back, the function is pretty short and simple, and takes 5ns with zero allocations currently, do you have reason to believe that that is unreasonably slow for what it does? Eg do you have an implementation in another language that is materially faster?