Hi,

First time ever writing julia code. I’ve spend some time trying to optimise this function, mainly making it type-stable, and separate some precomputations. But maybe this is not the best way to go.

The resulting code is a lot faster than it’s python `mpmath`

equivalent (about 100x faster, thanks a lot julia for natively supporting BigFloats!), but i wander if i can squeeze more out of it: this code gets integrated into a loss function, and will therefore be evaluated a lot.

Do your advanced eyes see obvious performance bottleneck in the following code that my do not ? Especially, i wander if the `CartesianIndex.()`

syntax is not an issue here.

Here’s the code :

```
# A structure for precomputations.
struct PreComp
BINS
LAGUERRE
FACTS
PREC
MAX_SUM_OF_M
function PreComp(precision, m)
setprecision(precision)
m = big(m)
BINS = zeros(eltype(m),(m,m))
FACTS = zeros(eltype(m),m)
LAGUERRE = zeros(BigFloat,(m,m))
for i in 1:m
FACTS[i] = factorial(i-1)
end
for i in 1:m, j in 1:m
BINS[i,j] = binomial(i-1,j-1)
LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
end
new(BINS,LAGUERRE,FACTS,precision,m)
end
end
const P = PreComp(2000,50)
# The main function
function get_coefficients(alpha, scales, m)
# alpha must be an array with size (n,)
# scales must be an array with size (n,d)
# max_p must be a Tuple of integers with size (d,)
# Trim down null alphas values:
are_pos = alpha .!= 0
scales = scales[are_pos,:]
alpha = alpha[are_pos]
# Allocates ressources, construct the simplex expression of the scales and the indexes.
coefs = zeros(eltype(alpha),m)
n = size(scales)[1]
d = ndims(coefs)
kappa = zeros(eltype(alpha),size(coefs))
mu = deepcopy(kappa)
S = scales ./ (big(1.0) .+ sum(scales,dims=2))
I = CartesianIndices(coefs)
for k in I
if k == I[1]
# Edge case
kappa[1] = sum(alpha .* log.(big(1.0) .- sum(S,dims=2)))
coefs[1] = mu[1] = exp(kappa[1])
else
# Indices and organisation
tk = Tuple(k)
k_arr = [ki for ki in tk]
degree = findfirst(tk .!= 1)
sk = sum(k_arr)
k_minus_one_in_deg = copy(k_arr)
k_minus_one_in_deg[degree] -= 1
# Main Loop
kappa[k] = P.FACTS[sk-d] * sum(alpha .* prod(S .^ transpose(big.(k_arr .- 1)),dims=2))
for j in CartesianIndices(k)
j_arr = [i for i in Tuple(j)]
if j[degree] < k[degree]
mu[k] += mu[j] * kappa[k - j + I[1]] * prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)])
end
coefs[k] += mu[j] * prod(P.LAGUERRE[CartesianIndex.(k_arr,j_arr)])
end
end
end
coefs *= sqrt(big(2.0))^d
return coefs
end
```

Here are some benchmark of applications :

```
# Some benchmarks :
setprecision(2000)
using BenchmarkTools
println("1D test")
alpha = [10, 10, 10]
scales = [0.5; 0; 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (20)
@btime coefs1 = get_coefficients(alpha, scales,m)
println("2D test")
alpha = [10, 10]
scales = [0.5 0.1; 0.1 0.5]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10)
@btime coefs2 = get_coefficients(alpha, scales,m)
println("3D test")
alpha = [10, 10, 10, 10]
scales = [0.5 0.0 0.0; 0.0 0.5 0.0; 0.0 0.0 0.5; 1 1 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10,10)
@btime coefs3 = get_coefficients(alpha, scales,m)
println("4D test")
alpha = [10, 10, 5, 8, 9, 10, 10]
scales = [0.5 0.0 0.0 0.0; 0.0 0.5 0.0 0.0; 0.0 0.0 0.5 0.0; 1 1 1 1; 2 2 2 2; 2 0 2 0; 0 2 0 2]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (5,5,5,5)
@btime coefs4 = get_coefficients(alpha, scales,m)
```

The corresponding output on my machine is :

```
1D test
500.500 μs (4160 allocations: 551.06 KiB)
2D test
7.774 ms (60988 allocations: 7.47 MiB)
3D test
475.745 ms (3869963 allocations: 459.11 MiB)
4D test
194.400 ms (1380041 allocations: 162.29 MiB)
```

What do you think ?