Hello,

I am writing a statistical spatio-temporal model for clustering, using the target values Y_{it} of the units, together with spatial coordinates {\bf{s}}_i and covariates {\bf{x}}_{it}. The choice of what to include in the clustering process can be decided trough some arguments, i.e. sPPM and xPPM which can be true or false saying if to include or ignore the spatial and covariates information, respectively.

When I am fitting using only the target values Y_i I get good performance (I even beat a competitor model written in C). However, as soon as I include spatial information the performance drops drastically, and the execution becomes way slower.

The fit function is not subdivided into subfunctions but just in βsectionsβ (like update this parameter, now this other, etc), therefore to analyse better the performances I used the TimerOutputs package, where I got these results (fitting the model for 5k iterations on 60 units and 12 time instants):

```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Time Allocations
βββββββββββββββββββββββ ββββββββββββββββββββββββ
Tot / % measured: 629s / 99.5% 497GiB / 100.0%
Section ncalls time %tot avg alloc %tot avg
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
rho 60.0k 388s 62.0% 6.47ms 316GiB 63.7% 5.40MiB
sPPM 4 48.0M 275s 44.0% 5.74ΞΌs 230GiB 46.4% 5.03KiB
sPPM 5 6.73M 38.2s 6.1% 5.68ΞΌs 30.8GiB 6.2% 4.80KiB
gamma 60.0k 236s 37.7% 3.93ms 180GiB 36.3% 3.07MiB
sPPM 2 19.9M 185s 29.5% 9.26ΞΌs 157GiB 31.7% 8.28KiB
sPPM 3 3.30M 12.7s 2.0% 3.86ΞΌs 6.59GiB 1.3% 2.09KiB
sPPM 1 3.30M 2.65s 0.4% 802ns 302MiB 0.1% 96.0B
theta 60.0k 587ms 0.1% 9.79ΞΌs 21.8MiB 0.0% 381B
sigma2h 60.0k 274ms 0.0% 4.57ΞΌs 89.2MiB 0.0% 1.52KiB
eta1 5.00k 200ms 0.0% 40.0ΞΌs 0.00B 0.0% 0.00B
muh 60.0k 170ms 0.0% 2.83ΞΌs 0.00B 0.0% 0.00B
phi1 5.00k 168ms 0.0% 33.7ΞΌs 24.3MiB 0.0% 4.98KiB
lambda2 5.00k 54.7ms 0.0% 10.9ΞΌs 8.40MiB 0.0% 1.72KiB
tau2 60.0k 53.7ms 0.0% 895ns 0.00B 0.0% 0.00B
phi0 5.00k 53.2ms 0.0% 10.6ΞΌs 5.95MiB 0.0% 1.22KiB
beta 60.0k 33.3ms 0.0% 555ns 0.00B 0.0% 0.00B
alpha 5.00k 31.9ms 0.0% 6.38ΞΌs 7.32MiB 0.0% 1.50KiB
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```

The problem as said should be the sPPM parts (since without them everything runs quickly). For example the βsPPM 4β section takes a lot of time and memory, however are just these lines of code:

```
@timeit to "sPPM 4" begin
if sPPM
# extract the units of label kk
indexes = findall(jj -> rho_tmp[jj]==kk, 1:n)
# take their spatial coordinates
s1n = @view sp1[indexes]
s2n = @view sp2[indexes]
# add the cohesion to the log posterior probability variable
lpp += spatial_cohesion(spatial_cohesion_idx, s1n, s2n, sp_params, lg=true, M=M_dp)
end
end
```

I was thinking that the performance drop could be due to

- the amount of time it is called (which I think it is not negotiable: it loops on units, time, subclusters, but itβs how the fitting algorithm works, and itβs the same of what the other C model does), so maybe there is nothing much to do
- the fact that i preallocated just the important variables, like the ones saving the iterates of the model (so for example I set
`Si_iter = ones(Int64,n,T)`

,`theta_iter = zeros(T)`

and all the other parameters in this way before the start of the fitting algorithm). But instead I did not preallocate all the βtemporaryβ variables such as the indexes or the rho_tmp. So maybe I should add also them to recycle their use at every iteration - the βindirectβ path of the computation, since I call the
`spatial_cohesion`

function, which in turn calls the right cohesion function based on the passed`spatial_cohesion_idx`

; and this could add some overhead, since for example in the C competitor (which stays reasonably quick also with spatial information) it is directly called a certain cohesion function (without this indirect but more flexible idea).

For completeness the function definition is

```
function spatial_cohesion(idx::Real, s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, sp_params; lg::Bool, M::Real)
# idx should reasonably be ::Int but I'm calling this function from R (using JuliaConnectoR)
# where everything is converted into Float, that's why i put ::Real
idx==1.0 && return cohesion1(s1,s2,sp_params...,lg=lg,M=M)
idx==2.0 && return cohesion2(s1,s2,sp_params...,lg=lg,M=M)
idx==3.0 && return cohesion3_4(s1,s2,sp_params...,lg=lg,M=M,Cohesion=3)
idx==4.0 && return cohesion3_4(s1,s2,sp_params...,lg=lg,M=M,Cohesion=4)
idx==5.0 && return cohesion5(s1,s2,sp_params...,lg=lg,M=M)
idx==6.0 && return cohesion6(s1,s2,sp_params...,lg=lg,M=M)
end
```

where I use the splat operator since each cohesion has a different number and types of parameters. Maybe it could also be the reason of some overhead. And still for example

```
function cohesion3_4(s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, mu_0::AbstractVector{Float64}, k0::Real, v0::Real, Psi::Matrix{Float64}; Cohesion::Int, lg::Bool, M::Real=1.0)
sdim = length(s1)
sp = [s1 s2]
sbar = vec(mean(sp, dims=1))
S = zeros(2,2)
for i in 1:sdim
vtemp = sp[i,:] - sbar
S += (vtemp)*(vtemp)'
end
# compute updated parametersfor cohesion 3
kn = k0+sdim
vn = v0+sdim
Psi_n = Psi + S + (k0*sdim)/(k0+sdim)*(sbar-mu_0)*(sbar-mu_0)'
if Cohesion == 3
out = -sdim * log(Ο) + G2a(0.5 * vn, true) - G2a(0.5 * v0, true) + 0.5 * v0 * logdet(Psi) - 0.5 * vn * logdet(Psi_n) + log(k0) - log(kn)
return lg ? out : exp(out)
end
# or for cohesion 4
knn = kn+sdim
vnn = vn+sdim
mu_n = (k0*mu_0 + sdim*sbar)/(k0+sdim)
Psi_nn = Psi_n + S + (kn*sdim)/(kn+sdim)*(sbar-mu_n)*(sbar-mu_n)'
if Cohesion == 4
out = -sdim * log(Ο) + G2a(0.5 * vnn, true) - G2a(0.5 * vn, true) + 0.5 * vn * logdet(Psi_n) - 0.5 * vnn * logdet(Psi_nn) + log(kn) - log(knn)
return lg ? out : exp(out)
end
end
```

While the `@view`

s usage were just a simple idea to maybe gain some performance, but apparently they are not the main problem.

Itβs my first time writing such βcomplexβ and really-performance-oriented functions (since for example when moving on fitting with more serious parameters, e.g. 100k iterations, these models can literally take hours to run) so Iβm not really sure on how to move and any advice is therefore welcome.