Improving the performance of a spatial clustering model

Hello,
I am writing a statistical spatio-temporal clustering model that uses the target values Y_{it} of the units together with the spatial coordinates {\bf{s}}_i and covariates {\bf{x}}_{it}. What to include in the clustering process can be chosen through some arguments, namely sPPM and xPPM, booleans that say whether to include or ignore the spatial and covariate information, respectively.

When I fit using only the target values Y_{it} I get good performance (I even beat a competitor model written in C). However, as soon as I include the spatial information the performance drops drastically and the execution becomes much slower.

The fit function is not subdivided into subfunctions but just into "sections" (update this parameter, now this other one, etc.), so to analyse the performance I used the TimerOutputs package and got these results (fitting the model for 5k iterations on 60 units and 12 time instants):

───────────────────────────────────────────────────────────────────────
                               Time                    Allocations      
                      ───────────────────────   ────────────────────────
   Tot / % measured:        629s /  99.5%            497GiB / 100.0%    

 Section      ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────
  rho          60.0k     388s   62.0%  6.47ms    316GiB   63.7%  5.40MiB
    sPPM 4     48.0M     275s   44.0%  5.74μs    230GiB   46.4%  5.03KiB
    sPPM 5     6.73M    38.2s    6.1%  5.68μs   30.8GiB    6.2%  4.80KiB
  gamma        60.0k     236s   37.7%  3.93ms    180GiB   36.3%  3.07MiB
    sPPM 2     19.9M     185s   29.5%  9.26μs    157GiB   31.7%  8.28KiB
    sPPM 3     3.30M    12.7s    2.0%  3.86μs   6.59GiB    1.3%  2.09KiB
    sPPM 1     3.30M    2.65s    0.4%   802ns    302MiB    0.1%    96.0B
  theta        60.0k    587ms    0.1%  9.79μs   21.8MiB    0.0%     381B
  sigma2h      60.0k    274ms    0.0%  4.57μs   89.2MiB    0.0%  1.52KiB
  eta1         5.00k    200ms    0.0%  40.0μs     0.00B    0.0%    0.00B
  muh          60.0k    170ms    0.0%  2.83μs     0.00B    0.0%    0.00B
  phi1         5.00k    168ms    0.0%  33.7μs   24.3MiB    0.0%  4.98KiB
  lambda2      5.00k   54.7ms    0.0%  10.9μs   8.40MiB    0.0%  1.72KiB
  tau2         60.0k   53.7ms    0.0%   895ns     0.00B    0.0%    0.00B
  phi0         5.00k   53.2ms    0.0%  10.6μs   5.95MiB    0.0%  1.22KiB
  beta         60.0k   33.3ms    0.0%   555ns     0.00B    0.0%    0.00B
  alpha        5.00k   31.9ms    0.0%  6.38μs   7.32MiB    0.0%  1.50KiB
 ───────────────────────────────────────────────────────────────────────

As said, the problem should be in the sPPM parts (without them everything runs quickly). For example the "sPPM 4" section takes a lot of time and memory, yet it is just these lines of code:

@timeit to "sPPM 4" begin
	if sPPM
		# extract the units of label kk
		indexes = findall(jj -> rho_tmp[jj]==kk, 1:n)
		# take their spatial coordinates
		s1n = @view sp1[indexes]
		s2n = @view sp2[indexes]
		# add the cohesion to the log posterior probability variable
		lpp += spatial_cohesion(spatial_cohesion_idx, s1n, s2n, sp_params, lg=true, M=M_dp) 
	end
end
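For reference, a variant I was considering (only a sketch, not tested; it assumes an integer buffer cluster_idx_buf = Vector{Int}(undef, n) allocated once before the sampler) that avoids allocating a new vector with findall at every call:

if sPPM
	# fill the preallocated buffer in place instead of calling findall
	nk = 0
	for jj in 1:n
		if rho_tmp[jj] == kk
			nk += 1
			cluster_idx_buf[nk] = jj
		end
	end
	idxs = @view cluster_idx_buf[1:nk]
	s1n = @view sp1[idxs]
	s2n = @view sp2[idxs]
	lpp += spatial_cohesion(spatial_cohesion_idx, s1n, s2n, sp_params, lg=true, M=M_dp)
end

Of course this only removes the findall allocation; the cohesion call itself still allocates.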

I was thinking that the performance drop could be due to

  • the number of times it is called (which I think is not negotiable: it loops over units, time instants and subclusters, but that is how the fitting algorithm works, and it is the same as what the C model does), so maybe there is nothing much to do here
  • the fact that I preallocated only the important variables, i.e. the ones storing the iterates of the model (for example I set Si_iter = ones(Int64,n,T), theta_iter = zeros(T) and all the other parameters in this way before the start of the fitting algorithm), while I did not preallocate the "temporary" variables such as indexes or rho_tmp; maybe I should preallocate those too and recycle them at every iteration
  • the "indirect" path of the computation, since I call the spatial_cohesion function, which in turn calls the right cohesion function based on the passed spatial_cohesion_idx; this could add some overhead, since for example the C competitor (which stays reasonably quick also with spatial information) calls a specific cohesion function directly, without this indirect but more flexible approach (a small benchmark sketch comparing the two call paths follows this list)
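To check the last point, I guess I could benchmark the two call paths in isolation, something like the following (assuming the functions shown below are already defined in the session; the sizes and parameter values are just placeholders for cohesion 3/4):

using BenchmarkTools

s1 = rand(60); s2 = rand(60)
mu_0 = [0.0, 0.0]; k0 = 1.0; v0 = 5.0; Psi = [1.0 0.0; 0.0 1.0]   # placeholder parameters
sp_params = [mu_0, k0, v0, Psi]

# indirect path, as called inside the sampler
@btime spatial_cohesion(4.0, $s1, $s2, $sp_params; lg=true, M=1.0)
# direct call to the corresponding cohesion function
@btime cohesion3_4($s1, $s2, $mu_0, $k0, $v0, $Psi; Cohesion=4, lg=true, M=1.0)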

For completeness the function definition is

function spatial_cohesion(idx::Real, s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, sp_params; lg::Bool, M::Real)
    # idx should reasonably be ::Int but I'm calling this function from R (using JuliaConnectoR) 
    # where everything is converted into Float, that's why I put ::Real
	idx==1.0 && return cohesion1(s1,s2,sp_params...,lg=lg,M=M) 
	idx==2.0 && return cohesion2(s1,s2,sp_params...,lg=lg,M=M) 
	idx==3.0 && return cohesion3_4(s1,s2,sp_params...,lg=lg,M=M,Cohesion=3) 
	idx==4.0 && return cohesion3_4(s1,s2,sp_params...,lg=lg,M=M,Cohesion=4) 
	idx==5.0 && return cohesion5(s1,s2,sp_params...,lg=lg,M=M) 
	idx==6.0 && return cohesion6(s1,s2,sp_params...,lg=lg,M=M) 
end

where I use the splat operator since each cohesion has a different number and types of parameters; maybe this is also a source of some overhead. For example:

function cohesion3_4(s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, mu_0::AbstractVector{Float64}, k0::Real, v0::Real, Psi::Matrix{Float64}; Cohesion::Int, lg::Bool, M::Real=1.0)
	sdim = length(s1)
	sp = [s1 s2]
	sbar = vec(mean(sp, dims=1))
	S = zeros(2,2)
	for i in 1:sdim
		vtemp = sp[i,:] - sbar
		S += (vtemp)*(vtemp)'
	end
	
    # compute updated parameters for cohesion 3
	kn = k0+sdim
	vn = v0+sdim
	Psi_n = Psi + S + (k0*sdim)/(k0+sdim)*(sbar-mu_0)*(sbar-mu_0)'
	
	if Cohesion == 3
		out = -sdim * log(π) + G2a(0.5 * vn, true) - G2a(0.5 * v0, true) + 0.5 * v0 * logdet(Psi) - 0.5 * vn * logdet(Psi_n) + log(k0) - log(kn)
		return lg ? out : exp(out)
	end

	# or for cohesion 4
	knn = kn+sdim
	vnn = vn+sdim
	mu_n = (k0*mu_0 + sdim*sbar)/(k0+sdim)
	Psi_nn = Psi_n + S + (kn*sdim)/(kn+sdim)*(sbar-mu_n)*(sbar-mu_n)'
	
	if Cohesion == 4
		out = -sdim * log(π) + G2a(0.5 * vnn, true) - G2a(0.5 * vn, true) + 0.5 * vn * logdet(Psi_n) - 0.5 * vnn * logdet(Psi_nn) + log(kn) - log(knn)
		return lg ? out : exp(out)
	end
end
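I suppose the sp = [s1 s2], sp[i,:] - sbar and vtemp*vtemp' lines each allocate a small array on every call, and this function is called millions of times. One idea I had (just a sketch of the scatter-matrix part, meant to go inside cohesion3_4) is to accumulate the three distinct entries of the symmetric 2x2 matrix as scalars:

sdim = length(s1)
# coordinate means, without building the sdim x 2 matrix [s1 s2]
sbar1 = sum(s1) / sdim
sbar2 = sum(s2) / sdim
# accumulate the symmetric 2x2 scatter matrix entrywise
S11 = 0.0; S12 = 0.0; S22 = 0.0
@inbounds for i in 1:sdim
	d1 = s1[i] - sbar1
	d2 = s2[i] - sbar2
	S11 += d1 * d1
	S12 += d1 * d2
	S22 += d2 * d2
end

The 2x2 matrices further down (Psi_n, Psi_nn) and their logdets could probably be handled the same way with scalars, or with SMatrix{2,2} from StaticArrays.jl, which stays on the stack.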

The @views usage was just a simple idea to maybe gain some performance, but apparently it is not the main problem.
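One restructuring I was also considering (again only a sketch, not tested) is to resolve the cohesion choice and the parameter types once, at the R/Julia boundary before the MCMC loop, and then pass a single specialized function into the sampler, so that the hot path does not go through the idx comparisons and the splatting of an untyped container:

# once, before the fitting loop
idx    = Int(spatial_cohesion_idx)   # Float64 coming from R -> Int
params = Tuple(sp_params)            # concretely typed tuple instead of e.g. a Vector{Any}

if idx == 4
	cohesion_fn = (s1, s2) -> cohesion3_4(s1, s2, params...; Cohesion=4, lg=true, M=M_dp)
elseif idx == 3
	cohesion_fn = (s1, s2) -> cohesion3_4(s1, s2, params...; Cohesion=3, lg=true, M=M_dp)
else
	# ... same pattern for cohesion1, cohesion2, cohesion5, cohesion6
	cohesion_fn = (s1, s2) -> spatial_cohesion(idx, s1, s2, params; lg=true, M=M_dp)
end

# inside the sampler ("sPPM 4" and friends) the call then becomes
lpp += cohesion_fn(s1n, s2n)

Passing cohesion_fn as an argument of the fit function (a function barrier) should then let Julia specialize the loop on it.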

It's my first time writing such "complex" and really performance-oriented functions (for example, when moving to more serious settings, e.g. 100k iterations, these models can literally take hours to run), so I'm not really sure how to proceed, and any advice is therefore welcome.

You really need to use a profiler with a flame graph to understand performance problems like this. I like ProfileView.jl, but there is also one built into VS Code.

The time something takes doesn't give you much information. A profile shows you which functions take how long and whether it is type instability (red) or allocation (yellow) taking the time.
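Getting a flame graph is only a couple of lines, something like this (fit_model and data stand in for whatever you actually call):

using ProfileView

fit_model(data)             # run once first, so compilation is not in the profile
@profview fit_model(data)   # opens an interactive flame graph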

I tried to use a profiler, but since the code is just a single long function, not divided into subcalls, I thought it would not be very useful, because I could not clearly see the different sections (update this parameter, then this one, and so on).

UPDATE: the previous image was wrong; here is the right one.


So yes, it seems that we can actually see something now, like the yellow spots tagged with the garbage-collector flag.
[flame graph screenshot]

It looks like most of the time is in pretty_log and show. You don't want to be logging while running performance tests. Or is this the first run of the profile?

Anyway, something is wrong with what you are running. I can't really see your code at all in the graph, which means your code is not what's taking the time.

The pretty_log function is just defined but not used in my runs. It was left over from past tests, but now I am not logging anything, so yes, I could have commented it out.

Anyway, I will then try to inspect it in other ways. The main point of the question was to understand whether there are some performance tweaks I could introduce in the sections of code above. So I was trying to understand

  • whether I should preallocate all/most of the working variables, or whether for small ones this would not make much difference
  • whether the intermediate spatial_cohesion function could introduce overhead compared to direct calls to the cohesion functions
  • etc., general guidelines for writing performant functions (e.g. the allocation-tracking sketch after this list)
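For the allocation side, I understand one option is to start Julia with allocation tracking and look at the per-line .mem files it writes on exit (fit.jl, fit_model and data are placeholder names here):

# start Julia as: julia --track-allocation=user
using Profile
include("fit.jl")             # file defining fit_model
fit_model(data)               # run once so compilation is not counted
Profile.clear_malloc_data()   # reset the counters accumulated so far
fit_model(data)               # run again; line-by-line counts go to *.mem files on exit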

Sorry, I don't get it. The flame graph is showing all the problems. You are currently paying a huge price for an unused logging call.
It will show all the allocation problems too, with colors.

Like, really, I mean everything will be in the flame graph; I'm trying to help you to help yourself. If you don't have enough detail, run it for longer. It's a sampling profiler, so you need to get enough samples to build a detailed graph.

(and yes, not allocating is always a central goal in performance tuning, but so is avoiding type instability in Julia. In contrast, function calls are very low on the list of performance concerns)
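For instance, you can check whether the cohesion call is type stable with something like this (adjust the arguments to whatever you actually pass; an untyped sp_params coming from R would be a likely culprit):

s1 = rand(10); s2 = rand(10)
sp_params = Any[[0.0, 0.0], 1.0, 5.0, [1.0 0.0; 0.0 1.0]]   # e.g. a Vector{Any}, as it might arrive from R
@code_warntype spatial_cohesion(4.0, s1, s2, sp_params; lg=true, M=1.0)

Any red Any in that output marks a spot the compiler cannot infer, and those spots usually come with allocations.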

Also, see:

https://docs.julialang.org/en/v1/manual/performance-tips/


@fmor since you are investigating geostatistical clustering methods, you might be interested in comparing your method with others from the literature:

Take a look at the GHC, GSC and SLIC methods in the link above.


Yes, sorry, there was an old image from my profiling tests. And now, yes, I see many yellow spots with the GC flag, so they should be the problem. I will investigate that flame graph further then. Thank you!


Yes, thank you, I will try; especially since this is part of a thesis, it could be good to implement such comparisons. I planned to test other models in R, but those Julia ones could also enter the analysis.
