Tidiest way to reuse variables inside a function to reduce gc time

I’m writing a function to fit a spatio-temporal model for clustering. At each iteration it has to compute spatial cohesions on subsets of units, and I saw that this takes a lot of the execution time, which is spent in garbage collection. Or at least this is what I interpreted from this flame graph, which represents a “medium size” fit with 30k iterations on 60 units and 12 time instants.

So i was wondering which was the tidiest way to improve this function execution, for example preallocating in some ways all the variables used in the computation (sdim, sbar1, etc) to recycle their use at every call of the function.

I think I could just define them before the iteration loop and pass them as arguments to the function, but I thought that there could be a better and tidier approach to do that (as I found for example here which however was in the case of vectors).

Here is the portion of code I was working one:

using Statistics
using SpecialFunctions
const logpi = log(π)
function G2a(a::Real, lg::Bool)
	out = logpi + lgamma(a) + lgamma(a - 0.5)
	return lg ? out : exp(out)
end

function cohesion3(s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, mu_0::AbstractVector{Float64}, k0::Real, v0::Real, Psi::Matrix{Float64}; lg::Bool, M::Real=1.0)::Float64
	sdim = length(s1)
	# Compute sample means
	sbar1 = mean(s1)
	sbar2 = mean(s2)
	# Compute deviations from the sample mean
	S1, S2, S3, S4 = 0.0, 0.0, 0.0, 0.0
	@inbounds for i in 1:sdim
		s_sbar1 = s1[i] - sbar1
		s_sbar2 = s2[i] - sbar2

		S1 += s_sbar1 * s_sbar1
		S4 += s_sbar2 * s_sbar2
		S2 += s_sbar1 * s_sbar2
	end
	S3 = copy(S2) # to avoid repeating computations
	# Updated parameters for cohesion 3
	kn = k0 + sdim
	vn = v0 + sdim

	auxvec1_1 = sbar1 - mu_0[1]
	auxvec1_2 = sbar2 - mu_0[2]

	auxmat1_1 = auxvec1_1^2
	auxmat1_2 = auxvec1_1 * auxvec1_2
	auxmat1_3 = copy(auxmat1_2)
	auxmat1_4 = auxvec1_2^2

	auxconst1 = k0 * sdim
	auxconst2 = k0 + sdim
	Psi_n_1 = Psi[1] + S1 + auxconst1 / (auxconst2) * auxmat1_1
	Psi_n_2 = Psi[2] + S2 + auxconst1 / (auxconst2) * auxmat1_2
	Psi_n_3 = Psi[3] + S3 + auxconst1 / (auxconst2) * auxmat1_3
	Psi_n_4 = Psi[4] + S4 + auxconst1 / (auxconst2) * auxmat1_4

	detPsi_n = Psi_n_1 * Psi_n_4 - Psi_n_2 * Psi_n_3
	detPsi = Psi[1] * Psi[4] - Psi[2] * Psi[3]

	out = -sdim * logpi + G2a(0.5 * vn, true) - G2a(0.5 * v0, true) + 0.5 * v0 * log(detPsi) - 0.5 * vn * log(detPsi_n) + log(k0) - log(kn)
	return lg ? out : exp(out)
end

# simple test example
n = 20
s1 = rand(n)
s2 = rand(n)
mu_0 = [1.,2.]
k0 = 0.5
v0 = 0.5
Psi = [1. 0.2; 0.2 2.]
@timev cohesion3(s1, s2, mu_0, k0, v0, Psi, lg=false)

In the code you provided I don’t see any allocations in cohesion3, so garbage collection should not be an issue, nor would preallocating help. The @timev output is

julia> @timev cohesion3(s1, s2, mu_0, k0, v0, Psi, lg=false)
  0.000008 seconds (1 allocation: 16 bytes)
elapsed time (ns):  8500
gc time (ns):       0
bytes allocated:    16
pool allocs:        1
non-pool GC allocs: 0
minor collections:  0
full collections:   0
1.1979943791473496e-13

so there is not much to worry about here. The single allocations is due to using global variables, so disappears when using e.g. @btime with interpolation ($). Perhaps the provided code does not actually illustrate the issue you encounter?


You could certainly make the code a bit tidier, e.g. using StaticArrays.jl. For example, you can replace

Psi_n_1 = Psi[1] + S1 + auxconst1 / (auxconst2) * auxmat1_1
Psi_n_2 = Psi[2] + S2 + auxconst1 / (auxconst2) * auxmat1_2
Psi_n_3 = Psi[3] + S3 + auxconst1 / (auxconst2) * auxmat1_3
Psi_n_4 = Psi[4] + S4 + auxconst1 / (auxconst2) * auxmat1_4

with some broadcasting:

Psi_n = Psi .+ S .+ auxconst1 / auxconst2 .* auxmat1

Of course, this requires some previous definitions

S = SMatrix{2, 2}((S1, S2, S3, S4))  # or you could use S = @MMatrix zeros(2, 2) and don't work with S1 etc. at all, but this does allocate
sbar = SVector((sbar1, sbar2))
auxvec1 = sbar .- mu_0 
auxmat1 = auxvec1 * auxvec1'

In general I would say such vectorised statements are cleaner, and also closer to the mathematics you’d write.

Ideally, you would also make Psi an SMatrix{2, 2, Float64} and mu_0 an SVector{2, Float64} (or mutable variants) to avoid some allocations in the new declarations of auxvec1 and Psi_n. Additionally, you can also just use det(Psi_n) after importing LinearAlgebra.jl.


Edit:

Resulting somewhat tidier code (in my opinion)
using Statistics
using SpecialFunctions
using LinearAlgebra
using StaticArrays
using BenchmarkTools

const logpi = log(π)

function G2a(a::Real, lg::Bool)
	out = logpi + lgamma(a) + lgamma(a - 0.5)
	return lg ? out : exp(out)
end

function cohesion3(s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, mu_0::AbstractVector{Float64}, k0::Real, v0::Real, Psi::AbstractMatrix{Float64}; lg::Bool, M::Real=1.0, S=@MMatrix zeros(2, 2))::Float64
    sdim = length(s1)
    # Compute sample means
    sbar1 = mean(s1)
    sbar2 = mean(s2)
    # Compute deviations from the sample mean
    S .= 0.
    @inbounds for i in 1:sdim
        s_sbar1 = s1[i] - sbar1
        s_sbar2 = s2[i] - sbar2

        S[1, 1] += s_sbar1 * s_sbar1
        S[2, 2] += s_sbar2 * s_sbar2
        S[2, 1] += s_sbar1 * s_sbar2
    end
    S[1, 2] = S[2, 1] # to avoid repeating computations
    # Updated parameters for cohesion 3
    kn = k0 + sdim
    vn = v0 + sdim

    sbar = SVector((sbar1, sbar2))
     # (You could probably also stack s1 and s2 into a matrix, and use row/columnwise mean to get sbar directly.)
    auxvec1 = sbar .- mu_0
    auxmat1 = auxvec1 * auxvec1'

    auxconst1 = k0 * sdim
    auxconst2 = k0 + sdim
    Psi_n = Psi .+ S .+ auxconst1 / (auxconst2) .* auxmat1

    out = -sdim * logpi + G2a(0.5 * vn, true) - G2a(0.5 * v0, true) + 0.5 * v0 * log(det(Psi)) - 0.5 * vn * log(det(Psi_n)) + log(k0) - log(kn)
    return lg ? out : exp(out)
end

n = 20
s1 = rand(n)
s2 = rand(n)
mu_0 = @SVector [1.,2.]
k0 = 0.5
v0 = 0.5
Psi = @SMatrix [1. 0.2; 0.2 2.]
S = @MMatrix zeros(2, 2)
@btime cohesion3($s1, $s2, $mu_0, $k0, $v0, $Psi, lg=false, S=$S)
    #    242.462 ns (0 allocations: 0 bytes)
3 Likes

Thank you, I wrote originally a vector version of that function, but that turned out to be very slow due to all the memory allocated and then freed:

# original vector one
julia> @btime cohesion3_4($s1, $s2, $mu_0, $k0, $v0, $Psi; Cohesion=3, lg=lg, M=1.0)
  3.837 μs (55 allocations: 4.55 KiB)
7.084107618097855e-9

# scalar one, of the code above
julia> @btime cohesion3($s1, $s2, $mu_0, $k0, $v0, $Psi; lg=lg, M=1.0)
  651.678 ns (5 allocations: 112 bytes)
7.084107618097855e-9

The vector one was indeed tidier and more readable, but I abandoned it in favour of the scalar one since it was more efficient. But maybe yes using StaticArrays I could solve the performance issues of the vector one while mantaining more readability. Thank you, I will try with them :slight_smile:

1 Like

I’m also updating another cohesion function with StaticArrays. I wrote this proposal, followig the style and reasoning of your updated code, but however I get Bad input for @MMatrix error. Can you explain me why? especially because running it line by line it works.

Update: maybe I can just preallocate S, while the other matrices I can leave them as they are inside the code. Yes, this new version works.

function cohesion4_propose(s1::AbstractVector{Float64}, s2::AbstractVector{Float64}, mu_0::AbstractVector{Float64}, k0::Real, v0::Real, Psi::AbstractMatrix{Float64}; lg::Bool, M::Real=1.0, S=@MMatrix zeros(2, 2))::Float64
    sdim = length(s1)
    # Compute sample means
    sbar1 = mean(s1)
    sbar2 = mean(s2)
    # Compute deviations from the sample mean
    S .= 0.
    @inbounds for i in 1:sdim
        s_sbar1 = s1[i] - sbar1
        s_sbar2 = s2[i] - sbar2

        S[1, 1] += s_sbar1 * s_sbar1
        S[2, 2] += s_sbar2 * s_sbar2
        S[2, 1] += s_sbar1 * s_sbar2
    end
    S[1, 2] = S[2, 1] # to avoid repeating computations
    # Updated parameters for cohesion 3
    kn = k0 + sdim
    vn = v0 + sdim

    sbar = SVector((sbar1, sbar2))
     # (You could probably also stack s1 and s2 into a matrix, and use row/columnwise mean to get sbar directly.)
    auxvec1 = sbar .- mu_0
    auxmat1 = auxvec1 * auxvec1'

    auxconst1 = k0 * sdim
    auxconst2 = k0 + sdim
    Psi_n = Psi .+ S .+ auxconst1 / (auxconst2) .* auxmat1

	# Updated parameters for cohesion 4
	knn = kn + sdim
	vnn = vn + sdim

	mu_n = (k0 * mu_0 + sdim * sbar) / (auxconst2)
	sbar_mun = sbar - mu_n
	auxmat2 = sbar_mun * sbar_mun' 

	auxconst3 = kn * sdim
	auxconst4 = kn + sdim
	Psi_nn = Psi_n .+ S .+ auxconst3 / (auxconst4) .* auxmat2

	out = -sdim * logpi + G2a(0.5 * vnn, true) - G2a(0.5 * vn, true) + 0.5 * vn * logdet(Psi_n) - 0.5 * vnn * logdet(Psi_nn) + log(kn) - log(knn)
	return lg ? out : exp(out)
end

I think you should just add some brackets, to make clear where the code affected by the macro ends, e.g.

julia> using StaticArrays

julia> f(x = @MMatrix zeros(2, 2), y = @MMatrix zeros(2, 2)) = nothing
ERROR: LoadError: Bad input for @MMatrix
Stacktrace:
(...)

julia> f(x = @MMatrix(zeros(2, 2)), y = @MMatrix(zeros(2, 2))) = nothing
f (generic function with 3 methods)

Indeed, creating SArrays inside the function is fine as they don’t allocate. If you need mutability (MArrays (or ordinary Arrays), which do allocate), preallocation can be useful.

1 Like

You have two variants of the variance loop:

    @inbounds for i in 1:sdim
        s_sbar1 = s1[i] - sbar1
        s_sbar2 = s2[i] - sbar2

        S[1, 1] += s_sbar1 * s_sbar1
        S[2, 2] += s_sbar2 * s_sbar2
        S[2, 1] += s_sbar1 * s_sbar2
    end

vs

	@inbounds for i in 1:sdim
		s_sbar1 = s1[i] - sbar1
		s_sbar2 = s2[i] - sbar2

		S1 += s_sbar1 * s_sbar1
		S4 += s_sbar2 * s_sbar2
		S2 += s_sbar1 * s_sbar2
	end

Of these two, the scalar one is probably preferable.

The reason is the following:

Under naive compilation, S1 += s_sbar1*s_sbar1 keeps S1 in a register, while S[1, 1] += s_sbar1 * s_sbar1 needs to load a value from S[1,1] and then store to it again. This is more expensive! The load comes from the store queue and goes into the store queue, so you can expect that this doesn’t cause significant memory traffic; but loading from store queue is still not free and has multiple cycles of latency.

Now, in real life the compiler will try to avoid that: It will try to convert your code into something like

S_tmp = S[1,1]
for ...
    S_tmp += ...
end
S[1,1] = S_tmp

In order to do that, the compiler needs to prove that nobody else is reading from or writing to S[1,1] inside the loop (alias analysis!), and that there are no other loop exits that manage to avoid the store. This especially means that the compiler needs to prove that the loop body does not throw!

Such analysis is fickle. Even if you benchmark that this works fine in your context, this is not obvious from the source code. It may break if you at some point pass an AbstractVector that happens to be non-standard in the future (e.g. using pointers / unsafe_load under the hood, or that potentially throws). Or it may break in a future or past julia version (you benchmarked that your code is fast on 1.10 but it might be slow on 1.6).

Generally it is a pain to review such source code for performance issues.

Instead, there are two reasonable approaches:

  1. You scalarize.
  2. You allocate a temporary MMatrix S inside the function.

Approach 2 informs the compiler that it has unique ownership of S. Hence nobody else can read/write to it, and in case of exceptions, writes to S are invisible, because S does not escape. Ideally the compiler will be able to elide the allocation and transform that variant into the scalarized form. But even if that fails, a single MMatrix allocation is cheapish.

1 Like

Thanks for the deep and interesting analysis! In the end yes as suggested by you and eldee I’m moving towards using a Mutable Matrix for S, also passing it as an argument in order to preallocate and reuse it at every iteration.

# original pure vector form
julia> @btime cohesion3_4($s1, $s2, $mu_0, $k0, $v0, $Psi, Cohesion=4, lg=lg, M=1.0)
  1.200 μs (29 allocations: 1.70 KiB)
0.00021254299467689762

# scalar form
julia> @btime cohesion4($s1, $s2, $mu_0, $k0, $v0, $Psi; lg=lg, M=1.0)
  532.086 ns (8 allocations: 240 bytes)
0.00021254299467689762

# updated vector form with StaticArrays package
# S=@MMatrix zeros(2, 2)
julia> @btime cohesion4_propose($s1, $s2, $mu_0, $k0, $v0, $Psi; lg=lg, M=1.0, S=$S)
  511.696 ns (7 allocations: 192 bytes)
0.00021254299467689762

Firstly, lgamma from SpecialFunctions.jl is deprecated, don’t use it:

julia> using SpecialFunctions

julia> lgamma(2.5)
┌ Warning: `lgamma(x::Real)` is deprecated, use `(logabsgamma(x))[1]` instead.
│   caller = top-level scope at REPL[2]:1
└ @ Core REPL[2]:1
0.2846828704729192

One nonobvious thing is that using deprecated functions can negatively affect performance. You could do something like this:

using SpecialFunctions: logabsgamma

function lgamma(x::Real)
    first(logabsgamma(x))
end

Apart from that, I suppose there may be a good way to calculate lgamma(a) + lgamma(a - 0.5) directly, without calling logabsgamma twice.

For the lgamma notice, thank you, I will fix that as suggested. About combining some of the terms I was already thinking about that, but that would for example imply another operation, a multiplication, to combine the logarithms, which I thought could be more expensive that a sum plus a small function call. I will anyway test the two approaches.

# a = 20
# b = a-0.5
julia> @btime log(gamma($a)*gamma($b))
  105.223 ns (0 allocations: 0 bytes)
77.2009706961606

julia> @btime lgamma($a) + lgamma($b)
  42.222 ns (0 allocations: 0 bytes)
77.2009706961606
1 Like