I have a function that accepts 4 matrices and does some calculations. It works correctly but is very slow. How can I improve its performance?

I’m wondering if it can be made faster; all I know so far is that using @. speeds things up.

Here is a demo of my function. Θ, M, Π, X are all matrices of the same size; X is a Matrix{Int} and the others are Matrix{Float32}.

using SpecialFunctions
using StatsBase: sample  # `sample` below comes from StatsBase

γ = loggamma # let γ represent the loggamma function

function myFunc(Θ, M, Π, X)
    eps_ = 1f-10
    A = @. γ(Θ + eps_) + γ(X + 1) - γ(X + Θ + eps_)
    B = @. (Θ + X) * log(1 + M / (Θ + eps_)) +
           X * log(Θ + eps_) - log(M + eps_)
    C = A .+ B
    D = @. C - log(1f0 - Π + eps_)
    E = @. (Θ / (Θ + M + eps_))^Θ
    F = @. -log(Π + (1f0 - Π) * E + eps_)
    # If an element of X is less than 0, take the corresponding element
    # of F, otherwise the one of D; finally sum everything up.
    sum(ifelse.(X .< 1f-8, F, D))
end

# demo data
r, c = 20155, 7576
X = sample(1:20, r * c) |> x -> reshape(x, r, c)
Θ = rand(Float32, r, c)
M = rand(Float32, r, c)
Π = rand(Float32, r, c)
@time myFunc(Θ, M, Π, X)
# 21.572525 seconds (36 allocations: 6.257 GiB, 4.96% gc time)

Each line of your code creates a new intermediate array. It will be much faster if you merge all the element-wise operations into a single loop, like:

for i in eachindex(X)
    a = ...
    b = ...
    # etc.
end

where a is one element of what you are computing as A, for example.
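For concreteness, here is one way that fused loop might look (a sketch, not a drop-in replacement from the thread; the name myFunc_loop is my own, and I accumulate in Float64 to guard against the Float32 precision issue discussed further down):

```julia
using SpecialFunctions

const γ = loggamma

# Sketch: all element-wise operations fused into one loop, no intermediate
# arrays. Only the branch that is actually needed per element is computed.
function myFunc_loop(Θ, M, Π, X)
    eps_ = 1f-10
    s = 0.0  # Float64 accumulator: naive Float32 accumulation loses precision
    for i in eachindex(X)
        θ, m, p, x = Θ[i], M[i], Π[i], X[i]
        if x < 1f-8
            e = (θ / (θ + m + eps_))^θ
            s += -log(p + (1f0 - p) * e + eps_)
        else
            a = γ(θ + eps_) + Float32(γ(x + 1)) - γ(x + θ + eps_)
            b = (θ + x) * log(1 + m / (θ + eps_)) +
                x * log(θ + eps_) - log(m + eps_)
            s += a + b - log(1f0 - p + eps_)
        end
    end
    return Float32(s)
end
```

This allocates nothing beyond the accumulator, and each element only pays for the branch it takes.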

I am not that good at optimizing, but you can add one extra allocation that decreases the computation time. Θ + eps_ is computed multiple times, so it can be precomputed once with eps_Θ = @. Θ + eps_:

function myFunc(Θ, M, Π, X)
    eps_ = 1f-10
    eps_Θ = @. Θ + eps_
    A = @. γ(eps_Θ) + γ(X + 1) - γ(X + eps_Θ)
    B = @. (Θ + X) * log(1 + M / eps_Θ) +
           X * log(eps_Θ) - log(M + eps_)
    C = @. A + B
    D = @. C - log(1f0 - Π + eps_)
    E = @. (Θ / (eps_Θ + M))^Θ
    F = @. -log(Π + (1f0 - Π) * E + eps_)
    # If an element of X is less than 0, take the corresponding element
    # of F, otherwise the one of D; finally sum everything up.
    sum(ifelse.(X .< 1f-8, F, D))
end

Whenever you introduce a function alias, do so with const; otherwise it will be treated as a non-const global, which is bad for performance. When you call it just a few times in vectorized form it doesn’t matter much, but as soon as you call it in a hot loop it will be a disaster.

const γ = loggamma # let γ represent the loggamma function
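To illustrate the difference (a toy example; the names f_slow/f_fast are mine): with a non-const global, every call has to look the binding up dynamically and can’t be devirtualized, because the binding’s type could change at any moment. With const, the compiler knows the type and can inline the call.

```julia
using SpecialFunctions

g_nonconst = loggamma        # non-const global: type may change, so calls
                             # through it are dynamic and slow in hot loops
const g_const = loggamma     # const global: type is fixed at compile time

f_slow(x) = g_nonconst(x)    # looked up on every call
f_fast(x) = g_const(x)       # resolved and inlinable at compile time
```

Both return the same values; only the generated code differs. You can confirm the instability with @code_warntype f_slow(2.5), which flags the return type as Any.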

In your function you compute a lot of values pointwise and then sum them up. There is no need to have all the values available at the same time, and in particular not all the intermediate values of the calculation. Moreover, if you don’t vectorize each step, you can avoid computing both (A, B, C, D) and (E, F) for every element.

This is an easy transformation which avoids most of the unnecessary allocations and computations:

myFunc(Θ, M, Π, X) = sum(myFunc2.(Θ, M, Π, X))

function myFunc2(Θ, M, Π, X)
    eps_ = 1f-10
    if X < 1f-8
        E = (Θ / (Θ + M + eps_))^Θ
        return -log(Π + (1f0 - Π) * E + eps_)
    else
        A = γ(Θ + eps_) + Float32(γ(X + 1)) - γ(X + Θ + eps_)
        B = (Θ + X) * log(1 + M / (Θ + eps_)) +
            X * log(Θ + eps_) - log(M + eps_)
        C = A + B
        return C - log(1f0 - Π + eps_)
    end
end

I skipped the further transformation to a plain loop, partly to keep it simpler, but also because I had a nagging feeling that it wouldn’t be safe to lose the pairwise summation used for arrays, given this large number of low-precision numbers. This illustrates the danger:

julia> f(Θ, M, Π, X) = 0.1f0
f (generic function with 1 method)
julia> sum(t -> f(t...), zip(Θ, M, Π, X))
2.097152f6
julia> sum(f.(Θ, M, Π, X))
1.526943f7

julia> myFunc(Θ, M, Π, X) = sum(myFunc2.(Θ, M, Π, X))
julia> @time myFunc(Θ, M, Π, X)
10.241626 seconds (3 allocations: 582.483 MiB, 0.09% gc time)
4.7243494f8
julia> using ThreadsX
julia> myFunc3(Θ, M, Π, X) = ThreadsX.sum(t -> myFunc2(t...), zip(Θ, M, Π, X))
myFunc3 (generic function with 1 method)
julia> @time myFunc3(Θ, M, Π, X)
2.195325 seconds (3.81 M allocations: 164.194 MiB, 2.17% gc time, 12.79% compilation time)
4.7243302f8

(Julia started here with julia -t auto, on a 6-core laptop.)
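If you’d rather not depend on ThreadsX, a similar parallel reduction can be sketched with Base.Threads alone (threaded_sum, its chunking scheme, and the function-as-argument design are my own illustration, not from the thread):

```julia
using Base.Threads

# Sketch: chunked multithreaded sum. Each thread accumulates its own partial
# sum in Float64, and the partials are combined at the end.
function threaded_sum(f, Θ, M, Π, X)
    n  = length(X)
    nt = nthreads()
    partials = zeros(Float64, nt)
    @threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1   # this thread's chunk of indices
        hi = div(t * n, nt)
        s = 0.0
        for i in lo:hi
            s += f(Θ[i], M[i], Π[i], X[i])
        end
        partials[t] = s
    end
    return Float32(sum(partials))
end
```

You would call it as threaded_sum(myFunc2, Θ, M, Π, X). Because each partial sum is kept in Float64, the result should also be less sensitive to the summation-order issue shown above.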

(Note that the results are clearly different because of the order of summation; the non-parallel version, i.e. the one using the summation algorithm for arrays, is “more correct”. Whether this matters in a specific case has to be checked.)

I meant your non-parallel version (the one that uses the array-summation algorithm). I have clarified that now by putting the definition of the function I was comparing against there.