Any suggestions for speeding up my function?

Hi,

I have a function that accepts 4 matrices and does some calculations. It works correctly but is very slow, and I'd like to know how to improve its performance.

I'm wondering whether it can be made faster; so far I only know that using @. speeds things up.

Here is a demo of my function. Θ, M, Π, and X are all matrices of the same size; X is a Matrix{Int} and the others are Matrix{Float32}.

using SpecialFunctions
γ = loggamma # let γ represent the loggamma() function

function myFunc(Θ,M,Π,X)
    eps_ = 1f-10
    
    A = @. γ(Θ + eps_) + γ(X + 1) - γ(X + Θ + eps_)

    B = @. (Θ + X) * log( 1 + M / (Θ + eps_) ) +
        X * log(Θ + eps_) - log(M + eps_)
    C = A .+ B
    D = @. C  - log(1f0 - Π + eps_)
    E = @. (Θ/( Θ + M + eps_))^Θ
    F = @. -log(Π + (1f0 - Π) * E + eps_)

    # If an element of X is (effectively) zero (X .< 1f-8),
    # take the corresponding element of F, otherwise the one from D,
    # then sum everything up.
    sum(ifelse.(X .< 1f-8, F, D))
end

# demo data

using StatsBase  # provides sample

r, c = 20155, 7576
X = sample(1:20, r * c) |> x -> reshape(x, r, c)
Θ = rand(Float32,r,c)
M = rand(Float32,r,c)
Π = rand(Float32,r,c)

@time myFunc(Θ,M,Π,X)
#  21.572525 seconds (36 allocations: 6.257 GiB, 4.96% gc time)

Any suggestions are appreciated!

t1 and t2 are not defined in your function or demo data

The ‘middle’ X should also be capitalized.

Sorry for my typos, I have corrected them.

Each line of your code creates a new intermediate array. It will be much faster if you merge all the element-wise operations into a single loop, like:

for i in eachindex(X)
     a = ...
     b = ...
     etc...
end

where a is one element of what you are computing as A, for example.

(didn’t check the details)
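For concreteness, here is a minimal sketch of what that fused loop could look like, reusing the formulas from the original post (myFunc_loop is a made-up name, and I haven't checked the details either; γ is the loggamma alias from above):

function myFunc_loop(Θ, M, Π, X)
    eps_ = 1f-10
    s = 0f0  # plain Float32 accumulator, unlike sum's pairwise algorithm
    for i in eachindex(Θ, M, Π, X)
        θ, m, p, x = Θ[i], M[i], Π[i], X[i]
        if x < 1f-8
            # the E/F branch of the original code
            e = (θ / (θ + m + eps_))^θ
            s += -log(p + (1f0 - p) * e + eps_)
        else
            # the A/B/C/D branch of the original code
            a = γ(θ + eps_) + Float32(γ(x + 1)) - γ(x + θ + eps_)
            b = (θ + x) * log(1 + m / (θ + eps_)) +
                x * log(θ + eps_) - log(m + eps_)
            s += a + b - log(1f0 - p + eps_)
        end
    end
    return s
end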


I am not that good at optimizing, but you can add one extra allocation that decreases the computation time. Θ + eps_ is computed multiple times, so it can be precomputed once with eps_Θ = @. Θ + eps_:

function myFunc(Θ,M,Π,X)
    eps_ = 1f-10
    eps_Θ = @. Θ + eps_
    A = @. γ(eps_Θ) + γ(X + 1) - γ(X + eps_Θ)

    B = @. (Θ + X) * log( 1 + M / (eps_Θ) ) +
        X * log(eps_Θ) - log(M + eps_)
    C = @. A + B
    D = @. C  - log(1f0 - Π + eps_)
    E = @. (Θ/(eps_Θ + M))^Θ
    F = @. -log(Π + (1f0 - Π) * E + eps_)

    # If an element of X is (effectively) zero (X .< 1f-8),
    # take the corresponding element of F, otherwise the one from D,
    # then sum everything up.
    sum(ifelse.(X .< 1f-8, F, D))
end

Whenever you introduce a function alias, do so with const; otherwise it will be considered a non-const global, which is bad for performance. When you call it just a few times in vectorized form it doesn't matter much, but as soon as you call it in a hot loop it will be a disaster.

const γ = loggamma # let γ represent the loggamma() function
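For illustration, a minimal sketch of the difference in a hot loop (the names here are made up):

using SpecialFunctions

γ_alias = loggamma        # non-const global: its type could change at any time
const γ_fixed = loggamma  # const alias: the compiler can specialize on it

function hot_nonconst(x)
    s = 0.0
    for v in x
        s += γ_alias(v)   # dynamic lookup and dispatch on every iteration
    end
    return s
end

function hot_const(x)
    s = 0.0
    for v in x
        s += γ_fixed(v)   # statically resolved, can be inlined
    end
    return s
end

On a large array, hot_const is typically much faster, and @code_warntype hot_nonconst(rand(10)) will flag the non-const global as type-unstable (::Any).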

In your function you compute a lot of values pointwise and then sum them up. There is no need to have all values available at the same time, and in particular not all the intermediate values in the calculations. Moreover, if you don’t vectorize each step, you can avoid computing both (A, B, C, D) and (E, F).

This is an easy transformation which avoids most of the unnecessary allocations and computations:

myFunc(Θ, M, Π, X) = sum(myFunc2.(Θ, M, Π, X))

function myFunc2(Θ, M, Π, X)
    eps_ = 1f-10
    if X < 1f-8
        # X is (effectively) zero: compute only the F term
        E = (Θ/( Θ + M + eps_))^Θ
        F = -log(Π + (1f0 - Π) * E + eps_)
    else
        # otherwise compute only the D term
        A = γ(Θ + eps_) + Float32(γ(X + 1)) - γ(X + Θ + eps_)
        B = (Θ + X) * log( 1 + M / (Θ + eps_) ) +
            X * log(Θ + eps_) - log(M + eps_)
        C = A + B
        D = C - log(1f0 - Π + eps_)
    end
end

Slightly faster, and with zero allocations:

myFunc(Θ, M, Π, X) = sum(t->myFunc2(t...), zip(Θ, M, Π, X))

The other version allocates half a Gigabyte, so getting rid of allocations entirely can be useful.
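If you want to verify the allocation claims yourself, here is a quick sketch (assuming BenchmarkTools is installed):

using BenchmarkTools
@btime sum(t -> myFunc2(t...), zip($Θ, $M, $Π, $X))  # zip version: should report 0 allocations
@btime sum(myFunc2.($Θ, $M, $Π, $X))                 # broadcast version: allocates the intermediate array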


By the way, the demo line X = sample(1:20, r * c) |> x -> reshape(x, r, c) is just the same as

X = rand(1:20, r, c)

I skipped this transformation to keep things simpler, but also because I had a nagging feeling that it wouldn't be safe to miss out on the pairwise summation used for arrays, given this large number of low-precision numbers. This illustrates the dangers:

julia> f(Θ, M, Π, X) = 0.1f0
f (generic function with 1 method)

julia> sum(t -> f(t...), zip(Θ, M, Π, X))
2.097152f6

julia> sum(f.(Θ, M, Π, X))
1.526943f7

(The sequential sum gets stuck: 2.097152f6 is exactly 2^21, where the spacing between adjacent Float32 values is 0.25, so adding 0.1f0 no longer changes the accumulator. Pairwise summation keeps the partial sums small and avoids this.)

Yet, with @DNF's version, we can do:

julia> myFunc(Θ, M, Π, X) = sum(myFunc2.(Θ, M, Π, X))

julia> @time myFunc(Θ, M, Π, X) 
 10.241626 seconds (3 allocations: 582.483 MiB, 0.09% gc time)
4.7243494f8

julia> using ThreadsX

julia> myFunc3(Θ, M, Π, X) = ThreadsX.sum(t -> myFunc2(t...), zip(Θ, M, Π, X))
myFunc3 (generic function with 1 method)

julia> @time myFunc3(Θ, M, Π, X)
  2.195325 seconds (3.81 M allocations: 164.194 MiB, 2.17% gc time, 12.79% compilation time)
4.7243302f8

(Julia started here with julia -t auto, on a 6-core laptop.)

(Note that the results are clearly different because of the order of summation; thus the non-parallel version, i.e. the one using the array summation algorithm, is arguably “more correct”. Whether this matters in specific cases needs to be checked.)


On the contrary, the parallel summation is usually more accurate than the sequential summation. Compare against sum(Float64.(myFunc2.(Θ, M, Π, X))).
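For instance, a sketch of that comparison, reusing the definitions from earlier in the thread:

ref = sum(Float64.(myFunc2.(Θ, M, Π, X)))  # higher-precision reference
abs(myFunc(Θ, M, Π, X) - ref)              # error of the pairwise Float32 sum
abs(myFunc3(Θ, M, Π, X) - ref)             # error of the threaded ThreadsX sum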


I meant your non-parallel version (the one that uses the algorithm for summation over arrays). I've now clarified that by including the definition of the function I was comparing against.