Could this simple function go faster?

Hi,

First time ever writing julia code. I’ve spend some time trying to optimise this function, mainly making it type-stable, and separate some precomputations. But maybe this is not the best way to go.

The resulting code is a lot faster than it’s python mpmath equivalent (about 100x faster, thanks a lot julia for natively supporting BigFloats!), but i wander if i can squeeze more out of it: this code gets integrated into a loss function, and will therefore be evaluated a lot.

Do your advanced eyes see obvious performance bottleneck in the following code that my do not ? Especially, i wander if the CartesianIndex.() syntax is not an issue here.

Here’s the code :


# A structure for precomputations.
struct PreComp
    BINS
    LAGUERRE
    FACTS
    PREC
    MAX_SUM_OF_M
    function PreComp(precision, m)
        setprecision(precision)
        m = big(m)
        BINS = zeros(eltype(m),(m,m))
        FACTS = zeros(eltype(m),m)
        LAGUERRE = zeros(BigFloat,(m,m))
        for i in 1:m
            FACTS[i] = factorial(i-1)
        end
        for i in 1:m, j in 1:m
            BINS[i,j] = binomial(i-1,j-1)
            LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
        end
        new(BINS,LAGUERRE,FACTS,precision,m)
    end
end
const P = PreComp(2000,50)

# The main function
function get_coefficients(alpha, scales, m)
    # alpha must be an array with size (n,)
    # scales must be an array with size (n,d)
    # max_p must be a Tuple of integers with size (d,)

    # Trim down null alphas values:
    are_pos = alpha .!= 0
    scales = scales[are_pos,:]
    alpha = alpha[are_pos]

    # Allocates ressources, construct the simplex expression of the scales and the indexes.
    coefs = zeros(eltype(alpha),m)
    n = size(scales)[1]
    d = ndims(coefs)
    kappa = zeros(eltype(alpha),size(coefs))
    mu = deepcopy(kappa)
    S = scales ./ (big(1.0) .+ sum(scales,dims=2))
    I = CartesianIndices(coefs)

    for k in I

        if k == I[1]
            # Edge case
            kappa[1] = sum(alpha .* log.(big(1.0) .- sum(S,dims=2)))
            coefs[1] = mu[1] = exp(kappa[1])
        else
            # Indices and organisation
            tk = Tuple(k)
            k_arr = [ki for ki in tk]
            degree = findfirst(tk .!= 1)
            sk = sum(k_arr)
            k_minus_one_in_deg = copy(k_arr)
            k_minus_one_in_deg[degree] -= 1

            # Main Loop
            kappa[k] = P.FACTS[sk-d] * sum(alpha .* prod(S .^ transpose(big.(k_arr .- 1)),dims=2))
            for j in CartesianIndices(k)
                j_arr = [i for i in Tuple(j)]
                if j[degree] < k[degree]
                    mu[k] += mu[j] * kappa[k - j + I[1]] * prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)])
                end
                coefs[k] += mu[j] * prod(P.LAGUERRE[CartesianIndex.(k_arr,j_arr)])
            end
        end
    end
    coefs *= sqrt(big(2.0))^d
    return coefs
end

Here are some benchmark of applications :

# Some benchmarks : 

setprecision(2000)

using BenchmarkTools
println("1D test")
alpha = [10, 10, 10]
scales = [0.5; 0; 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (20)
@btime coefs1 = get_coefficients(alpha, scales,m)

println("2D test")
alpha = [10, 10]
scales = [0.5 0.1; 0.1 0.5]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10)
@btime coefs2 = get_coefficients(alpha, scales,m)

println("3D test")
alpha = [10, 10, 10, 10]
scales = [0.5 0.0 0.0; 0.0 0.5 0.0; 0.0 0.0 0.5; 1 1 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10,10)
@btime coefs3 = get_coefficients(alpha, scales,m)

println("4D test")
alpha = [10, 10, 5, 8, 9, 10, 10]
scales = [0.5 0.0 0.0 0.0; 0.0 0.5 0.0 0.0; 0.0 0.0 0.5 0.0; 1 1 1 1; 2 2 2 2; 2 0 2 0; 0 2 0 2]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (5,5,5,5)
@btime coefs4 = get_coefficients(alpha, scales,m)

The corresponding output on my machine is :

1D test
500.500 μs (4160 allocations: 551.06 KiB)
2D test
  7.774 ms (60988 allocations: 7.47 MiB)
3D test
  475.745 ms (3869963 allocations: 459.11 MiB)
4D test
  194.400 ms (1380041 allocations: 162.29 MiB)

What do you think ?

Hi, you should me it so that the fields are typed:

struct PreComp{Tb,Tl,Tf,Tp,Tm}
    BINS::Tb
    LAGUERRE::Tl
    FACTS::Tf
    PREC::Tp
    MAX_SUM_OF_M::Tm
end
function PreComp(precision, m)
        setprecision(precision)
        m = big(m)
        BINS = zeros(eltype(m),(m,m))
        FACTS = zeros(eltype(m),m)
        LAGUERRE = zeros(BigFloat,(m,m))
        for i in 1:m
            FACTS[i] = factorial(i-1)
        end
        for i in 1:m, j in 1:m
            BINS[i,j] = binomial(i-1,j-1)
            LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
        end
       PreComp(BINS,LAGUERRE,FACTS,precision,m)
end

This line allocates:

sum(alpha .* prod(S .^ transpose(big.(k_arr .- 1)),dims=2))

What to you mean by ‘This line allocates’ ? I mean, i understand that this lines produces allocations of memory, but do you mean it’s a bad thing ? Should i try to make it allocates less ?

The main point is that many things are happening in that line and intermediate (temporary) arrays are created. There are cases where the compiler is able to optimise it avoid allocations, but when in doubt, you should check that by hand. With time, you get a better feeling what works and what not, but this might also change from release to release :wink:

I’d recommend to pull out that line and investigate it. You might have to write your own function to make it efficient. It always depends on your usecase.

Here is a dumb example which shows the allocation and an optimised version of the oneline, however, as you can see, the overall CPU time is the same. The optimised version however does not allocate, on the other hand you lose the accuracy and speed of the highly tuned sum function:

julia> n = 100_000; a = rand(n); b = rand(n); c = rand(n);

julia> thecalculation(a, b, c) = sum(a .+ b .* c)
thecalculation (generic function with 1 method)

julia> function thecalculation_optimised(a, b, c)
           s = 0.0
           for i in eachindex(a)
               s += a[i] + b[i] * c[i]
           end
           s
       end
thecalculation_optimised (generic function with 1 method)

julia> @btime thecalculation($a, $b, $c);
  103.171 μs (2 allocations: 781.33 KiB)

julia> @btime thecalculation_optimised($a, $b, $c);
  106.209 μs (0 allocations: 0 bytes)
1 Like

Ok i get it now after a few tests.

Do you see anything else that could obviously be done to my code ?

You can obtain substantial benefits from the algorithmic point of view (not a Julia thing but the way you compute the coefficients).
For example:

for i in 1:m
     FACTS[i] = factorial(i-1)
end 

This calls factorial one time per iteration (slow) but you can use the fact that n!=n \cdot (n-1)! (the recursive definition of the factorial) and write simply:

FACTS[1]=1
for i in 2:m
       FACTS[i] = n*factorial(i-1)
end 

Witch use only one multiplication per coefficient.

The same thing can be done with BINS (because binomial is defined in terms of factorials) and with the powers of two since (-2)^n=-2\cdot(-2)^{n-1} and by applying this together you can simplify the Laguerre coefficients computation.

In fact, the best way of computing the Laguerre coefficients is to use the recurrence relation (https://en.wikipedia.org/wiki/Laguerre_polynomials#Recursive_definition,_closed_form,_and_generating_function) and not the closed form you use.

2 Likes

Yeah i know that, but i dont care since these are precomputations for me, not part of the final run where time matters. But thanks for the head’s up !

Under the assumption that accessing an array will always be faster than re-computing it, this will stay in the precomputations :slight_smile:

It depends on the problem sometimes is faster to just do the computation (if the computation are not very intensive…) but in your case I guess using a precomputed array is the way to go :+1:.

Looking at your code when I see prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)] it suggest that maybe there is a way to reorder the computations so that instead of calling prod you can do a multiplication for each k.

But to be fair I don’t understand fully what your code does because I’m not used to cartesianindex stuff :sweat_smile:

If you want, this is equivalent to prod([P.BINS[i,j] for (i,j) in zip(k_minus_on_in_deg,j_arr)]).

Accumulating them, if this is what you mean, might indeed be possible. I’ll look into it. Thx :slight_smile:

Probably you don’t need this if you expand the line where j_arr is used into an appropriate loop instead of trying to broadcast everything.

In the far far country where I came from, broadcasted calls are a lot faster than for loops. Old habbits… I gained 15% runtime by a few more loops instead of broadcasted calls ! Thanks for the hint.

2 Likes

The basic issue here is that, for BigFloat and BigInt (“bignum”) calculations, the arithmetic cost will typically dominate any algorithm. Almost all that matters is probably the number of bignum operations. Type stability, allocation of temporary arrays, and all of the other usual Julia considerations are likely to have relatively small impact. So, you need to focus on algorithmic improvements that reduce the number of arithmetic operations.

5 Likes

This is clearly an issue. I could do more precomputations to replace prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)]), which demands n product of bigfloats by only one indexing, but it gave me an OutOfMemory error if i want to cover all cases upfront :wink:

What about https://github.com/JuliaArbTypes/ArbFloats.jl ?

Yes, you could also try other bignum libraries. The built-in BigFloat uses GNU MPFR.

2 Likes

I know this is quite old, but I just wanted to note that there is an efficient one-liner that avoids memory allocation:

thecalculation_simpler(a, b, c) = sum(t -> muladd(t...), zip(b, c, a))

There are a lot of primitives already available, and sometimes it takes a bit of experimentation to find creative ways to combine them in an efficient manner.

Running

n = 1_000_000; a = rand(n); b = rand(n); c = rand(n);
@btime thecalculation(a, b, c)
@btime thecalculation_optimised(a, b, c)
@btime thecalculation_simpler(a, b, c)

for me reports

1.662 ms (3 allocations: 7.63 MiB)
1.359 ms (1 allocation: 16 bytes)
1.322 ms (1 allocation: 16 bytes)

I think you replied to the wrong thread, @FedericoStra