Could this simple function go faster?

lrnv · November 12, 2020, 8:27am

Hi,

First time ever writing julia code. I’ve spend some time trying to optimise this function, mainly making it type-stable, and separate some precomputations. But maybe this is not the best way to go.

The resulting code is a lot faster than it’s python mpmath equivalent (about 100x faster, thanks a lot julia for natively supporting BigFloats!), but i wander if i can squeeze more out of it: this code gets integrated into a loss function, and will therefore be evaluated a lot.

Do your advanced eyes see obvious performance bottleneck in the following code that my do not ? Especially, i wander if the CartesianIndex.() syntax is not an issue here.

Here’s the code :


# A structure for precomputations.
struct PreComp
    BINS
    LAGUERRE
    FACTS
    PREC
    MAX_SUM_OF_M
    function PreComp(precision, m)
        setprecision(precision)
        m = big(m)
        BINS = zeros(eltype(m),(m,m))
        FACTS = zeros(eltype(m),m)
        LAGUERRE = zeros(BigFloat,(m,m))
        for i in 1:m
            FACTS[i] = factorial(i-1)
        end
        for i in 1:m, j in 1:m
            BINS[i,j] = binomial(i-1,j-1)
            LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
        end
        new(BINS,LAGUERRE,FACTS,precision,m)
    end
end
const P = PreComp(2000,50)

# The main function
function get_coefficients(alpha, scales, m)
    # alpha must be an array with size (n,)
    # scales must be an array with size (n,d)
    # max_p must be a Tuple of integers with size (d,)

    # Trim down null alphas values:
    are_pos = alpha .!= 0
    scales = scales[are_pos,:]
    alpha = alpha[are_pos]

    # Allocates ressources, construct the simplex expression of the scales and the indexes.
    coefs = zeros(eltype(alpha),m)
    n = size(scales)[1]
    d = ndims(coefs)
    kappa = zeros(eltype(alpha),size(coefs))
    mu = deepcopy(kappa)
    S = scales ./ (big(1.0) .+ sum(scales,dims=2))
    I = CartesianIndices(coefs)

    for k in I

        if k == I[1]
            # Edge case
            kappa[1] = sum(alpha .* log.(big(1.0) .- sum(S,dims=2)))
            coefs[1] = mu[1] = exp(kappa[1])
        else
            # Indices and organisation
            tk = Tuple(k)
            k_arr = [ki for ki in tk]
            degree = findfirst(tk .!= 1)
            sk = sum(k_arr)
            k_minus_one_in_deg = copy(k_arr)
            k_minus_one_in_deg[degree] -= 1

            # Main Loop
            kappa[k] = P.FACTS[sk-d] * sum(alpha .* prod(S .^ transpose(big.(k_arr .- 1)),dims=2))
            for j in CartesianIndices(k)
                j_arr = [i for i in Tuple(j)]
                if j[degree] < k[degree]
                    mu[k] += mu[j] * kappa[k - j + I[1]] * prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)])
                end
                coefs[k] += mu[j] * prod(P.LAGUERRE[CartesianIndex.(k_arr,j_arr)])
            end
        end
    end
    coefs *= sqrt(big(2.0))^d
    return coefs
end

Here are some benchmark of applications :

# Some benchmarks : 

setprecision(2000)

using BenchmarkTools
println("1D test")
alpha = [10, 10, 10]
scales = [0.5; 0; 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (20)
@btime coefs1 = get_coefficients(alpha, scales,m)

println("2D test")
alpha = [10, 10]
scales = [0.5 0.1; 0.1 0.5]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10)
@btime coefs2 = get_coefficients(alpha, scales,m)

println("3D test")
alpha = [10, 10, 10, 10]
scales = [0.5 0.0 0.0; 0.0 0.5 0.0; 0.0 0.0 0.5; 1 1 1]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (10,10,10)
@btime coefs3 = get_coefficients(alpha, scales,m)

println("4D test")
alpha = [10, 10, 5, 8, 9, 10, 10]
scales = [0.5 0.0 0.0 0.0; 0.0 0.5 0.0 0.0; 0.0 0.0 0.5 0.0; 1 1 1 1; 2 2 2 2; 2 0 2 0; 0 2 0 2]
alpha = convert.(BigFloat,alpha)
scales = convert.(BigFloat,scales)
m = (5,5,5,5)
@btime coefs4 = get_coefficients(alpha, scales,m)

The corresponding output on my machine is :

1D test
500.500 μs (4160 allocations: 551.06 KiB)
2D test
  7.774 ms (60988 allocations: 7.47 MiB)
3D test
  475.745 ms (3869963 allocations: 459.11 MiB)
4D test
  194.400 ms (1380041 allocations: 162.29 MiB)

What do you think ?

rveltz · November 12, 2020, 8:46am

lrnv:

struct PreComp
    BINS
    LAGUERRE
    FACTS
    PREC
    MAX_SUM_OF_M
    function PreComp(precision, m)
        setprecision(precision)
        m = big(m)
        BINS = zeros(eltype(m),(m,m))
        FACTS = zeros(eltype(m),m)
        LAGUERRE = zeros(BigFloat,(m,m))
        for i in 1:m
            FACTS[i] = factorial(i-1)
        end
        for i in 1:m, j in 1:m
            BINS[i,j] = binomial(i-1,j-1)
            LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
        end
        new(BINS,LAGUERRE,FACTS,precision,m)
    end
end

Hi, you should me it so that the fields are typed:

struct PreComp{Tb,Tl,Tf,Tp,Tm}
    BINS::Tb
    LAGUERRE::Tl
    FACTS::Tf
    PREC::Tp
    MAX_SUM_OF_M::Tm
end
function PreComp(precision, m)
        setprecision(precision)
        m = big(m)
        BINS = zeros(eltype(m),(m,m))
        FACTS = zeros(eltype(m),m)
        LAGUERRE = zeros(BigFloat,(m,m))
        for i in 1:m
            FACTS[i] = factorial(i-1)
        end
        for i in 1:m, j in 1:m
            BINS[i,j] = binomial(i-1,j-1)
            LAGUERRE[i,j] = -BINS[i,j]/FACTS[j]*(-big(2.0))^(j-1)
        end
       PreComp(BINS,LAGUERRE,FACTS,precision,m)
end

This line allocates:

sum(alpha .* prod(S .^ transpose(big.(k_arr .- 1)),dims=2))

lrnv · November 12, 2020, 8:52am

What to you mean by ‘This line allocates’ ? I mean, i understand that this lines produces allocations of memory, but do you mean it’s a bad thing ? Should i try to make it allocates less ?

tamasgal · November 12, 2020, 9:31am

The main point is that many things are happening in that line and intermediate (temporary) arrays are created. There are cases where the compiler is able to optimise it avoid allocations, but when in doubt, you should check that by hand. With time, you get a better feeling what works and what not, but this might also change from release to release

I’d recommend to pull out that line and investigate it. You might have to write your own function to make it efficient. It always depends on your usecase.

Here is a dumb example which shows the allocation and an optimised version of the oneline, however, as you can see, the overall CPU time is the same. The optimised version however does not allocate, on the other hand you lose the accuracy and speed of the highly tuned sum function:

julia> n = 100_000; a = rand(n); b = rand(n); c = rand(n);

julia> thecalculation(a, b, c) = sum(a .+ b .* c)
thecalculation (generic function with 1 method)

julia> function thecalculation_optimised(a, b, c)
           s = 0.0
           for i in eachindex(a)
               s += a[i] + b[i] * c[i]
           end
           s
       end
thecalculation_optimised (generic function with 1 method)

julia> @btime thecalculation($a, $b, $c);
  103.171 μs (2 allocations: 781.33 KiB)

julia> @btime thecalculation_optimised($a, $b, $c);
  106.209 μs (0 allocations: 0 bytes)

lrnv · November 12, 2020, 10:09am

Ok i get it now after a few tests.

Do you see anything else that could obviously be done to my code ?

Dictino · November 12, 2020, 10:24am

You can obtain substantial benefits from the algorithmic point of view (not a Julia thing but the way you compute the coefficients).
For example:

for i in 1:m
     FACTS[i] = factorial(i-1)
end

This calls factorial one time per iteration (slow) but you can use the fact that n!=n \cdot (n-1)! (the recursive definition of the factorial) and write simply:

FACTS[1]=1
for i in 2:m
       FACTS[i] = n*factorial(i-1)
end

Witch use only one multiplication per coefficient.

The same thing can be done with BINS (because binomial is defined in terms of factorials) and with the powers of two since (-2)^n=-2\cdot(-2)^{n-1} and by applying this together you can simplify the Laguerre coefficients computation.

In fact, the best way of computing the Laguerre coefficients is to use the recurrence relation (Laguerre polynomials - Wikipedia) and not the closed form you use.

lrnv · November 12, 2020, 10:27am

Yeah i know that, but i dont care since these are precomputations for me, not part of the final run where time matters. But thanks for the head’s up !

Under the assumption that accessing an array will always be faster than re-computing it, this will stay in the precomputations

Dictino · November 12, 2020, 11:08am

It depends on the problem sometimes is faster to just do the computation (if the computation are not very intensive…) but in your case I guess using a precomputed array is the way to go .

Looking at your code when I see prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)] it suggest that maybe there is a way to reorder the computations so that instead of calling prod you can do a multiplication for each k.

But to be fair I don’t understand fully what your code does because I’m not used to cartesianindex stuff

lrnv · November 12, 2020, 11:24am

If you want, this is equivalent to prod([P.BINS[i,j] for (i,j) in zip(k_minus_on_in_deg,j_arr)]).

Accumulating them, if this is what you mean, might indeed be possible. I’ll look into it. Thx

lmiq · November 12, 2020, 2:01pm

Probably you don’t need this if you expand the line where j_arr is used into an appropriate loop instead of trying to broadcast everything.

lrnv · November 12, 2020, 2:33pm

In the far far country where I came from, broadcasted calls are a lot faster than for loops. Old habbits… I gained 15% runtime by a few more loops instead of broadcasted calls ! Thanks for the hint.

stevengj · November 12, 2020, 2:41pm

The basic issue here is that, for BigFloat and BigInt (“bignum”) calculations, the arithmetic cost will typically dominate any algorithm. Almost all that matters is probably the number of bignum operations. Type stability, allocation of temporary arrays, and all of the other usual Julia considerations are likely to have relatively small impact. So, you need to focus on algorithmic improvements that reduce the number of arithmetic operations.

lrnv · November 12, 2020, 3:27pm

This is clearly an issue. I could do more precomputations to replace prod(P.BINS[CartesianIndex.(k_minus_one_in_deg,j_arr)]), which demands n product of bigfloats by only one indexing, but it gave me an OutOfMemory error if i want to cover all cases upfront

What about GitHub - JuliaArbTypes/ArbFloats.jl: Much faster than BigFloat at precisions up to 3,500 bits (1050 digits) ?

stevengj · November 12, 2020, 5:03pm

Yes, you could also try other bignum libraries. The built-in BigFloat uses GNU MPFR.

FedericoStra · November 22, 2020, 6:06pm

I know this is quite old, but I just wanted to note that there is an efficient one-liner that avoids memory allocation:

thecalculation_simpler(a, b, c) = sum(t -> muladd(t...), zip(b, c, a))

There are a lot of primitives already available, and sometimes it takes a bit of experimentation to find creative ways to combine them in an efficient manner.

Running

n = 1_000_000; a = rand(n); b = rand(n); c = rand(n);
@btime thecalculation(a, b, c)
@btime thecalculation_optimised(a, b, c)
@btime thecalculation_simpler(a, b, c)

for me reports

1.662 ms (3 allocations: 7.63 MiB)
1.359 ms (1 allocation: 16 bytes)
1.322 ms (1 allocation: 16 bytes)

lrnv · November 22, 2020, 7:46pm

I think you replied to the wrong thread, @FedericoStra

Topic		Replies	Views
And Julia keeps amazing me: high precision computation Community precision , bigfloat	18	2173	December 22, 2023
Memory allocations in for loop variable Performance question	28	3639	August 31, 2018
Simple speed comparison with Matlab New to Julia question	33	1259	May 6, 2025
Threads are not speeding up computation a lot Performance threads	70	1188	March 20, 2025
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1917	June 25, 2023

Could this simple function go faster?

Related topics