Did a little refactoring, as gdalle suggested, to make the code totally unrecognizable. Specifically, I used the semantic meaning of the loop variables to find a more simply calculated expression.
It goes like this (using memoization also):
# returns number of perms of 1..n with largest cycle of size k
function loop!(M, CM, n, k)
    n < k && return 0.0              # base and trivial cases
    n == 0 && k == 0 && return 1.0
    k <= 0 && return 0.0
    n <= 0 && return 0.0
    M[n,k] >= 0.0 && return M[n,k]   # memoization
    if k == 1                        # only the identity has all cycles of size 1
        M[n,k] = 1.0
        return 1.0
    end
    factor = 1.0                     # non base case calculation
    res = 0.0
    for i in 1:k-1                   # cycle containing element 1 has size i < k
        res += factor * loop!(M, CM, n-i, k)
        factor *= n - i
    end
    if n > k                         # cycle containing element 1 has size exactly k
        j = k
        while j > 1 && CM[n-k,j] < 0.0
            j -= 1
        end
        for i in j:k                 # second memoization for cumsum of M
            CM[n-k,i] = (i > 1 ? CM[n-k,i-1] : 0.0) + loop!(M, CM, n-k, i)
        end
        res += factor * CM[n-k,k]
    else                             # n == k: the whole perm is one n-cycle
        res += factor
    end
    M[n,k] = res
    return res
end
To use this function, the memoization matrices need to be initialized and passed as parameters, as follows:
julia> const M = fill(-1.0,100,100);
julia> const CM = fill(-1.0,100,100);
julia> @time [loop!(M, CM, 100, i) for i in 1:100];
0.027222 seconds (23.15 k allocations: 1.561 MiB, 98.51% compilation time)
The timing by itself is not very informative, as I'm not including anything to compare it with, but it looks like at least a 1000x speedup.
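As a correctness check (mine, not from the original post), the counts can be compared against a brute-force tally over all permutations for small n. The helper names all_perms, largest_cycle, and brute_counts below are hypothetical, not from the original code:

```julia
# All permutations of 1..n, built by inserting n into each position of
# every permutation of 1..n-1.
function all_perms(n)
    n == 0 && return [Int[]]
    out = Vector{Vector{Int}}()
    for p in all_perms(n - 1), pos in 1:n
        q = copy(p)
        insert!(q, pos, n)
        push!(out, q)
    end
    return out
end

# Size of the largest cycle of permutation p (one-line notation).
function largest_cycle(p)
    seen = falses(length(p))
    best = 0
    for s in eachindex(p)
        seen[s] && continue
        len, j = 0, s
        while !seen[j]       # follow the cycle starting at s
            seen[j] = true
            len += 1
            j = p[j]
        end
        best = max(best, len)
    end
    return best
end

# counts[k] = number of perms of 1..n whose largest cycle has size k.
function brute_counts(n)
    counts = zeros(Int, n)
    for p in all_perms(n)
        counts[largest_cycle(p)] += 1
    end
    return counts
end
```

For n = 4 this gives [1, 9, 8, 6] (the identity; 6 transpositions plus 3 double transpositions; 8 three-cycles; 6 four-cycles), which is what `[loop!(M, CM, 4, k) for k in 1:4]` should reproduce.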
UPDATE: To benchmark, adding a main as in previous posts:
function main(n)
    M = fill(-1.0, n, n)
    CM = fill(-1.0, n, n)
    y = Vector{Float64}(undef, n)
    fac = cumprod(float.(1:n))
    for i in 1:n
        y[i] = loop!(M, CM, n, i) / fac[n]
    end
    return y
end
Comparing this main to one of the original oldmains:
julia> @time oldmain(100);
5.305958 seconds (4 allocations: 3.500 KiB)
julia> @time main(100);
0.001171 seconds (7 allocations: 158.969 KiB)
(a roughly 4500x speedup)
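A side note from me: @time on a first call folds compilation into the measurement (the 98.51% compilation time above). For steadier numbers, BenchmarkTools.jl's @btime runs the call repeatedly after warm-up and reports the minimum. A sketch, assuming main and oldmain are already defined in the session:

```julia
using BenchmarkTools   # external package: ] add BenchmarkTools

# Minimum time over many samples, compilation excluded.
@btime main(100);
@btime oldmain(100);
```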