Help with speeding up this code

If not all performance needs to be squeezed out, just memoizing the straight-forward implementation might already be enough:

using Memoize

@memoize Dict function facm(n::T) where {T <: Integer}
    if iszero(n)
        one(T)
    else
        n * facm(n - 1)
    end
end

@memoize Dict function loopm(n::T, k::T) where {T <: Integer}
    if n < zero(T) || k <= zero(T)
        error("Invalid values")
    elseif n > zero(T) && n < k
        error("Invalid values")
    end

    if k == one(T) || n == zero(T)
        return one(T)
    end
    cnt = zero(T)
    for j = 1:(n÷k)
        m = min(k-1, n-j*k)
        if m == zero(T)
             m = one(T)
        end
        recur = sum(loopm(n-j*k, t) for t = 1:m)
        cnt += recur*(facm(n) ÷ facm(n-j*k)) ÷ (facm(j)*(k^j))
    end
    return cnt
end

function main(n)
    [loopm(n,i) / facm(n) for i = 1:n]
end
2 Likes

My generic advice here is that if you want recursion to be fast (in any language), you should ideally enlarge the base case (“coarsen” the recursion) — see this comment.

(That being said, maybe with memo-ization here you won’t call the trivial base cases enough to matter?)

2 Likes

Did a little refactoring, as gdalle said, to make the code totally unrecognizable. Specifically, used the semantic meaning of the loops values to find a more simply calculated expression.
It goes like this (using memoization also):

# returns number of perms of 1..n with largest cycle of size k
function loop!(M,CM,n,k)
    n < k && return 0.0  # base and trivial cases
    n == 0 && k == 0 && return 1.0
    k <= 0 && return 0.0
    n <= 0 && return 0.0
    M[n,k] >= 0.0 && return M[n,k]  # memoization
    if k == 1
        M[n, k] = 1.0
        return 1.0
    end
    factor = 1.0   # non base case calculation
    res = 0.0
    for i in 1:k-1
        res += factor * loop!(M,CM,n-i,k)
        factor *= n-i
    end
    if n>k
        j = k
        while j>1 && CM[n-k,j] < 0.0
            j -= 1
        end
        for i in j:k   # second memoization for M cumsum
            CM[n-k, i] = (i>1 ? CM[n-k, i-1] : 0.0) + loop!(M,CM,n-k, i)
        end    
        res += factor * CM[n-k,k]
    else
        res += factor
    end
    M[n,k] = res
    return res
end

To use this function, the memoization matrices need to be initialized and passed as parameters, as follows:

julia> const M = fill(-1.0,100,100);

julia> const CM = fill(-1.0,100,100);

julia> @time [loop!(M, CM, 100, i) for i in 1:100];
  0.027222 seconds (23.15 k allocations: 1.561 MiB, 98.51% compilation time)

The timing is not very informative, as I’m not including any other time to compare with. But it seems at least >1000x speedup.

UPDATE: To benchmark, adding a main as in previous posts:

function main(n)
    M = fill(-1.0, n,n)
    CM = fill(-1.0, n,n)
    y = Vector{Float64}(undef, n)
    fac = cumprod(float.(1:n))
    for i = 1:n
        y[i] = loop!(M, CM, n, i) / fac[n]
    end
    return y
end

with this main and comparing to one of the original oldmains:

julia> @time oldmain(100);
  5.305958 seconds (4 allocations: 3.500 KiB)

julia> @time main(100);
  0.001171 seconds (7 allocations: 158.969 KiB)

( 5000x speedup )

4 Likes

Since you kept the floats this might come at the cost of numerical errors that are hard to quantify. I’d love to give IntervalArithmetic a go here, to figure out how much precision we lose when giving up on BigInt

I tried Dan’s code at n=100 and it is fastest but that is not the case when I tried it with n=200. The code of gdalle with better memoization near the top of this thread runs a lot faster at n=200. Here is what I measured:

Dan(200):
1004.275557 seconds (61.17 k allocations: 4.950 MiB, 0.01% compilation time)

gdale(200):
2.237682 seconds (23.36 M allocations: 488.616 MiB, 25.59% gc time, 5.58% compilation time)

Also at n=100, the bar graph with Dan’s code is what it should be. However at n=200 for Dan’s code, I only get the first couple of bars in bar graph and nothing more. Many similar warnings to the following accompany the bar graph:

┌ Warning: Indices Base.OneTo(200) of attribute seriescolor does not match data indices 1:4.
└ @ Plots C:\Users\Mark.julia\packages\Plots\QWCV6\src\utils.jl:141
┌ Info: Data contains NaNs or missing values, and indices of seriescolor vector do not match data indices.
│ If you intend elements of seriescolor to apply to individual NaN-separated segments in the data,
│ pass each segment in a separate vector instead, and use a row vector for seriescolor. Legend entries
│ may be suppressed by passing an empty label.
│ For example,
└ plot([1:2,1:3], [[4,5],[3,4,5]], label=[“y” “”], seriescolor=[1 2])

That’s weird. Maybe when we memoize too much we end up spending a lot of time in the memory access

There seems to be some overflow (or other trouble) to produce NaNs. Also, my version uses a lot more memory (it says 5GB) which could potentially exhaust the memory on the machine which causes serious slowdown.

One workaround would be to use 128 float numbers. This can be found in DoubleFloats package (GitHub - JuliaMath/DoubleFloats.jl: math with more good bits ).

It’s worth finding the first n for which Dan(n) starts misbehaving (for easier debugging).

ADDENDUM:

The maximum number representable in Float64 is 1.7E+308.
While 200! is 7.9E+374. So that’s a problem.

The DoubleFloat solution should work for that. As for the speed, it may be related, as NaN would mess up the memoization conditions.

doublefloats won’t help here. they don’t have a wider range. you might want float128 instead.

1 Like

Thanks. I was just quickly googling longer floats. Without a REPL open.
Float128 is provided by the Quadmath package.
Example:

julia> using Quadmath

julia> prod([Float128(i) for i in 1:200])
7.88657867364790503552363213932184275e+374

The fixed up code works okay for n=200:

using Quadmath

# returns number of perms of 1..n with largest cycle of size k
function loop!(M::Matrix{T},CM::Matrix{T},n::Integer,k::Integer) where {T}
    n < k && return zero(T)  # base and trivial cases
    n == 0 && k == 0 && return one(T)
    k <= 0 && return zero(T)
    n <= 0 && return zero(T)
    M[n,k] >= 0 && return M[n,k]  # memoization
    if k == 1
        M[n, k] = one(T)
        return one(T)
    end
    factor = one(T)   # non base case calculation
    res = zero(T)
    for i in 1:k-1
        res += factor * loop!(M,CM,n-i,k)
        factor *= n-i
    end
    if n>k
        j = k
        while j>1 && CM[n-k,j] < 0
            j -= 1
        end
        for i in j:k   # second memoization for M cumsum
            CM[n-k, i] = (i>1 ? CM[n-k, i-1] : zero(T)) + loop!(M,CM,n-k, i)
        end    
        res += factor * CM[n-k,k]
    else
        res += factor
    end
    M[n,k] = res
    return res
end

function main(::Type{T}, n) where {T<:Real}
    M = fill(-one(T), n,n)
    CM = fill(-one(T), n,n)
    y = Vector{T}(undef, n)
    fac = cumprod(T.(1:n))
    for i = 1:n
        y[i] = loop!(M, CM, n, i) / fac[n]
    end
    return y
end

main(n) = main(Float64, n)

With these definitions:

julia> main(10)
10-element Vector{Float64}:
 2.755731922398589e-7
 0.00261656746031746
...
julia> main(Float128, 200)
200-element Vector{Float128}:
 1.26797695348096242175301637107480335e-375
 4.65660197354737973222078346996522418e-183
...
1 Like

Dan(200):
0.200233 seconds (76.84 k allocations: 6.629 MiB, 43.89% compilation time)

1 Like

3 posts were split to a new topic: Question about sum(x → …, 1:n) notation