Reduce allocations and speed up

daviddoij · July 24, 2021, 8:16am

Hi there,

in my attempt to teach myself Julia, I’m solving Project Euler problems. The following code works and give the correct answer, although I was expecting better performance, as the time is similar to what I got using Python. Here it is the code:

function chain_length(n, terms)
    length = 0
    while n != 1
        if haskey(terms, n)
            length += terms[n]
            break
        end
        if n % 2 == 0
            n = n / 2
        else
            n = 3n + 1
        end
        length += 1
    end
    return length
end

function Problem14()
    #=
    The following iterative sequence is defined for the set of positive
    integers:

    n → n/2 (n is even)
    n → 3n + 1 (n is odd)

    Using the rule above and starting with 13, we generate the following
    sequence:

    13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1

    It can be seen that this sequence (starting at 13 and finishing at 1)
    contains 10 terms. Although it has not been proved yet (Collatz Problem),
    it is thought that all starting numbers finish at 1.

    Which starting number, under one million, produces the longest chain?

    NOTE: Once the chain starts the terms are allowed to go above one million.
    =#

    ans = 0
    limit = 1_000_000
    score = 0
    terms = Dict()
    for i in 1:limit
        terms[i] = chain_length(i, terms)
        if terms[i] > score
            score = terms[i]
            ans = i
        end
    end
    return ans
end

thanks in advance.

jules · July 24, 2021, 8:52am

I see two problems: You do n = n / 2 but the division operator / returns Float64 even when called with Ints that can in theory be divided into an Int again. You want the ÷ operator (typed \div). This will make that part of the function type stable.

The other part that is type unstable is that you’re defining a plain Dict(), which is equal to Dict{Any,Any}), even though you know you are always going to store Int => Int pairs in it. So you should change that to Dict{Int, Int}(). With those two changes I get:

 0.393216 seconds (54 allocations: 65.170 MiB)

Whereas with your previous version I got

4.184902 seconds (17.95 M allocations: 339.074 MiB, 27.09% gc time)

jishnub · July 24, 2021, 9:01am

Allocations may further be reduced by using sizehint!

function chain_length(n, terms)
    length = 0
    while n != 1
        if haskey(terms, n)
            length += terms[n]
            break
        end
        if n % 2 == 0
            n = n ÷ 2
        else
            n = 3n + 1
        end
        length += 1
    end
    return length
end

function Problem14()
    #=
    The following iterative sequence is defined for the set of positive
    integers:

    n → n/2 (n is even)
    n → 3n + 1 (n is odd)

    Using the rule above and starting with 13, we generate the following
    sequence:

    13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1

    It can be seen that this sequence (starting at 13 and finishing at 1)
    contains 10 terms. Although it has not been proved yet (Collatz Problem),
    it is thought that all starting numbers finish at 1.

    Which starting number, under one million, produces the longest chain?

    NOTE: Once the chain starts the terms are allowed to go above one million.
    =#

    ans = 0
    limit = 1_000_000
    score = 0
    terms = Dict{Int, Int}()
    sizehint!(terms, limit)
    for i in 1:limit
        length_i = chain_length(i, terms)
        terms[i] = length_i
        if length_i > score
            score = length_i
            ans = i
        end
    end
    return ans
end

this leads to

julia> @time Problem14();
  0.379844 seconds (7 allocations: 34.001 MiB)

Edit: the shortcut of replacing (3n+1) by (3n+1)/2 helps:

function chain_length(n, terms)
    length = 0
    while n != 1
        if haskey(terms, n)
            length += terms[n]
            break
        end
        if n % 2 == 0
            n = n ÷ 2
        else
            n = (3n + 1) ÷ 2
            length += 1
        end
        length += 1
    end
    return length
end

julia> @time Problem14()
  0.314625 seconds (7 allocations: 34.001 MiB, 0.49% gc time)
837799

jishnub · July 24, 2021, 10:04am

This seems to be a problem that would benefit from more aggressive memoizaiton, but for some reason there’s no difference in performance (presumably because this needs to store more terms?).

Code

function get_chain_length!(terms, n)
    if haskey(terms, n)
        return terms[n]
    end
    if n % 2 == 0
        n_new = n ÷ 2
    else
        n_new = 3n + 1
    end
    length = get_chain_length!(terms, n_new) + 1
    terms[n] = length
    length
end

function Problem14()
    #=
    The following iterative sequence is defined for the set of positive
    integers:

    n → n/2 (n is even)
    n → 3n + 1 (n is odd)

    Using the rule above and starting with 13, we generate the following
    sequence:

    13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1

    It can be seen that this sequence (starting at 13 and finishing at 1)
    contains 10 terms. Although it has not been proved yet (Collatz Problem),
    it is thought that all starting numbers finish at 1.

    Which starting number, under one million, produces the longest chain?

    NOTE: Once the chain starts the terms are allowed to go above one million.
    =#

    ans = 0
    limit = 1_000_000
    score = 0
    terms = Dict{Int, Int}()
    sizehint!(terms, 2limit)
    terms[1] = 0
    for i in 2:limit
        length_i = get_chain_length!(terms, i)
        if length_i > score
            score = length_i
            ans = i
        end
    end
    return ans
end

julia> @time Problem14()
  0.363303 seconds (7 allocations: 68.001 MiB)
837799

A marginal improvement is gained by replacing (3n+1) by (3n+1)/2:

function get_chain_length!(terms, n)
    if haskey(terms, n)
        return terms[n]
    end
    length = 0
    if n % 2 == 0
        n_new = n ÷ 2
    else
        n_new = (3n + 1) ÷ 2
        length = 1
    end
    length += get_chain_length!(terms, n_new) + 1
    terms[n] = length
    length
end

julia> @time Problem14()
  0.315037 seconds (7 allocations: 68.001 MiB)
837799

cjdoris · July 24, 2021, 3:32pm

Since all the keys in terms are small positive integers, it may be faster to use a Vector{Int} instead, using a sentinel value like -1 to indicate missing values.

Skoffer · July 24, 2021, 4:02pm

Potentially, you can speed up calculations with bitwise operations. Instead of n % 2 == 0 you can do n & 1 == 0. Maybe compiler can figure it out on it’s own, but at least it’s worth a try. Same goes for n \div 2, you can use n >> 1

Sukera · July 24, 2021, 4:17pm

I’d use iseven or isodd here instead of the modulus check.

DNF · July 24, 2021, 5:13pm

divrem can perhaps be used here.

jules · July 24, 2021, 6:57pm

Interestingly, the “dumb” version without memoization is faster for me:

function chain_length(n)
    length = 0
    while n != 1
        if iseven(n)
            n = n ÷ 2
        else
            n = 3n + 1
        end
        length += 1
    end
    return length
end

function problem_14(range)
    k = 0
    l = 0
    for i in range
        ll = chain_length_2(i)
        if ll > l
            k = i
            l = ll
        end
    end
    k
end

julia> @time problem_14(1:1_000_000)

  0.263881 seconds
837799 => 524

And this dumb version can also be threaded easily, and it scales perfectly because it’s complete divide and conquer:

function problem_14_threads()
    ks = fill(0=>0, 4)
    x = 250_000

    Threads.@threads for t in 1:4
        rng = (t-1) * x + 1 : t * x
        ks[t] = problem_14(rng)
    end
    ks[argmax(last.(ks))][1]
end

julia> @time problem_14_threads()

0.080029 seconds (26 allocations: 1.969 KiB)
837799

dmbates · July 24, 2021, 8:16pm

It may not help much with speed but you could build around get! if you want to memoize the values.

const lengths = Dict{Int, Int}(1 => 1)
function getlength!(lengths, n)
    get!(lengths, n) do 
        1 + getlength!(lengths, isodd(n) ? 3n + 1 : (n >> 1))
    end
end
findmax([getlength!(lengths, i) for i in 1:1000000])

I haven’t worked out how to avoid the comprehension in the call to findmax. You don’t need to store the values if you are just trying to get the maximum but I haven’t decided how to make the expression in the comprehension into an iterator.

dmbates · July 24, 2021, 10:04pm

So I did manage to work out how to use a generator expression as an argument to findmax. You construct a key-value pair but, curiously the “key” is the array value and the “value” is the index. (If I would have looked at the result of findmax that would have given me the hint.) So the above can be written

julia> findmax(getlength!(lengths, i) => i for i in 1:1000000)
(525, 837799)

To benchmark this approach fairly the Dict should be reinitialized for each evaluation.

julia> @benchmark findmax(Pair(getlength!(lengths, i), i) for i in 1:1000000)  setup=(lengths = Dict{Int,Int}(1 => 1))
BenchmarkTools.Trial: 20 samples with 1 evaluation.
 Range (min … max):  245.319 ms … 308.182 ms  ┊ GC (min … max): 0.73% … 20.86%
 Time  (median):     249.373 ms               ┊ GC (median):    1.07%
 Time  (mean ± σ):   255.664 ms ±  17.785 ms  ┊ GC (mean ± σ):  3.07% ±  5.19%

   ▂▂ █                                                          
  ██████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  245 ms           Histogram: frequency by time          308 ms <

 Memory estimate: 133.17 MiB, allocs estimate: 56.

An alternative, recursive function that does not memoize would be

getlengthr(n) = isone(n) ? 1 : 1 + getlengthr(isodd(n) ? 3n + 1 : (n >> 1))

but that is a little slower

julia> @benchmark findmax(getlengthr(i) => i for i in 1:1000000)
BenchmarkTools.Trial: 17 samples with 1 evaluation.
 Range (min … max):  300.419 ms … 304.137 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     301.531 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   301.979 ms ±   1.204 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▁ ▁     ▁  █ ▁ ▁▁     ▁   ▁ ▁       █                  ▁  █  
  █▁█▁█▁▁▁▁▁█▁▁█▁█▁██▁▁▁▁▁█▁▁▁█▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  300 ms           Histogram: frequency by time          304 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Turns out for me (single-threaded), a direct, non-recursive version

function getlength(n)
    count = 1
    while !isone(n)
        n = isodd(n) ? 3n + 1 : (n >> 1)
        count += 1
    end
    return count
end

is fastest

julia> @benchmark findmax(getlength(i) => i for i in 1:1000000)
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  111.425 ms … 116.422 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.932 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.910 ms ± 856.555 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁▁    ▄ ▄  ▁  ▄█  ▁▄▁  ▁▁                   
  ▆▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▁▆▆██▆▁▆▆█▆█▆▆█▁▁██▆▁███▁▆██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  111 ms           Histogram: frequency by time          116 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Seif_Shebl · July 25, 2021, 3:32am

This takes only 77 ms on my computer. It’s 2.6X faster than @jules version above. Two small tricks used here,
(1) use n >> 1 as suggested by @Skoffer instead of n ÷ 2, surprisingly the compiler couldn’t do it, and
(2) use the range 1:2:1_000_000 which excludes even numbers which drop in value very quickly, as opposed to odd starting values which increase.

function chain_length(n)
    length = 0
    while n > 1
        n = iseven(n) ? n >> 1 : 3n + 1
        length += 1
    end
    return length
end

function problem_14(rng)
    k = l = 0
    for i in rng
        ll = chain_length(i)
        if ll > l
            k = i
            l = ll
        end
    end
    k
end
@btime problem_14(1:2:1_000_000)
  77.016 ms (0 allocations: 0 bytes)
837799

daviddoij · July 25, 2021, 7:10am

wow! That’s remarkable. Thx for sharing with a newbie

Sukera · July 25, 2021, 9:12am

I think you can get rid of the branch by writing

iseven(n) * (n >> 1) + isodd(n) * (3n + 1)

Or hoisting the check outside and inverting the boolean in the above expression.

Not sure if this is enough to get it to SIMD, there may need to be some more tricky transform with the loop condition necessary.

Skoffer · July 25, 2021, 10:13am

We can do it even slightly better

function chain_length(n)
    length = 0
    while n > 1
        n = 3n + 1
        k = trailing_zeros(n)
        n >>= k
        length += k + 1
    end
    return length
end

With this chain_length implementation, time has decreased from 79ms to 52ms on my laptop.

With multithreading (it can be erroneous though on interval edges, so it should be written more carefully, but it still yields correct result in this case for 4 threads)

function problem_14_threads()
    n = Threads.nthreads()
    ks = zeros(Int, n)
    x = 1_000_000 ÷ n

    Threads.@threads for t in 1:n
        rng = (t-1) * x + 1:2:t * x
        ks[t] = problem_14(rng)
    end
    maximum(ks)
end

julia> @btime problem_14_threads()
  15.294 ms (22 allocations: 1.98 KiB)
837799

Seif_Shebl · July 27, 2021, 1:12am

For fun, here is an unbeatable 6 ms version combining bit twidling magic by @Skoffer with memoization. I was lazy and used global consts for maxn and cache array.

function chain_length(n)
    res = 0
    nxt = 3n + 1
    k = trailing_zeros(nxt)
    nxt >>= k
    if nxt >= maxn || nxt < maxn && iszero(cache[nxt])
        res = k + 1 + chain_length(nxt)
    else
        res = k + 1 + cache[nxt]
    end
    if n < maxn
        cache[n] = res
    end
    return res
end

function problem_14(rng)
    cache[1] = 1
    best = 1
    for n in rng
        if iszero(cache[n])
            if cache[best] < chain_length(n) 
                best = n
            end
        end
    end
    best, cache[best]
end

const maxn = 1_000_000
const cache = zeros(Int,maxn)
@btime problem_14(1:2:10^6) setup = (cache .= 0)
  6.039 ms (0 allocations: 0 bytes)
(837799, 525)