Slow Custom Digits Iterator

I was trying to replace calls to digits, which allocates, with a custom iterator, but for some reason calling sum on it is really slow and allocates, even though e.g. collect on it is fast:

using BenchmarkTools

struct DigitsIterator
    base::Int
    n::Int
end

# First call: seed the state with n itself; special-case n == 0 so it still yields one digit
@inline Base.iterate(it::DigitsIterator) = it.n == 0 ? (0, 0) : iterate(it, it.n)
# Subsequent calls: emit the least-significant digit and keep the quotient as state
@inline Base.iterate(it::DigitsIterator, el) = el == 0 ? nothing : reverse(divrem(el, it.base))
Base.length(it::DigitsIterator) = ndigits(it.n, base=it.base)
Base.eltype(::Type{DigitsIterator}) = Int

@benchmark sum(DigitsIterator(10,x)) setup=x=rand(0:10^10)

returns

BenchmarkTools.Trial: 
  memory estimate:  304 bytes
  allocs estimate:  8
  --------------
  minimum time:     2.111 μs (0.00% GC)
  median time:      2.241 μs (0.00% GC)
  mean time:        2.469 μs (5.69% GC)
  maximum time:     1.411 ms (99.53% GC)
  --------------
  samples:          10000
  evals/sample:     9
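
For comparison, here is how the claim about collect can be checked (output omitted; timings are machine-dependent, but collect should only pay for the output array itself):

@btime collect(DigitsIterator(10,x)) setup=x=rand(0:10^10)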

Any idea what could be going on?

Hmm, profiling shows it’s in mapfoldl_impl. Looks like a specialization failure:
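
For reference, the profile can be reproduced with the Profile stdlib (the loop count is arbitrary, just enough to collect samples):

using Profile
Profile.clear()
@profile for _ in 1:10^5
    sum(DigitsIterator(10, 1234567890))
end
Profile.print()   # the hot frames land in mapfoldl_impl in base/reduce.jl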

julia> @btime sum(DigitsIterator(10,x)) setup=x=rand(0:10^10)
  590.500 ns (5 allocations: 112 bytes)
43

julia> Revise.track(Base)

julia> @btime sum(DigitsIterator(10,x)) setup=x=rand(0:10^10)
  31.333 ns (0 allocations: 0 bytes)

where the diff relative to what Julia was compiled with is:

diff --git a/base/reduce.jl b/base/reduce.jl
index 425d8d6f28..b37fc81f9d 100644
--- a/base/reduce.jl
+++ b/base/reduce.jl
@@ -36,7 +36,7 @@ mul_prod(x::Real, y::Real)::Real = x * y
 
 ## foldl && mapfoldl
 
-function mapfoldl_impl(f, op, nt::NamedTuple{(:init,)}, itr, i...)
+function mapfoldl_impl(f::F, op::OP, nt::NamedTuple{(:init,)}, itr, i...) where {F,OP}
     init = nt.init
     # Unroll the while loop once; if init is known, the call to op may
     # be evaluated at compile time
@@ -51,7 +51,7 @@ function mapfoldl_impl(f, op, nt::NamedTuple{(:init,)}, itr, i...)
     return v
 end
 
-function mapfoldl_impl(f, op, nt::NamedTuple{()}, itr)
+function mapfoldl_impl(f::F, op::OP, nt::NamedTuple{()}, itr) where {F,OP}
     y = iterate(itr)
     if y === nothing
         return Base.mapreduce_empty_iter(f, op, itr, IteratorEltype(itr))

Probably worth changing. I’ll submit a PR.
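
Until that lands, a user-side workaround is to bypass mapfoldl entirely with a hand-rolled loop (just a sketch; mysum is an illustrative name, not part of any API):

# Manual reduction loop: no function argument is passed anywhere, so
# there is nothing for the specialization heuristic to miss. This
# should be allocation-free for the iterator above.
function mysum(itr)
    y = iterate(itr)
    y === nothing && throw(ArgumentError("empty iterator"))
    v, st = y
    while true
        y = iterate(itr, st)
        y === nothing && return v
        el, st = y
        v += el
    end
end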

Thanks for the investigation!

I’ve been seeing this kind of effect in a few places lately.

I thought that type annotations were in principle only for filtering methods (dispatch), and should not affect performance. What are the rules that decide whether a function gets specialized or not? Did these heuristics change since the 0.5 days?

This makes programming less predictable, often in a very obscure way…

Thanks to @iamed2 it now has its own documentation section:
https://docs.julialang.org/en/latest/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing-1
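
The short version of that section, as a toy example (hypothetical names, purely to illustrate the heuristic):

# f is only passed through here, never called, so under the heuristic
# Julia may compile one method body shared by all functions instead of
# specializing on the concrete type of f
passthrough(f, x) = call_it(f, x)

# an explicit type parameter forces one specialization per concrete f
forced(f::F, x) where {F} = call_it(f, x)

# f is called in the body, so this method is always specialized on f
call_it(f, x) = f(x)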

Did these heuristics change since the 0.5 days?

I think if anything it’s better than in 0.5. It’s just that Julia’s performance is now so predictably good that this kind of failure really stands out.

The new docs say:

Julia will always specialize when the argument is used within the method

but mapfoldl_impl does use its f and op arguments.

Only the inner version does, not the outer one with no init argument.

Right, so this method shouldn’t need specialization, and so I’m assuming there are no issues there. Is that not the case?

I am trying to understand why there is this heuristic.

If the outer method here does require specialization, even though it does not use (which I guess means call) its arguments, then the heuristic doesn’t make sense.