Help diagnosing a slow iterator

Note, this was asked on Zulip a week ago, but received no responses. Reposting in the hope that someone else might see and comment


I have a tree-like structure, composed of a mix of types, and I’m looking to iterate through each of them in a DFS-like manner. I’ve managed to get some code working, but it’s much slower than I expected and I’m not sure why.

I use a wrapper type for the iterator, and the idea is that I keep the current “tree path” as a stack, if from the current node I can decent further the first child is added to the stack, and when the iterator at the end of the stack is exhausted, the stack is pop!ed.

Here’s the full code for the iterator (for later: line 1 below corresponds to line 1 of the file)

import Base: iterate, length

struct OrgIterator
    o::Org
end

Base.IteratorSize(::OrgIterator) = Base.SizeUnknown()
iterate(it::OrgIterator) =
    if length(it.o.contents) > 0
        el, state = iterate(it.o)
        (el, Vector{Tuple}([(it.o, state), (el, nothing)]))
    end
iterate(it::OrgIterator, stack::Vector) =
    if length(stack) > 0
        next = if isnothing(stack[end][2])
            iterate(stack[end][1])
        else
            iterate(stack[end][1], stack[end][2])
        end
        if isnothing(next)
            pop!(stack)
            iterate(it, stack)
        else
            el, state = next
            stack[end] = (stack[end][1], state)
            if applicable(iterate, el)
                push!(stack, (el, nothing))
            end
            (el, stack)
        end
    end

This takes 1.2s to run on a structure with 26262 nodes. I have profiled Iterators.filter(c -> c isa FootnoteRef, OrgIterator(m)) |> collect (which just matches ~130 nodes) and looking at Profile.print() (full result: http://ix.io/3L3w) it appears that other than spending ages on poptask (which seems to always be the case in profiles?) the bulk of the time looks to be spent on line 26, i.e. if applicable(iterate, el).
However, benchmarking some call of applicable (I assume that the performance is the same for any type) shows it taking around 140ns per call, so calling that on every node should take ~4000000ns or 0.004s, which wouldn’t explain the poor performance.

So, I’m somewhat perplexed. Might anyone have some idea as to what is going on and how I can improve this?

I think it would be helpful to edit the post to include the Org type, some information on what its instance would look like, and the remaining relevant methods. For example, the iterate(it.o) line has the argument types iterate(::Org). What does that do? I can see it returns an el, state, and there’s the iterate(stack[end][1]) line later, which should do iterate(el). I can’t guess the argument types; what does that do?

It would also be helpful to include the code you use to measure the performance, including comments on your procedure.

Thanks for getting back to me Benny.

For reference, the code base in question is GitHub - tecosaur/OrgMode.jl: A Julia library for working with Org, WIP.
The Org type is fairly simple, ignoring irrelevant fields it’s basically a vector of OrgElements.
Some OrgElements contain other OrgElements, some contain OrgObjects. Some OrgObjects contain OrgObjects.

The end-result is that you can get a tree like this

Org
 - OrgElement
   - OrgElement
     - OrgObject
 - OrgElement
   - OrgObject
    - OrgObject
 - OrgElement

Each of these types inside Org has an iterator defined which will return all of its children.
Leaf nodes of this tree are detected by applicable(iterate, el).

The stack state variable is basically the current “path” in the tree, which allows for the DFS-like procedure to be iterated.

When a node can be descended into, it is appended to the stack. At each step, we return the node.
When a leaf node is reached, we move “up” the tree, and so remove the last item of stack.

This procedure is repeated until stack is empty.

For timing the Iterators.filter call, I just used @time since it takes 1.2s.
For timing the applicable(iterate, el) I picked some el from the tree and ran @benchmark applicable(iterate, $el).

I hope that helps make it clearer what I’m trying to do here.

I don’t know anything about applicable, but is it maybe possible to replace with el isa something?

Have you tried @code_warntype? Is the iterator allocating a lot?

I just had a look with @code_warntype, but I’m not really sure what to make of it. I can’t replace applicable with el isa because it varies based on the ~30 odd types I’m looking at.

I think I may have found a large part of the performance issue though. Testing applicable with both methods that are and aren’t applicable, it seems that calling applicable takes ages when the result is false. I’m rather astounded by the difference actually.

Above: applicable(iterate, .) -> false. Below: applicable(iterate, .) -> true.

So, I’m thinking that maybe it would be worth defining a function terminal(T) which shadows the result of !applicable(iterate, T), but is hardcoded.

Just doing a quick test, this looks like it could be an effective approach:

I’ll try this in the next few days, and update this thread.

Ok, defining terminal(el)::Bool has resulted in a ~8x speedup. 1.2s → 0.17s. This is a lot better.

It’s a pity that this still takes longer than creating the entire parsetree from scratch (0.15s).