Iterative construction of lazy Iterators leads to excessive compilation time

I wanted to write an iterator that lazily traverses a graph in breadth first search order, where the graph is defined implicitly by a function giving for each vertex an iterable over the adjacent vertices. This is what I wrote:

struct ImplicitGraph
    adj::Function
end

struct TraversalBFS{T}
    graph::ImplicitGraph
    start::T
end

G = ImplicitGraph(v -> (v - 1, v + 1))

t = TraversalBFS(G, 0)

function Base.iterate(t::TraversalBFS{T}, state=((t.start,), Set{T}())) where {T}
    queue, visited = state
    p = Iterators.dropwhile(w -> w ∈ visited, queue) |> Iterators.peel
    p === nothing && return nothing
    v, to_visit = p
    push!(visited, v)
    return v, (Iterators.flatten((to_visit, t.graph.adj(v))), visited)
end

for v in t
    print(v)
end

In this example we try to traverse the “graph of integers” starting at 0.
The problem is that this code runs ridiculously slowly: it seems the code is constantly being recompiled.
Even if the algorithm shown above may not be the best way to achieve what I want, it is perfectly reasonable. For comparison, I wrote an analogous version in Python:

import itertools

t = { 'adj': lambda v: (v-1, v+1), 'start': 0 }

initial_state = ((t['start'],), set())

def iterate(t, state):
    queue, visited = state
    to_visit = itertools.dropwhile(lambda w: w in visited, queue)
    v = next(to_visit, None)
    if v is None:
        return None
    visited.add(v)
    return v, (itertools.chain.from_iterable((to_visit, t['adj'](v))), visited)

v, state = iterate(t, initial_state)
print(v)
while True:
    v, state = iterate(t, state)
    print(v)

As you can see, this code runs fast.

The problem with the Julia code is that the data structure holding the state of the iterator becomes ever more complex: with every iteration the type tree grows and grows, and Julia overspecializes on these data structures.
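The growth can be seen directly; here is a minimal sketch (not part of the original code) that mimics what each call to iterate does to the queue:

```julia
# Each step wraps the previous queue in another Iterators.Flatten, so the
# *type* of the queue nests one level deeper per iteration, and Julia
# compiles a fresh specialization of iterate for every new nesting depth.
q0 = (0,)
q1 = Iterators.flatten((q0, (1, 2)))
q2 = Iterators.flatten((q1, (3, 4)))
q3 = Iterators.flatten((q2, (5, 6)))
typeof(q3)  # a Flatten{Tuple{Flatten{Tuple{Flatten{…}, …}}, …}} tower
```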

Given that, how is one supposed to write this kind of lazy algorithm in Julia?

(This question is a follow up of the issue I submitted on GitHub: Iterative construction of lazy Iterators leads to excessive compilation time · Issue #52295 · JuliaLang/julia · GitHub)


Well, the problem here is that you use a Tuple to store the queue, so Julia needs to specialize for every new length of the queue (roughly speaking), and that gets complex very fast.

All you need to do is change that to a vector like so:

function Base.iterate(t::TraversalBFS{T}, state=([t.start], Set{T}())) where {T}
    queue, visited = state
    filter!(w -> !(w ∈ visited), queue)
    isempty(queue) && return nothing
    v = popfirst!(queue)
    append!(queue, t.graph.adj(v))
    return v, (queue, visited)
end

Note that this also reuses the same vector over and over, so it shouldn’t be that bad allocation-wise (e.g. filter! and popfirst! won’t allocate).

Conceptually, in my version TraversalBFS is not backed by an iterator of remaining elements but rather by a queue data structure.

EDIT: Replaced pop! with popfirst! otherwise it would be a depth-first not breadth-first.


I think the reason the Python version of the lazy algorithm runs fast while the Julia version is abysmally slow is that the lazy iterators in Julia are designed for “shallow” hierarchies of iterators. The Python versions of lazy iterators probably just store references to the sub-iterators, while Julia stores their types, which become exponentially more complex. Therefore, the Julia versions just don’t work for this kind of deep recursion. It would probably be possible to write lazy iterators in Julia analogous to Python’s by using non-concrete types for the iterators (which would reduce performance in more common use cases). However, they would probably not be very useful, because the non-lazy approach shown by abraemer is likely faster in any case.


I still would call my approach lazy btw. It is just lazy on another level. In any case we have a lazily constructed graph that gets iterated and this appears to be the main point. There is no reason to also use lazy iterators on top of this.

Actually, now that I write this: I did not implement the algorithm correctly in my post above, since I used pop!, which makes it depth-first. But it is easy to fix by replacing pop! with popfirst! :slight_smile:

Can’t test this right now, but can’t that be partially done by replacing flatten(...) with Flatten{Any}(...)? The only method for it is flatten(itr) = Flatten(itr), which does Flatten{typeof(itr)}(itr) and thus would keep making new types as itr becomes new nested types. Still, the nested types themselves may trigger compilation of callees, so more abstract-ifying steps may need to be taken, like replacing elementwise-typed Tuples with Vector{Any}.

The obvious tradeoff of specializing over less specific types and parameters is lesser performance, but it’s probably worth it if the number of types can’t be controlled in any other way. CPython just happens to never specialize, so you get this behavior without any effort.

Aside: laziness is typically intended to save memory and work, but there can be a tipping point where it ironically takes more space and time because nesting structures take up memory in any language, and the work needed per iteration can pile up. Simple example, imagine a lazy iterator version of pop!(::Vector) that stores struct WillPop{T<:Vector} vec::T end; you do this in a loop and you accumulate WillPop{WillPop{WillPop{Vector{Int}}}} before collecting it which recursively excludes the trailing elements. Of course this doesn’t exist because eager pop! takes less work, and you can refactor to storing how many elements to remove before doing a mutating resize! or lazy view at once.


Good idea, although you also need to do a similar thing to Iterators.peel. The following is no longer horribly slow, but still appears a bit slower than the Python version.

function Base.iterate(t::TraversalBFS{T}, state=((t.start,), Set{T}())) where {T}
    queue, visited = state
    dw = Iterators.dropwhile(w -> w ∈ visited, queue)
    r = iterate(dw)
    r === nothing && return nothing
    v, st = r
    to_visit = Iterators.Rest{Any, Any}(dw, st)
    push!(visited, v)
    return v, (Iterators.Flatten{Any}((to_visit, t.graph.adj(v))), visited)
end

for v in t
    println(v)
end

Also regarding laziness - when using non-concrete types (as required in this approach), allocations usually start popping up. When that happens, it kind of defeats the point of using a lazy approach.


Depends on your needs, I suppose. Runtime type checking and allocating boxes would also happen in the Python version, and any less-than-optimal program is fine if it runs fast enough or if the bottleneck is elsewhere. That’s why Python is so popular despite widespread acknowledgement of how little it can optimize. Compiling specialized optimized code is ideal, but when there’s too much to compile, some runtime work is preferable. Julia at least gives you the options; pure CPython just does full-time runtime type checking and lets you tap into specialized code in other languages. CPython does at least try to mitigate how heap-heavy it is, like allocating pools for small objects and interning small integers and strings, even temporarily.

Thank you all for the suggestions!
Regarding lazyness, there’s a reason why I wrote the code like that: a vertex of a graph may have an infinite number of adjacent vertices. With my version, I can still iterate through the vertices of the graph (and if the algorithm was a depth first search, I could proceed exploring the graph in depth), while with the non-lazy (or partially-lazy :slight_smile: ) version, the iteration stops trying to collect all the infinite vertices. I remember that the adj function returns an iterator, that may be infinite, not always a tuple like in the above shown example.
The solution with the generic Any seems to work well, but it is about ten times slower than the python version, in the few tests I tried. Based on your suggestions, I have in mind another possible way to produce this lazy iterator, but I’m not sure if it could work nor if it would perform better than the one you already found. As soon as I have time I’ll try it and update you with my discoveries

Here is the version I had in mind:

struct IteratorQueue
    itrs::Vector{Any}
end

function Base.pop!(ic::IteratorQueue)
    isempty(ic.itrs) && return nothing
    it = ic.itrs[1]
    r = iterate(it)
    if r === nothing
        popfirst!(ic.itrs)
        return pop!(ic)
    end
    el, st = r
    ic.itrs[1] = Iterators.rest(it, st)
    el
end

function Base.push!(ic::IteratorQueue, it)
    push!(ic.itrs, it)
end


struct ImplicitGraph
    adj::Function
end

struct TraversalBFS{T}
    graph::ImplicitGraph
    start::T
end

G = ImplicitGraph(v -> (v - 1, v + 1))

t = TraversalBFS(G, 0)

function Base.iterate(t::TraversalBFS{T}, state=(IteratorQueue([(t.start,)]), Set{T}())) where {T}
    queue, visited = state
    v = pop!(queue)
    while v !== nothing
        v ∈ visited || break
        v = pop!(queue)
    end
    v === nothing && return nothing
    push!(visited, v)
    push!(queue, t.graph.adj(v))
    return v, (queue, visited)
end

Compared to the previous version, the state of the iterator doesn’t seem to grow. In the version with Any, although it seems to run fast, the state keeps growing and growing at a fast pace (I don’t even understand how it can run fast).

This solution runs almost as fast as the Python version. In my tests it was only about 1.3 times slower than the Python one. Yet it seems to require much more memory than the Python version: you can check the memory consumption with your system profiler, after removing from the code the lines that update the visited set, so that we exclude its growth from the measured memory.
If you have any insights about this, or find some better way, I’d be glad to hear from you :slight_smile:

My solution could be made more readable and of course improved in performance (e.g., replacing recursion with iteration in the pop! method). As you can see, the idea of the solution is to build a queue of iterators, and to pop only the first element of the first iterator, without collecting anything else.

I am on mobile now so can’t test, but I just had another idea: I think we could adapt my “half-lazy” approach to the infinite-vertex case. We just need to make the iteration state slightly more complex, with 3 parts: 1) the current iterator over the adjacent nodes of the node being explored, 2) a “todo” list of nodes to explore next, and 3) the set of seen nodes. The iteration checks the current iterator for the next node that has not been seen yet, adds it to the back of the todo list and to the seen set, and returns it (together with the iteration state). If the iterator of the current node runs out, we simply pop the first thing from the todo list and start iterating over its adjacent nodes.

I’ll try to write some code that does that but remember I am on mobile and don’t even have syntax highlighting, so the code likely won’t run.

function Base.iterate(t::TraversalBFS{T}, state=(Iterators.Stateful((t.start,)), T[], Set{T}())) where {T}
    queue, todo, seen = state
    while true
        while !isempty(queue)
            v = popfirst!(queue)
            v in seen && continue
            push!(seen, v)
            push!(todo, v)
            return v, (queue, todo, seen)
        end
        isempty(todo) && return nothing
        queue = Iterators.Stateful(t.graph.adj(popfirst!(todo)))
    end
end

A bottleneck in the current code is that the type of t.graph.adj is unknown and the resulting type instability causes lots of boxing, which is slow (about as slow as Python, which boxes everything).

One solution is to put this function into the iterator state, so it can be specialised on.

Edit: I’d only recommend this if you’re ok with paying some compilation price for each different adj function that you use.
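A minimal sketch of that change (assuming the struct definitions from the original post); the point is that the concrete type of the closure becomes part of the graph’s type, so calls to adj can be specialised:

```julia
# Parameterizing the field on F<:Function makes Julia specialize on the
# concrete closure type, instead of dynamically dispatching through an
# abstract ::Function field on every call to adj.
struct ImplicitGraph{F<:Function}
    adj::F
end

G = ImplicitGraph(v -> (v - 1, v + 1))
G.adj(0)  # statically dispatched once callers are specialized on F
```

The price, as noted, is one compiled specialization per distinct adj function you use.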


There should be a BFS implementation somewhere in the Julia ecosystem. Perhaps in the Graphs organization. Using an existing implementation may have many benefits.
In any case, here is another implementation that could meet the OP’s requirements:

using DataStructures

struct ImplicitGraph{F}
    adj::F
end

struct BFSterator{F}
    g::ImplicitGraph{F}
    root::Int
end

function Base.iterate(bfs::BFSterator)
    T = typeof(bfs.root)
    Q = Queue{T}()
    explored = Set{T}()
    enqueue!(Q, bfs.root)
    return iterate(bfs,(Q, explored))
end

function Base.iterate(bfs::BFSterator, state)
    Q, explored = state
    isempty(Q) && return nothing
    v = dequeue!(Q)
    push!(explored, v)
    for w in bfs.g.adj(v)
        w ∈ explored && continue
        enqueue!(Q, w)
    end
    return (v, (Q, explored))
end

With this implementation, and a function that returns a tuple of neighbouring vertices:

julia> neighbours(v::Int) = v > 0 ? v < 10 ? (v-1, v+1) : (v-1,) : (v+1,)
neighbours (generic function with 1 method)

julia> for v in BFSterator(ImplicitGraph(neighbours),2)
       @show v
       end
v = 2
v = 1
v = 3
v = 0
v = 4
v = 5
v = 6
v = 7
v = 8
v = 9
v = 10

If you must squeeze the above functions into 10 lines of code, that can also be done:

Base.iterate(bfs::BFSterator) = ( T = typeof(bfs.root);          
  iterate(bfs,(enqueue!(Queue{T}(), bfs.root), Set{T}())) )

function Base.iterate(bfs::BFSterator, (Q, explored))
    isempty(Q) && return nothing
    push!(explored, (v = dequeue!(Q);))
    foreach(Base.Fix1(enqueue!, Q), filter(!∈(explored), bfs.g.adj(v)))
    return (v, (Q, explored))
end

Thanks for your solution, I’ve learnt something about Julia reading it :smiley: I’m not using any existing graph library because they all seem to deal with concrete graphs, that is, graphs whose vertices and edges are known and already stored in memory. Your code has the problem that the iterator adj(v) is collected after the vertex v gets visited, and this doesn’t work for me because that iterator may be infinite.

Thanks, I’ll try to avoid boxing as you suggest, and let you know if I manage to see improvements

Thanks, this is very similar to my last solution, where I store in the state the iterators of adjacent vertices instead of the “Todo” vertices as you do. But your solution seems easier and probably more efficient. I’ll try it and let you know afterwards.


Here I am again. So I rewrote the iterator using your suggestions, and this is what I got (I report the full code):

struct ImplicitGraph{F<:Function}
    adj::F
end

struct TraversalBFS{T,F}
    graph::ImplicitGraph{F}
    start::T
end


G = ImplicitGraph(v -> (v - 1, v + 1))

t = TraversalBFS(G, 0)

function Base.iterate(t::TraversalBFS{T,F}) where {T,F}
    state = (t.graph.adj(t.start), [t.start], Set([t.start]))
    t.start, state
end

function Base.iterate(t::TraversalBFS{T,F}, state) where {T,F}
    current, todo, seen = state
    while true
        while (next = iterate(current)) !== nothing
            v, s = next
            if v in seen
                current = Iterators.rest(current, s)
                continue
            end
            push!(seen, v)
            push!(todo, v)
            return v, (Iterators.rest(current, s), todo, seen)
        end
        isempty(todo) && return nothing
        current = t.graph.adj(popfirst!(todo))
    end
end

The code seems way faster than before. In fact it is almost as fast as directly traversing the graph without creating the iterator in the first place. Here are two test functions: both compute the sum of the visited vertices, but the first uses the iterator shown above, while the second traverses the graph directly.

function test_iterator(t, n)
    s = 0
    i = 0
    for v in t
        (i = i + 1) > n && break
        s += v
    end
    s
end


function test_direct(t, n)
    s = t.start
    i = 1
    visited = Set([t.start])
    queue = [t.start]
    while !isempty(queue)
        v = popfirst!(queue)
        for w in t.graph.adj(v)
            if w ∉ visited
                (i = i + 1) > n && break
                s += w
                push!(visited, w)
                push!(queue, w)
            end
        end
    end
    s
end

These are the timed executions of the two functions:

julia> @time test_iterator(t, 1000000)
  0.286429 seconds (3.00 M allocations: 156.557 MiB, 4.01% gc time)
-500000

julia> @time test_direct(t, 1000000)
  0.144775 seconds (51 allocations: 34.503 MiB)
-500000

As you can see, the execution time of the iterator is quite satisfying because it is just twice the desirable execution time, but the memory usage is bad, because there are many allocations that are not present in the direct approach. Do you have some idea how to solve this problem? Is there a way to avoid those memory allocations?

The allocations come from allocating the return tuple of the iterator. Thus every iteration gets an allocation. This should be avoidable but I haven’t managed it yet.

To avoid the allocations from the return tuple, I think you just need to slap @inline in front of your iterate methods (cc @Dan ). I think the other allocations stem from the usage of Iterators.rest. Maybe one could avoid these by keeping track of the iteration state manually.
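For illustration, here is a toy iterator (hypothetical, not from the thread) showing where the annotation goes; inlining iterate lets the compiler keep the returned (value, state) tuple on the stack instead of heap-allocating it:

```julia
struct CountTo
    n::Int
end

# With @inline, callers such as sum can elide the (value, state) tuple
# that iterate returns on every step.
@inline function Base.iterate(c::CountTo, i=1)
    i > c.n ? nothing : (i, i + 1)
end

sum(CountTo(5))  # 15
```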


Wow, this is what happened after inlining the iterate functions:

julia> @time test_iterator(t, 1000000)
  0.222239 seconds (51 allocations: 34.503 MiB, 2.02% gc time)
-500000

The memory usage is exactly the same as in the direct case, so I think we reached the limit and we cannot do any better. At this point there is no drawback at all in using the iterator compared to using the direct approach, thank you :slight_smile:

Here are some more benchmarks.

julia> using BenchmarkTools

julia> @btime test_direct(t, 1)
  234.538 ns (6 allocations: 528 bytes)
0

julia> @btime test_iterator(t, 1)
  286.349 ns (7 allocations: 608 bytes)
0

julia> @btime test_direct(t, 10)
  628.488 ns (7 allocations: 608 bytes)
-5

julia> @btime test_iterator(t, 10)
  933.333 ns (10 allocations: 1.33 KiB)
-5

julia> @btime test_direct(t, 100)
  5.617 μs (13 allocations: 3.83 KiB)
-50

julia> @btime test_iterator(t, 100)
  6.940 μs (13 allocations: 3.83 KiB)
-50

julia> @btime test_direct(t, 1000)
  61.800 μs (20 allocations: 49.34 KiB)
-500

julia> @btime test_iterator(t, 1000)
  73.200 μs (20 allocations: 49.34 KiB)
-500

julia> @btime test_direct(t, 10000)
  606.100 μs (26 allocations: 193.56 KiB)
-5000

julia> @btime test_iterator(t, 10000)
  719.200 μs (26 allocations: 193.56 KiB)
-5000

As you can see, the memory behaviour is the same, and the time of the iteration algorithm is only slightly worse, which is completely reasonable.
