I implemented a type for defining arbitrary lazy sequences that performs well, but the implementation is so trivial that I'm not sure if it's worth making a package

… so I thought I’d post it here and see what people thought.

Lazy sequences have been a point of frustration for me with Julia. My main background is Python, where I always liked generator functions, and I really enjoy playing with lazy types in various functional languages.

In Julia, channels can be used in a way similar to generator functions, but the performance is bad. Thunk-based solutions like Lazy.jl are super cool, but their performance is also problematic. Making an efficient lazy iterator in Julia requires at least defining a type and an iterate method for it, as well as Base.eltype and Base.IteratorSize for most practical cases. This seems boilerplate-y. Obligatory infinite Fibonacci example:

struct Fibs end
Base.iterate(f::Fibs) = iterate(f, (0, 1))
Base.iterate(::Fibs, (a, b)) = a, (b, a+b)
Base.IteratorSize(::Fibs) = Base.IsInfinite()
Base.eltype(::Fibs) = Int

foreach(println, Fibs())

This is actually fast!
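For contrast, here is a sketch (not from the original post) of what the same sequence looks like in the Channel-based style mentioned above:

```julia
# Channel-based "generator": convenient, but every element crosses a task
# boundary, which is where the poor performance comes from.
fib_channel() = Channel{Int}(1) do ch
    a, b = 0, 1
    while true
        put!(ch, a)
        a, b = b, a + b
    end
end

first10 = collect(Iterators.take(fib_channel(), 10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```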

My thought was to make a type that generalizes this iterate-based pattern. This is what I came up with:

struct Sequence{F <: Function, I}
    f::F
    init::I
end

Base.iterate(s::Sequence) = iterate(s, s.init)
Base.iterate(s::Sequence, state) = s.f(state)
Base.IteratorSize(::Sequence) = Base.SizeUnknown()

Now, your infinite Fibonacci sequence looks like this:

fibs = Sequence((0, 1)) do (a, b)
    a, (b, a+b)
end

And it is just as fast as the other method.
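As a quick check that it composes with the stock iterator tools, here is a self-contained sketch (repeating the definition above, with the size trait defined on the type, as the manual's interfaces chapter suggests):

```julia
struct Sequence{F <: Function, I}
    f::F
    init::I
end

Base.iterate(s::Sequence) = iterate(s, s.init)
Base.iterate(s::Sequence, state) = s.f(state)
Base.IteratorSize(::Type{<:Sequence}) = Base.SizeUnknown()

fibs = Sequence((0, 1)) do (a, b)
    a, (b, a + b)
end

# Take a finite prefix of the infinite sequence with Base's lazy take:
first8 = [x for x in Iterators.take(fibs, 8)]  # [0, 1, 1, 2, 3, 5, 8, 13]
```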

If you want a sequence that terminates, it looks like this:

counter(stop) = Sequence(1) do n
    n > stop ? nothing : (n, n+1)
end

This works well, but it’s still tricky to “collect” because it lacks an implementation for eltype. You can either give the element type explicitly to collect (e.g. collect(Int, counter(10))), or you can use something like this:

_get_types(s::Sequence) =
    Base.return_types(s.f, (typeof(s.init),))[1] |> _get_types
_get_types(::Type{Union{Nothing, S}}) where S = _get_types(S)
_get_types(::Type{Tuple{El,State}}) where {El,State} = (el=El, state=State)
Base.eltype(s::Sequence) = _get_types(s).el

Note that this is an imperfect heuristic. It can fail if the type of the state changes during iteration in a way that changes the return type. This seems to be pretty rare in practice.
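To make that failure mode concrete, here is a hypothetical step function (names invented for illustration) whose state switches from Int to Float64 after the first step, so inspecting the return type at the initial state type understates the element type:

```julia
# Hypothetical step function: the state type changes after the first step.
stepf(::Int) = (1, 1.5)                          # Int state -> Float64 state
stepf(x::Float64) = x > 3 ? nothing : (x, x + 1.0)

# The heuristic only sees the initial state type, so it concludes the
# element type is Int:
Base.return_types(stepf, (Int,))[1]              # Tuple{Int64, Float64}

# ... but actually driving the sequence also produces Float64 elements:
function elements(f, init)
    els, state = [], init
    while true
        r = f(state)
        r === nothing && break
        el, state = r
        push!(els, el)
    end
    return els
end

elements(stepf, 0)                               # [1, 1.5, 2.5]
```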

Is anyone else interested in this kind of thing? Would it be worth packaging it up and distributing it?

Edit: @tkf explained in a later post a better way to implement eltype:

Base.IteratorEltype(::Sequence) = Base.EltypeUnknown()
7 Likes

I think so, but it seems opening a PR somewhere would be better. See this topic, which looks similar: Julia equivalence of Haskells unfoldr

2 Likes

ResumableFunctions is also relatively slow, though much faster than using a Channel.

Is anyone else interested in this kind of thing? Would it be worth packaging it up and distributing it?

I’ve been using something like this quite often, but I never really thought of putting it into a module, since it’s really not a lot of code.

1 Like

Super cool type! It is in the spirit of KernelOps.jl (GitHub - francescoalemanno/KernelOps.jl), which applies lazy kernel operations on AbstractArrays; they are composable and non-copying. Have fun!

1 Like

I guess it’s more or less IterTools.iterated (with a post-hoc fix up using imap)?

I think the common consensus is that using return_types like this is wrong (i.e., changes in the compiler heuristics should not affect your API). I think a better approach is to define Base.IteratorEltype to be Base.EltypeUnknown(). collect should be able to handle it via the type-widening strategy.
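For what it’s worth, a self-contained sketch of that suggestion (with both traits defined on the type, per the manual’s interfaces chapter):

```julia
struct Sequence{F <: Function, I}
    f::F
    init::I
end

Base.iterate(s::Sequence) = iterate(s, s.init)
Base.iterate(s::Sequence, state) = s.f(state)
Base.IteratorSize(::Type{<:Sequence}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{<:Sequence}) = Base.EltypeUnknown()

counter(stop) = Sequence(1) do n
    n > stop ? nothing : (n, n + 1)
end

v = collect(counter(5))  # [1, 2, 3, 4, 5], a Vector{Int}
```

collect starts from an empty vector and widens the element type as elements arrive, so the concrete eltype is recovered without any return_types inspection.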

6 Likes

Thanks for the insight. It seems to work well!

I’ll check in with IterTools and see if they are interested in a type like this, as others have suggested.

Well, come on then! Show us the benchmarks.

2 Likes

Good idea. Here’s the benchmark code:

#!/usr/bin/env julia
using BenchmarkTools
using Lazy
using ResumableFunctions
using Sequence

channel_counter(start::I, stop::I) where I<:Integer =
    Channel{I}(1) do chan
        foreach(n->put!(chan, n), start:stop)
    end

lazy_counter(start) = @lazy start : lazy_counter(start + 1)
lazy_counter(start, stop) =
    takeuntil(n -> n > stop, lazy_counter(start))

@resumable function resumable_counter(start, stop)
    for n in start:stop
        @yield n
    end
end

sequence_counter(start, stop) = Sequence.new(start) do n
    n > stop ? nothing : (n, n+1)
end

struct CustomCounter{I <: Integer}
    start::I
    stop::I
end
Base.iterate(cc::CustomCounter) = Base.iterate(cc, cc.start)
Base.iterate(cc::CustomCounter, n) = n > cc.stop ? nothing : (n, n+1)


function main()
    println(:channel)
    @btime foldl(+, channel_counter(1, 100))
    println(:lazy)
    @btime foldl(+, lazy_counter(1, 100))
    println(:resumable)
    @btime foldl(+, resumable_counter(1, 100))
    println(:sequence)
    @btime foldl(+, sequence_counter(1, 100))
    println(:custom)
    @btime foldl(+, CustomCounter(1, 100))
end

main()

Here is the output:

channel
  388.460 μs (229 allocations: 4.92 KiB)
lazy
  174.170 μs (1589 allocations: 32.85 KiB)
resumable
  17.957 μs (373 allocations: 10.59 KiB)
sequence
  2.210 ns (0 allocations: 0 bytes)
custom
  2.210 ns (0 allocations: 0 bytes)

So, it’s roughly 8,000 times faster than resumable functions and the same speed as implementing a custom type.

3 Likes

2 ns for a loop of 100 cannot be right: the compiler probably just precomputes the result and returns it. For me, any number of iterations gives the same timing, e.g.

using BenchmarkTools

struct CustomCounter{I <: Integer}
    start::I
    stop::I
end
Base.iterate(cc::CustomCounter) = Base.iterate(cc, cc.start)
Base.iterate(cc::CustomCounter, n) = n > cc.stop ? nothing : (n, n+1)


function main()
    @btime foldl(+, CustomCounter(1, 1000000))
end

main()

shows 1.2 ns. A way to benchmark it more properly is to put the number of steps into a variable and interpolate it:

function main()
    println(:inlined_100)
    @btime foldl(+, CustomCounter(1, 100))
    println(:inlined_1000000)
    @btime foldl(+, CustomCounter(1, 1000000))
    println(:var_100)
    n = 100
    @btime foldl(+, CustomCounter(1, $n))
    println(:var_1000000)
    n = 1000000
    @btime foldl(+, CustomCounter(1, $n))
end

main()

prints


inlined_100
  1.267 ns (0 allocations: 0 bytes)
inlined_1000000
  1.267 ns (0 allocations: 0 bytes)
var_100
  31.864 ns (0 allocations: 0 bytes)
var_1000000
  250.600 μs (0 allocations: 0 bytes)

4 Likes

Actually, it’s probably using the closed form solution for a sum of sequential integers (if they’re non-negative, it can be computed efficiently with (n * (n + 1)) >>> 1).
LLVM often makes that transformation, and @code_llvm confirms.

function sumints(N)
    s = zero(N)
    for n ∈ one(N):N
        s += n
    end
    s
end

This yields:

julia> @code_llvm debuginfo=:none sumints(100)

define i64 @julia_sumints_2688(i64) {
top:
  %1 = icmp sgt i64 %0, 0
  %2 = select i1 %1, i64 %0, i64 0
  br i1 %1, label %L12.preheader, label %L27

L12.preheader:                                    ; preds = %top
  %3 = shl nuw i64 %2, 1
  %4 = add nsw i64 %2, -1
  %5 = zext i64 %4 to i65
  %6 = add nsw i64 %2, -2
  %7 = zext i64 %6 to i65
  %8 = mul i65 %5, %7
  %9 = lshr i65 %8, 1
  %10 = trunc i65 %9 to i64
  %11 = add i64 %3, %10
  %12 = add i64 %11, -1
  br label %L27

L27:                                              ; preds = %L12.preheader, %top
  %value_phi9 = phi i64 [ 0, %top ], [ %12, %L12.preheader ]
  ret i64 %value_phi9
}

This isn’t as clean as (n * (n + 1)) >>> 1, but you can see that it does not include any loops (L12.preheader: cannot jump back to itself, only to the end via br label %L27); it contains only simple arithmetic and shifts.

The fact that ninjaaron’s approach generates code simple enough for LLVM to figure this out shows that it is pretty friendly to the compiler.
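As a quick sanity check of the closed form mentioned above (a sketch):

```julia
# Gauss's formula for 1 + 2 + ... + n, with an unsigned right shift
# standing in for the division by two:
closed_form(n) = (n * (n + 1)) >>> 1

closed_form(100) == sum(1:100)  # true; both are 5050
```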

3 Likes

Thanks for pointing this out! I reran the benchmark with start and stop as variables in each case which definitely made an impact! Here is the main function after the changes:

function main()
    start, stop = 1, 100
    println(:channel)
    @btime foldl(+, channel_counter($start, $stop))
    println(:lazy)
    @btime foldl(+, lazy_counter($start, $stop))
    println(:resumable)
    @btime foldl(+, resumable_counter($start, $stop))
    println(:sequence)
    @btime foldl(+, sequence_counter($start, $stop))
    println(:custom)
    @btime foldl(+, CustomCounter($start, $stop))
end

and here is the output:

channel
  232.484 μs (229 allocations: 4.92 KiB)
lazy
  109.011 μs (1590 allocations: 32.87 KiB)
resumable
  11.329 μs (373 allocations: 10.59 KiB)
sequence
  33.356 ns (0 allocations: 0 bytes)
custom
  31.860 ns (0 allocations: 0 bytes)

Interestingly, the slower benchmarks improved with this change and the faster benchmarks got slower. Now, Sequence is only about 340 times faster than ResumableFunctions.

While that is a huge change from before, and what you suggested is definitely a better benchmark, I would point out that the fact that Sequence is susceptible to this kind of optimization is itself significant.

3 Likes

Where is the Sequence module defined?

I posted most of the code in the OP, but here’s exactly what it is in the benchmarks I posted; it incorporates the change to the implementation of eltype suggested by @tkf:

module Sequence

struct T{F <: Function, S}
    f::F
    state::S
end
new(f, s) = T(f, s)

Base.iterate(s::T) = iterate(s, s.state)
Base.iterate(s::T, state) = s.f(state)
Base.IteratorSize(::T) = Base.SizeUnknown()
Base.IteratorEltype(::T) = Base.EltypeUnknown()

end # module

I also have some implementations of map, filter, take, takewhile and a few others, but they are unnecessary, since this composes fine with the functions in Base.Iterators.
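For example, a standalone sketch of that composition (the struct is renamed here so the sketch doesn’t clash with the module above):

```julia
# Same shape as Sequence.T, with the iterator traits defined on the type.
struct Seq{F <: Function, S}
    f::F
    state::S
end

Base.iterate(s::Seq) = iterate(s, s.state)
Base.iterate(s::Seq, state) = s.f(state)
Base.IteratorSize(::Type{<:Seq}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{<:Seq}) = Base.EltypeUnknown()

naturals = Seq(n -> (n, n + 1), 1)           # infinite 1, 2, 3, ...
evens = Iterators.filter(iseven, naturals)   # lazy filter from Base
first5 = collect(Iterators.take(evens, 5))   # [2, 4, 6, 8, 10]
```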

Apologies in advance for naming the struct T. I’ve been writing a lot of OCaml, which has had an impact on my coding style. The upshot is that I got the idea for this when I noticed the similarities between the iteration protocol in Julia and the API of Base.Sequence in OCaml (from Jane Street).

4 Likes

Thanks, I was missing new. I’d have to look into this more to find out why, but the code is generating a loop for both sequence_counter and CustomCounter.

Is that bad?

FYI, here is how you implement it as a “foldable” (a counterpart of an “iterator” in Transducers.jl; see Transducers.AdHocFoldable):

using Transducers
using Transducers: @next, complete

counter_foldable(start, stop) =
    AdHocFoldable() do rf, acc, _
        i = start
        acc = @next(rf, acc, i)
        while i < stop
            i += 1
            acc = @next(rf, acc, i)
        end
        return complete(rf, acc)
    end

Benchmark code:

function main()
    println(:var_100)
    start = 1
    stop = 100
    @btime foldl(+, CustomCounter($start, $stop))
    @btime foldl(+, counter_foldable($start, $stop))
    println(:var_1000000)
    stop = 1000000
    @btime foldl(+, CustomCounter($start, $stop))
    @btime foldl(+, counter_foldable($start, $stop))
end

Output:

var_100
  60.414 ns (0 allocations: 0 bytes)
  2.284 ns (0 allocations: 0 bytes)
var_1000000
  527.555 μs (0 allocations: 0 bytes)
  2.285 ns (0 allocations: 0 bytes)

As you can see, it looks like Julia+LLVM figured out the closed-form solution even with the $ hack. I don’t know how to work around this (I tried start = Ref(1) instead, but it didn’t work).

Another example:

fib_foldable() =
    AdHocFoldable() do rf, acc, _
        a, b = 0, 1
        while true
            acc = @next(rf, acc, a)
            a, b = (b, a + b)
        end
        return complete(rf, acc)
    end

julia> collect(Take(10), fib_foldable())
10-element Array{Int64,1}:
  0
  1
  1
  2
  3
  5
  8
 13
 21
 34

As you can see, you can kind of think of @next as yield in Python.

3 Likes

Sequence (and presumably a type-based counter) seems to get into a similar optimization situation when implemented in a type-stable way:

_infinite_counter(start) = Sequence.new(n->(n, n+1), start)
type_stable_sequence_counter(start, stop) =
    take(_infinite_counter(start), stop-(start-1))

I added this and your AdHocFoldable implementation to the benchmarks and got this:

stop = 100
channel
  232.768 μs (229 allocations: 4.92 KiB)
lazy
  106.801 μs (1590 allocations: 32.87 KiB)
resumable
  11.672 μs (373 allocations: 10.59 KiB)
sequence
  33.358 ns (0 allocations: 0 bytes)
custom
  31.859 ns (0 allocations: 0 bytes)
type_stable_sequence_counter
  3.812 ns (0 allocations: 0 bytes)
adhoc_foldable_counter
  1.901 ns (0 allocations: 0 bytes)

stop = 1,000,000
type_stable_sequence_counter
  3.817 ns (0 allocations: 0 bytes)
adhoc_foldable_counter
  2.213 ns (0 allocations: 0 bytes)

It’s a bit strange that the iteration protocol is inherently type-unstable when there is a terminal case (the return type is Union{Nothing, Tuple{El, State}} where {El, State}).

It could probably be improved, at least in terms of performance, by having the step function return a type-stable tuple of (el, state, completion_status), where completion_status is a Boolean. I wonder…
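A rough sketch of that idea (purely hypothetical; this is not the Base protocol): the step function always returns the same concrete tuple type, with a flag marking exhaustion instead of a union with Nothing:

```julia
# Hypothetical (el, state, done) step: always the same concrete return type.
# When done is true, el is past-the-end and must be ignored by the caller.
counter_step(n, stop) = (n, n + 1, n > stop)

function sum_counter(start, stop)
    s, state = 0, start
    while true
        el, state, done = counter_step(state, stop)
        done && return s
        s += el
    end
end

sum_counter(1, 100)  # 5050
```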

To get around the automatic optimization, why not simply sum a series of trigonometric functions?
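One way to realize that suggestion (a sketch): fold sin over the elements, so LLVM has no closed form and the loop actually has to execute.

```julia
struct CustomCounter{I <: Integer}
    start::I
    stop::I
end
Base.iterate(cc::CustomCounter) = iterate(cc, cc.start)
Base.iterate(cc::CustomCounter, n) = n > cc.stop ? nothing : (n, n + 1)

# There is no closed form for a sum of sines, so this iteration cannot be
# constant-folded away.
trig_sum(itr) = foldl((acc, n) -> acc + sin(n), itr; init = 0.0)

trig_sum(CustomCounter(1, 100)) ≈ sum(sin, 1:100)  # true
```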