ANN: Transducers.jl, efficient and composable algorithms for map- and reduce-like operations

#21

Thank you, I appreciate your feedback!

Yes, Transducers.jl’s collect uses inference API. Also, there are many other places (e.g., transducers like Partition which needs an internal buffer) that relies on type inference.

Great question! I’ve been pondering about how to avoid relying on the type inference. Note that, especially in transducers, this problem also applies when the output is not a container. For example, consider mapfoldl(Filter(isodd) |> Map(x -> x / 2), +, 1:10). It requires to choose a “good” initial zero unless you provide it with init keyword argument. Furthermore, the first resulting (mapped) item cannot be used as the initial guess since the first item may be filtered out. The problem you mention corresponds to the case you use push! instead of + (roughly speaking).

The direction I want to explore to solve it is to define “generic” left-identity type like

struct Identity end
Base.*(::Identity, x) = x
Base.+(::Identity, x) = x
# and so on...

and then use Identity() as the default init. (I am actually thinking to use struct Identity{OP} end with Base.*(::Identity{*}, x) = x to avoid confusion/bug due to miss-use of an identity for one reducing function with another reducing function. Also, above implementation like *(::Identity, x) can probably be implemented as a transducer.) It would require each foldl implementation to embrace widen-as-needed approach. This is because the accumulation variable can change type from Identity to, e.g., Float64 or Vector{Int} at any point. Also, this change of type can occur multiple times (e.g., IdentityVector{Int}Vector{Union{Missing,Int}}). It means that you need to write a somewhat non-trivial foldl for each container type (or any “reducible”). Although it is a downside in some sense, the upside is that you only need to write it once (well, if this plan works…).

Once this infrastructure is in place, I think widening-based collect can be implemented by just using

push_or_widen!(::Identity, x) = [x]
push_or_widen!(xs::Vector{T}, x::S) where {T, S <: T} = push!(xs, x)
push_or_widen!(xs::Vector, x) = push!(append!(Union{eltype(xs), typeof(x)}[], xs), x)

instead of push!. (Actually, you can already feed this function to mapfoldl. The only problem is that it is not going to be type-stable.)

I think Julia handles it nicely since it is a dynamic language. That is to say, the worst case scenario is the dynamic dispatch which should still result in a logically correct answer. One of the examples is the prime sieve. This is just a for-fun demo (don’t use it if you need speed!) but it does show that Julia can handle “type explosion.”

But I don’t know if it is possible to optimize such use-case unless the input container is a small container like tuples that can be completely unrolled.

For “sink” containers, I think it would be relatively easy as I think you only need to implement two methods:

push_or_widen!(xs::MyContainer{T}, x::S) where {T, S <: T} = push!(xs, x)
push_or_widen!(xs::MyContainer, x) = push!(append!(MyContainer{Union{eltype(xs), typeof(x)}}(), xs), x)

All the hard part would be in the foldl implementation for the “source” reducible container (as I mentioned in the answer to the first question).

6 Likes
#22

Very interesting work! I’m especially curious about this as IndexedTables and StructArrays implement many tabular algorithms (groupreduce, groupby, filter, in the future join) as returning iterators which are then collected using collect_structarray which is a variant of collect in Base (it uses the same widening strategy) that collects an iterator of structs into a struct of arrays. The main drawback of this approach so far is that in some scenarios iterators lack performance: for example collect_structarray(flatten(itr)) performs quite poorly by default so we end up defining a collect_columns_flattened. I’d be happy to try and add a method to collect_structarray that works with transducers.

There are two concerns that I see with using transducers. First of all (as mentioned above), it’s important that collect doesn’t rely on type inference. What I don’t understand is whether the new solution you have in mind will have the same performance as the previous one in case the transducer being collected is actually type stable. Maybe the only thing that’s needed is to separate the first step of the foldl (until you get [x]) from the rest of the iteration with a function boundary (that’s StructArrays as well as Base.collect strategy AFAIU).

The second concern is that, while I think it’s natural to return an iterator, if the user calls the “lazy version” of join or groupby, it’d be a bit risky to return a transducer as the user would then be forced to use the transducers technology to do further work on it (which not everybody is familiar with, as it is very new). Even though it breaks all the nice composability, is it possible to somehow turn a (Transducer, Iterator) pair into a new iterator? This way the API could be that the user can get as an output of a tabular operation either the collected version (eager mode) or the (transducer, iterator) pair (advanced user that would know how to work with them) or simply an iterator (normal lazy mode).

3 Likes
#23

Thanks, I appreciate your comments!

I think it’s possible but that’s something I can’t be sure until I implement it. Also, I think more tricky question is that if it the implementation is easier than Base.iterate protocol. I like that current foldl implementations are quite straightforward comparing to Base.iterate. But I’m not entirely sure if I can keep this feature after implementing the new solution.

Yes, it’s possible and already implemented. It’s called eduction. An eduction implements both foldl and iterator protocols. So, you can pass it to a function that expects a plain iterator. However, as those protocols are quite different, eduction is not super efficient, at least at the moment. I have some optimization ideas but I’m not sure if it can be comparable to plain iterators. (Edit: I suspect optimization is impossible for “expansive” cases like Cat/flatten. The optimization I have in mind is for “contraction” cases like Filter.)

3 Likes
#24

Hopefully the same trick used here in StructArrays (and as far as I remember in Base.collect) can work. There we basically check, inside the while loop, isa(res, current_type) and if it’s true we go on as planned and if it’s false we call the same function recursively (in this case collect_to_structarray!) with a widened value. The idea is that in the type stable case the if condition is known at compile time so it actually doesn’t impact performance. You only need to make sure that the arguments of the function where this happens are sufficient to “resume” things. For example here you would need to also have the state of the iterator as a function argument.

Thanks! That’s exactly what I was looking for!

2 Likes
#25

Cool. Good to know this trick works. This was something I had in mind when I wrote the plan with Identity thing.