PartitionBy, retaining key

Is there something like Python’s itertools.groupby, which is like Julia’s Transducers.PartitionBy but yields (partition_key, partition_entries) pairs instead of only partition entries?

cc @tkf

Hmm… good point. I don’t remember why I didn’t pass along the key to the downstream transducers.

It is actually possible to write this, though:

julia> using Transducers, MicroCollections

julia> [1, 3, 2, 4, 3, 5] |>
       Map(x -> (isodd(x), x)) |>
       ReducePartitionBy(
           first,
           TeeRF(Map(first)'(right), Map(SingletonVector ∘ last)'(Completing(append!!))),
       ) |>
       collect
3-element Vector{Tuple{Bool, Vector{Int64}}}:
 (1, [1, 3])
 (0, [2, 4])
 (1, [3, 5])

(which is, BTW, parallelizable while PartitionBy is not)

OK, but arguably this is rather hairy to write.

Maybe it’d be better to wrap it in something like

reduced_partition_and_key(f, rf = Map(SingletonVector)'(Completing(append!!))) =
    Map(x -> (f(x), x)) |>
    ReducePartitionBy(
        first,
        TeeRF(Map(first)'(right), Map(last)'(rf)),
    )

so that

julia> [1, 3, 2, 4, 3, 5] |> reduced_partition_and_key(isodd) |> collect
3-element Vector{Tuple{Bool, Vector{Int64}}}:
 (1, [1, 3])
 (0, [2, 4])
 (1, [3, 5])

julia> [1, 3, 2, 4, 3, 5] |> reduced_partition_and_key(isodd, +) |> collect
3-element Vector{Tuple{Bool, Int64}}:
 (1, 4)
 (0, 6)
 (1, 8)

(The second example fuses in-partition reduction and avoids allocation of the inner vectors.)
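For intuition, here is roughly what the second call computes, written as a plain sequential loop in Base Julia. This is only a sketch of the semantics, not how Transducers.jl implements ReducePartitionBy, and `partition_reduce` is a made-up name for illustration:

```julia
# Hypothetical helper (Base Julia only): fold each run of consecutive
# elements sharing the same key `f(x)` with the binary operator `op`.
function partition_reduce(f, op, xs)
    out = []  # collects (key, accumulator) tuples
    for x in xs
        k = f(x)
        if !isempty(out) && isequal(first(out[end]), k)
            # Same key as the current run: update the accumulator in place,
            # so no per-partition vector is ever allocated.
            out[end] = (k, op(last(out[end]), x))
        else
            # Key changed (or first element): start a new run.
            push!(out, (k, x))
        end
    end
    return out
end

partition_reduce(isodd, +, [1, 3, 2, 4, 3, 5])
```

Only one accumulator per partition is kept, which is the point of fusing the reduction into the partitioning step.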

I don’t see why Unique() fails here. Replacing Unique() |> collect with collect |> unique works fine.

using Transducers, MicroCollections


reduced_partition_and_key(f, rf = Map(SingletonVector)'(Completing(append!!))) =
    Map(x -> (f(x), x)) |>
    ReducePartitionBy(
        first,
        TeeRF(Map(first)'(right), Map(last)'(rf)),
    )

charstrings = string.(collect("a123bc34d8ef34"))



charstrings |>
    reduced_partition_and_key(x->isnothing(tryparse(Int, x)), *) |>
    Filter(==(0) ∘ first) |>
    Map(x->parse(Int, x[2])) |>
    Unique() |>
    collect



ERROR: LoadError: MethodError: no method matching unwrap(::Transducers.Reduction{Unique{typeof(identity)},Transducers.Reduction{Map{Type{BangBang.NoBang.SingletonVector}},Transducers.BottomRF{Transducers.AdHocRF{typeof(BangBang.collector),typeof(identity),typeof(append!!),typeof(identity),Nothing}}}}, ::Tuple{Bool,String})

Unfortunately, stateful transducers like Unique cannot be used after a parallelizable transducer like ReducePartitionBy. It’s kind of a cost of parallelizability. There could be a better design that allows this, but it’s a bit tricky to do at the moment.

(Though the unwrap method error is actually a bug. Thanks for sharing the code!)
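For reference, the workaround mentioned in the question (collecting first and then calling Base’s `unique`) would look something like the following. It is just the code from the question with the last two steps swapped; I’m assuming the same package versions as in the thread:

```julia
using Transducers, MicroCollections

# Same helper as defined earlier in the thread.
reduced_partition_and_key(f, rf = Map(SingletonVector)'(Completing(append!!))) =
    Map(x -> (f(x), x)) |>
    ReducePartitionBy(
        first,
        TeeRF(Map(first)'(right), Map(last)'(rf)),
    )

charstrings = string.(collect("a123bc34d8ef34"))

# Materialize the results first, then deduplicate with Base's `unique`
# instead of the stateful `Unique()` transducer.
result = charstrings |>
    reduced_partition_and_key(x -> isnothing(tryparse(Int, x)), *) |>
    Filter(==(0) ∘ first) |>
    Map(x -> parse(Int, x[2])) |>
    collect |>
    unique
```

The deduplication then happens outside the transducer pipeline, so no stateful transducer has to sit downstream of ReducePartitionBy.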

Meanwhile, I think the easiest approach might be to just cook up your partitionby using FGenerators:

julia> using FGenerators

julia> @fgenerator function partitionby(f, xs)
           buffer = eltype(xs)[]
           key = f(first(xs))
           for x in xs
               y = f(x)
               if !isequal(y, key)
                   @yield key => buffer
                   empty!(buffer)
                   key = y
               end
               push!(buffer, x)
           end
       end
partitionby (generic function with 1 method)

julia> partitionby(x->isnothing(tryparse(Int, x)), charstrings) |>
           Map(((k, v),) -> (k, prod(v))) |>
           Filter(==(0) ∘ first) |>
           Map(x->parse(Int, x[2])) |>
           Unique() |>
           collect
3-element Vector{Int64}:
 123
  34
   8

Note: xs -> partitionby(f, xs) is not a transducer, so pre-processing of xs cannot be done with a transducer.


It just occurred to me that you’d need isempty(buffer) || @yield key => buffer after the loop, at the end of the function. Otherwise the last partition is never yielded.
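Putting the two pieces together, the generator with the trailing yield might look like the sketch below. It assumes xs is non-empty (first(xs) throws otherwise), and I wrote the final yield as an `if` block, mirroring the one inside the loop:

```julia
using FGenerators, Transducers

@fgenerator function partitionby(f, xs)
    buffer = eltype(xs)[]
    key = f(first(xs))  # assumes `xs` is non-empty
    for x in xs
        y = f(x)
        if !isequal(y, key)
            @yield key => buffer
            empty!(buffer)  # the same buffer is reused for the next partition
            key = y
        end
        push!(buffer, x)
    end
    # Flush the final partition, which the loop itself never yields.
    if !isempty(buffer)
        @yield key => buffer
    end
end

# Reduce each partition right away, before the buffer is emptied again.
sums = partitionby(isodd, [1, 3, 2, 4, 3, 5]) |>
    Map(((k, v),) -> (k, sum(v))) |>
    collect
```

Since the buffer is reused between yields, downstream code should consume (or copy) each partition immediately rather than store the vector itself.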