Transducer Filtering with Regex?

Satvik · December 27, 2020, 4:26am

Hi,

I have several cases where I need just the first few elements that match a condition. With transducers, I can do something like:

big_array |> Filter(expensive_filter) |> Map(expensive_transformation) |> Take(5) |> collect

and it will evaluate my expensive functions the minimum number of times to get 5 passing elements. Awesome!
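To make the "minimum number of times" claim concrete, here is a small sketch that counts predicate calls. It swaps in Base's lazy iterators (`Iterators.filter`, a generator, `Iterators.take`) in place of the transducer pipeline so that it runs without any packages; the counter `calls` and the toy definitions of `expensive_filter` and `expensive_transformation` are stand-ins invented for illustration.

```julia
# Hypothetical stand-ins for the expensive functions; `calls` counts predicate runs.
calls = Ref(0)
expensive_filter(x) = (calls[] += 1; iseven(x))
expensive_transformation(x) = x^2

big_array = 1:1_000_000

# Lazy pipeline: filter, then transform, then stop after 5 results.
lazy = Iterators.take(
    (expensive_transformation(x) for x in Iterators.filter(expensive_filter, big_array)),
    5,
)

result = collect(lazy)   # [4, 16, 36, 64, 100]
println(result)
println("predicate calls: ", calls[])
```

Because the pipeline stops as soon as the fifth passing element is produced, the predicate is only evaluated on the first ten inputs rather than on all one million.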

I’m trying to do something similar with regex. Here’s a working, non-transducer example:

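function get_regex_matches(str, pattern, cache, num_matches)
    j = firstindex(str)
    for i in 1:num_matches
        m = findnext(pattern, str, j)   # index range of the match, or nothing
        m === nothing && break          # fewer than num_matches matches
        cache[i] = SubString(str, first(m), last(m))
        j = nextind(str, last(m))       # resume just past the previous match
    end
end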

I double-checked that this approach is actually faster than using eachmatch for my use case:

function get_regex_matches_eachmatch(str, pattern, cache, num_matches)
    for (i, m) in enumerate(eachmatch(pattern, str))
        if i > num_matches
            break
        end
        cache[i] = m.match
    end
end

sample_string = "np.where(0.03376436965056531*df['rolling_zscore(sma(high_low_mean,12),90)']+ 0.05938041870616624*df['rolling_zscore(slope(slope(slope(24),24),12),90)'] - 0.05545017417059229*df['rolling_zscore(slope(slope(slope(12),12),12),90)']  + -0.02641195019844662 > 0, 1.0, "
cache = Array{String}(undef, 4)
const feature_pat = Regex(raw"\[.*?\]")

@btime get_regex_matches_eachmatch(sample_string, feature_pat, cache, 2)
  1.085 μs (13 allocations: 1.11 KiB)
@btime get_regex_matches(sample_string, feature_pat, cache, 2)
  617.279 ns (5 allocations: 240 bytes)

However, this loses the benefits of the transducer approach, especially composability. Is there a way to do a transducer-like or lazy filter that goes through regex matches of a string?
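One possible direction, as a rough sketch rather than a benchmarked answer: `eachmatch` already returns a lazy iterator, so it can be composed with Base's `Iterators.filter`, generators, and `Iterators.take` without scanning the whole string. The shortened `sample` string and the length-based predicate below are made up for illustration. Whether this recovers the speed of the `findnext` loop is untested here, since each step still constructs a `RegexMatch`.

```julia
# Shortened, hypothetical input in the same shape as sample_string.
sample = "a['x'] + b['yy'] - c['zzz'] > 0"
pat = r"\[.*?\]"

# eachmatch is lazy: a match is only searched for when the iterator advances.
matches = eachmatch(pat, sample)

# Compose lazy stages; no regex search happens until `collect` runs.
pipeline = Iterators.take(
    (m.match for m in Iterators.filter(m -> length(m.match) > 5, matches)),
    2,
)

first_two = collect(pipeline)
println(first_two)
```

Each stage here is an ordinary iterator, so further filters or maps can be stacked in the same way. If Transducers.jl accepts arbitrary iterables as input (an assumption worth checking against its documentation), the same `eachmatch` iterator could be fed into a `Filter`/`Map`/`Take` pipeline instead.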