Hi,
I have several cases where I need just the first few elements that match a condition. With transducers, I can do something like
big_array |> Filter(expensive_filter) |> Map(expensive_transformation) |> Take(5) |> collect
and it will evaluate my expensive functions the minimum number of times to get 5 passing elements. Awesome!
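For concreteness, here's a self-contained version of that pipeline (expensive_filter, expensive_transformation, and big_array are just stand-ins for my real code):

using Transducers

expensive_filter(x) = x % 3 == 0    # stand-in predicate
expensive_transformation(x) = x^2   # stand-in transform
big_array = collect(1:10_000)

# Take(5) short-circuits the fold, so the predicate and transform
# only run until five passing elements have been produced.
big_array |> Filter(expensive_filter) |> Map(expensive_transformation) |> Take(5) |> collect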
I’m trying to do something similar with regex. Here’s a working, non-transducer example:
function get_regex_matches(str, pattern, cache, f, num_matches)
    j = 1
    for i in 1:num_matches
        m = findnext(pattern, str, j)  # range of the next match, or nothing
        m === nothing && break         # fewer matches left than requested
        cache[i] = f(SubString(str, first(m), last(m)))
        j = nextind(str, last(m))      # resume the search just past this match
    end
end
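To show the intended calling convention (toy inputs, with uppercase standing in for a real transform):

buf = Vector{String}(undef, 2)
get_regex_matches("a [x] and [y] and [z]", r"\[.*?\]", buf, uppercase, 2)
buf  # => ["[X]", "[Y]"]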
I double-checked that this approach is actually faster than using eachmatch for my use case:
function get_regex_matches_eachmatch(str, pattern, cache, num_matches)
    for (i, m) in enumerate(eachmatch(pattern, str))
        if i > num_matches  # note: this computes one extra match before breaking
            break
        end
        cache[i] = m.match
    end
end
sample_string = "np.where(0.03376436965056531*df['rolling_zscore(sma(high_low_mean,12),90)']+ 0.05938041870616624*df['rolling_zscore(slope(slope(slope(24),24),12),90)'] - 0.05545017417059229*df['rolling_zscore(slope(slope(slope(12),12),12),90)'] + -0.02641195019844662 > 0, 1.0, "
using BenchmarkTools

cache = Array{String}(undef, 4)
const feature_pat = Regex(raw"\[.*?\]")

@btime get_regex_matches_eachmatch(sample_string, feature_pat, cache, 2)
1.085 μs (13 allocations: 1.11 KiB)
id = x -> x  # identity transform (Base's identity would work too)
@btime get_regex_matches(sample_string, feature_pat, cache, id, 2)
617.279 ns (5 allocations: 240 bytes)
However, this loses the benefits of the transducer approach, especially composability. Is there a way to write a transducer-style or lazy filter over the regex matches of a string?
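Ideally I'd like to be able to write something like this (untested sketch; it assumes the lazy RegexMatchIterator returned by eachmatch composes with Transducers.jl the same way an array does):

using Transducers

# Hoped-for usage: pull out the first two bracketed features and stop,
# without scanning the rest of the string.
eachmatch(feature_pat, sample_string) |> Map(m -> m.match) |> Take(2) |> collect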