How to lazily filter an iterator while also applying a function to it

jeffreyesun · March 12, 2020, 9:45pm

Hi,

I’m trying to figure out how to do the following: given an iterator A, create a generator that goes through each element v of A, and either yields something as a function of v or simply moves on to the next element v' without yielding anything at all. Consider the following example:

A = 1:100
function flt(v)
    if v%3 == 1
        return DoNotYield
    end
    v ^= 2
    if v%5 == 1
        return DoNotYield
    end
    return v
end

I want some kind of function filtermap such that filtermap(flt, A) returns a generators that yields n^2 for every number n between 1 and 100 such that (1.) n \not\equiv 1 (mod 3) and (2.) n^2 \not\equiv 1 (mod 3).

My actual use case is that that I have a bunch of data, and the function I have to apply to each data point is slow. It’s important that at any point in the function I want to be able to decide that I don’t want this data point, and so don’t yield it. This is why I can’t simply do Iterators.filter(flt, f(a) for a=A) or (f(a) for a=Iterators.filter(flt,A)).

An implementation of the desired function is not hard:

struct DoNotYield;end
function filtermap(flt, itr)
    Iterators.filter(v->v!=DoNotYield, flt(v) for v=itr)
end

but I wonder if there’s a built-in, and hence more standard/better optimized way to do it.

Does anyone know how this is “supposed” to be done? Thanks.

EDIT: The example I gave is misleading because it makes it seem like checking the condition for each item is fast. My real filter function looks more like this:

function flt(v)
    cond1(v) && return DoNotYield
    v = costly_computation1(v)
    cond2(v) && return DoNotYield
    v = costly_computation2(v)
    cond3(v) && return DoNotYield
    return v
end

ericphanson · March 12, 2020, 9:49pm

I’m not sure if this is quite what you’re looking for, but the built-in generator syntax can handle this kind of thing:

julia> gen = ( n^2 for n in 1:100 if ( mod(n,3) != 1 && mod(n^2,3) != 1) )
Base.Generator{Base.Iterators.Filter{var"#8#10",UnitRange{Int64}},var"#7#9"}(var"#7#9"(), Base.Iterators.Filter{var"#8#10",UnitRange{Int64}}(var"#8#10"(), 1:100))

julia> for g in gen
           println(g)
       end
9
36
81
144
225
324
441
576
729
900
1089
1296
1521
1764
2025
2304
2601
2916
3249
3600
3969
4356
4761
5184
5625
6084
6561
7056
7569
8100
8649
9216
9801

anon94023334 · March 12, 2020, 9:51pm

why not

(n^2 for n in Iterators.filter(x->(x%3 != 1 || x%5 !=1), -5:5 ))

?

jeffreyesun · March 12, 2020, 9:51pm

Sorry for being unclear. The example is maybe a poor one. The problem I have is that processing each element of the iterator is costly, and I want the ability to, at any point in processing it, decide that I don’t want to yield anything for it.

anon94023334 · March 12, 2020, 9:55pm

Ok, how about something along these lines?

function expensive(n)
         return rand() > 0.5 ? nothing : n
end

(n^2 for n in Iterators.filter(!isnothing, expensive.(1:10)))

ericphanson · March 12, 2020, 9:58pm

I think the expensive function is where the n^2 is, but wants to somehow control the filtering step. I don’t really see how to do it with generators and filters in one pass then though.

I’d probably just make a vector of results by push!ing to it whenever I get a datapoint I want to include. Seems like a simple 1-pass way to do it, though I guess the format of the results you want depends on the later steps (maybe you don’t want a vector).

edit: I guess if eveything is lazy then you can compose as you wish and it isn’t 2 passes through the data.

anon94023334 · March 12, 2020, 9:59pm

Oh, if the n^2 is expensive, I’d just make that decision inside of expensive() and not worry about mapping. Just filter on !isnothing (or use a sentinel for type stability) and you’re all set.

jeffreyesun · March 12, 2020, 9:59pm

This is how I was doing things originally, but my dataset is too big for me to store the whole vector of results. I need to be able to do the evaluation lazily to avoid running out of memory.

jeffreyesun · March 12, 2020, 10:00pm

So something like this?

Iterators.filter(!isnothing, flt(v) for v=itr)

where flt in this case is

function flt(v)
    cond1(v) && return nothing
    v = costly_computation1(v)
    cond2(v) && return nothing
    v = costly_computation2(v)
    cond3(v) && return nothing
    return v
end

This seems like it would work! Thank you!

anon94023334 · March 12, 2020, 10:01pm

yup. It’s not type-stable, but if you have a sentinel value of type v you could use as the filter condition, that’d work too.

jeffreyesun · March 12, 2020, 10:03pm

I don’t think that’s possible, as my type is CSV.Row2, which doesn’t seem to be easily constructable without reading from an actual file. Do you think that the type-instability matters a lot for performance?

It shouldn’t, right? All the expensive stuff is happening within expensive.

anon94023334 · March 12, 2020, 10:07pm

it shouldn’t be too bad. It’s a Union{nothing, typeof(v)}, which seems better to me than Any, but I don’t know.

If you’re using CSV.Row2, what about an empty row? You’d have to make sure you’re not actually expecting this as a result.

CSV.Row2(Vector{Symbol}(), Dict{Symbol, Int64}(), Vector{UInt64}(), Vector{UInt8}(), 0x0, 0)

I’d try a comparison to see what performance hit the type-unstable gives you, if any, and prefer that over the sentinel approach if it’s not too bad.

jeffreyesun · March 12, 2020, 10:10pm

Seems reasonable. For mostly aesthetic reasons, I might just go with nothing after checking to see if it makes a performance difference.

anon94023334 · March 12, 2020, 10:10pm

that’s what I was trying (poorly) to say

tim.holy · March 13, 2020, 7:08am

It usually is. See Union-splitting: what it is, and why you should care

anon94023334 · March 13, 2020, 6:39pm

That was a great read. Thank you!

jeffreyesun · March 13, 2020, 9:30pm

Quick question about this actually: in this context does one need to annotate the function with the union type:

function expensive(...)::Union{CSV.Row2, Nothing}
    ...
end

or is this handled automatically somehow (as long as all the parameters and internal variables are properly annotated)?

tim.holy · March 14, 2020, 9:19am

Typically you don’t need to annotate anything at all; in well-written code, Julia will figure out all the types and do it for you. Worth through the example at https://docs.julialang.org/en/latest/manual/performance-tips/#man-code-warntype-1 and you’ll see this in action. It looks like most of that section was written before we had union-splitting, so it could use a rewrite, but at least the core principles remain relevant even if the performance consequences have been greatly mitigated.

The only common cases where you need to annotate types are (1) the fields of a struct that you create and (2) when you’re trying to control dispatch (control which methods get called for which object types). Annotating the arguments of functions has no performance impact, it’s just to control dispatch. There are also specific circumstances where Julia’s inference doesn’t/can’t possibly work (e.g., returning either an Int or a Float64 based on reading data from a file, since Julia can’t possibly predict the contents of an arbitrary file at the time the code is compiled) where adding ::Union{A,B} might help, but you should use @code_warntype to detect such cases rather than assuming that you should annotate everything (which typically leads to inflexible code).

Keep in mind splitting happens only with “small” Unions (IIRC 3 or fewer concrete types), and you also can’t have too many “uncorrelated” unions (where you end up getting the outer product over all possible types) in the same code.

jeffreyesun · March 18, 2020, 4:36pm

Huh, I had no idea that type assertions on function arguments didn’t improve performance! I seem to remember that sometimes a statement like

@assert myvec .>= 0

can speed up following code. Is this still true? I can’t seem to find any documentation of “places where @assert can improve performance.”

mcabbott · March 18, 2020, 4:51pm

I think such assertions are simply a shorthand form for an if statement:

julia> @macroexpand @assert myvec .>= 0
:(if myvec .>= 0
      nothing
  else
      Base.throw(Base.AssertionError("myvec .>= 0"))
  end)

Topic		Replies	Views
Question on lazy map function New to Julia function , lazy-evaluation	10	3908	January 17, 2021
Not understanding iterators and lazy evaluation General Usage lazy-evaluation	11	1160	October 14, 2020
Defining a function with Iterators.filter - slower performance? New to Julia question	4	785	August 18, 2021
Filtering elements of a generator without collecting General Usage question , iterators	5	353	March 27, 2023
Should there be a `mapfilter` function or macro? General Usage	5	473	August 6, 2021

How to lazily filter an iterator while also applying a function to it

Related topics