How to lazily filter an iterator while also applying a function to it

Hi,

I’m trying to figure out how to do the following: given an iterator A, create a generator that goes through each element v of A, and either yields something as a function of v or simply moves on to the next element v' without yielding anything at all. Consider the following example:

A = 1:100
function flt(v)
    if v%3 == 1
        return DoNotYield
    end
    v ^= 2
    if v%5 == 1
        return DoNotYield
    end
    return v
end

I want some kind of function filtermap such that filtermap(flt, A) returns a generators that yields n^2 for every number n between 1 and 100 such that (1.) n \not\equiv 1 (mod 3) and (2.) n^2 \not\equiv 1 (mod 3).

My actual use case is that that I have a bunch of data, and the function I have to apply to each data point is slow. It’s important that at any point in the function I want to be able to decide that I don’t want this data point, and so don’t yield it. This is why I can’t simply do Iterators.filter(flt, f(a) for a=A) or (f(a) for a=Iterators.filter(flt,A)).

An implementation of the desired function is not hard:

struct DoNotYield;end
function filtermap(flt, itr)
    Iterators.filter(v->v!=DoNotYield, flt(v) for v=itr)
end

but I wonder if there’s a built-in, and hence more standard/better optimized way to do it.

Does anyone know how this is “supposed” to be done? Thanks.

EDIT: The example I gave is misleading because it makes it seem like checking the condition for each item is fast. My real filter function looks more like this:

function flt(v)
    cond1(v) && return DoNotYield
    v = costly_computation1(v)
    cond2(v) && return DoNotYield
    v = costly_computation2(v)
    cond3(v) && return DoNotYield
    return v
end

I’m not sure if this is quite what you’re looking for, but the built-in generator syntax can handle this kind of thing:

julia> gen = ( n^2 for n in 1:100 if ( mod(n,3) != 1 && mod(n^2,3) != 1) )
Base.Generator{Base.Iterators.Filter{var"#8#10",UnitRange{Int64}},var"#7#9"}(var"#7#9"(), Base.Iterators.Filter{var"#8#10",UnitRange{Int64}}(var"#8#10"(), 1:100))

julia> for g in gen
           println(g)
       end
9
36
81
144
225
324
441
576
729
900
1089
1296
1521
1764
2025
2304
2601
2916
3249
3600
3969
4356
4761
5184
5625
6084
6561
7056
7569
8100
8649
9216
9801

1 Like

why not

(n^2 for n in Iterators.filter(x->(x%3 != 1 || x%5 !=1), -5:5 ))

?

Sorry for being unclear. The example is maybe a poor one. The problem I have is that processing each element of the iterator is costly, and I want the ability to, at any point in processing it, decide that I don’t want to yield anything for it.

Ok, how about something along these lines?

function expensive(n)
         return rand() > 0.5 ? nothing : n
end

(n^2 for n in Iterators.filter(!isnothing, expensive.(1:10)))

I think the expensive function is where the n^2 is, but wants to somehow control the filtering step. I don’t really see how to do it with generators and filters in one pass then though.

I’d probably just make a vector of results by push!ing to it whenever I get a datapoint I want to include. Seems like a simple 1-pass way to do it, though I guess the format of the results you want depends on the later steps (maybe you don’t want a vector).

edit: I guess if eveything is lazy then you can compose as you wish and it isn’t 2 passes through the data.

Oh, if the n^2 is expensive, I’d just make that decision inside of expensive() and not worry about mapping. Just filter on !isnothing (or use a sentinel for type stability) and you’re all set.

1 Like

This is how I was doing things originally, but my dataset is too big for me to store the whole vector of results. I need to be able to do the evaluation lazily to avoid running out of memory.

So something like this?

Iterators.filter(!isnothing, flt(v) for v=itr)

where flt in this case is

function flt(v)
    cond1(v) && return nothing
    v = costly_computation1(v)
    cond2(v) && return nothing
    v = costly_computation2(v)
    cond3(v) && return nothing
    return v
end

This seems like it would work! Thank you!

yup. It’s not type-stable, but if you have a sentinel value of type v you could use as the filter condition, that’d work too.

I don’t think that’s possible, as my type is CSV.Row2, which doesn’t seem to be easily constructable without reading from an actual file. Do you think that the type-instability matters a lot for performance?

It shouldn’t, right? All the expensive stuff is happening within expensive.

it shouldn’t be too bad. It’s a Union{nothing, typeof(v)}, which seems better to me than Any, but I don’t know.

If you’re using CSV.Row2, what about an empty row? You’d have to make sure you’re not actually expecting this as a result.

CSV.Row2(Vector{Symbol}(), Dict{Symbol, Int64}(), Vector{UInt64}(), Vector{UInt8}(), 0x0, 0)

I’d try a comparison to see what performance hit the type-unstable gives you, if any, and prefer that over the sentinel approach if it’s not too bad.

Seems reasonable. For mostly aesthetic reasons, I might just go with nothing after checking to see if it makes a performance difference.

that’s what I was trying (poorly) to say :slight_smile:

It usually is. See https://julialang.org/blog/2018/08/union-splitting/

2 Likes

That was a great read. Thank you!

Quick question about this actually: in this context does one need to annotate the function with the union type:

function expensive(...)::Union{CSV.Row2, Nothing}
    ...
end

or is this handled automatically somehow (as long as all the parameters and internal variables are properly annotated)?

Typically you don’t need to annotate anything at all; in well-written code, Julia will figure out all the types and do it for you. Worth through the example at https://docs.julialang.org/en/latest/manual/performance-tips/#man-code-warntype-1 and you’ll see this in action. It looks like most of that section was written before we had union-splitting, so it could use a rewrite, but at least the core principles remain relevant even if the performance consequences have been greatly mitigated.

The only common cases where you need to annotate types are (1) the fields of a struct that you create and (2) when you’re trying to control dispatch (control which methods get called for which object types). Annotating the arguments of functions has no performance impact, it’s just to control dispatch. There are also specific circumstances where Julia’s inference doesn’t/can’t possibly work (e.g., returning either an Int or a Float64 based on reading data from a file, since Julia can’t possibly predict the contents of an arbitrary file at the time the code is compiled) where adding ::Union{A,B} might help, but you should use @code_warntype to detect such cases rather than assuming that you should annotate everything (which typically leads to inflexible code).

Keep in mind splitting happens only with “small” Unions (IIRC 3 or fewer concrete types), and you also can’t have too many “uncorrelated” unions (where you end up getting the outer product over all possible types) in the same code.

5 Likes

Huh, I had no idea that type assertions on function arguments didn’t improve performance! I seem to remember that sometimes a statement like

@assert myvec .>= 0

can speed up following code. Is this still true? I can’t seem to find any documentation of “places where @assert can improve performance.”

I think such assertions are simply a shorthand form for an if statement:

julia> @macroexpand @assert myvec .>= 0
:(if myvec .>= 0
      nothing
  else
      Base.throw(Base.AssertionError("myvec .>= 0"))
  end)