Not understanding iterators and lazy evaluation

Let’s say I have a big array:

big = rand(1:4, 8_000_000)

I have a loop that will run 180 times and filter big and do something with the results. Each time I filter big, the result vector will be a different size (trivially in the example, but with much more variation in the real code) so it’s hard to pre-allocate and just update in place.

Instead, I want the filter to be an iterator so I can loop through the results one at a time–the filtered array is never created/allocated: I just get one scalar per iteration. At least, this is what I am assuming. But, I can’t get it work out that way. Or, maybe I am expecting the wrong thing.

julia> big = rand(1:4, 8_000_000);

julia> ifiltbig = Iterators.filter(x->x==4, big)
Base.Iterators.Filter{var"#119#120",Array{Int64,1}}(var"#119#120"(), [2, 4, 3, 4, 2, 2, 3, 4, 2, 4  …  3, 4, 4, 3, 1, 4, 4, 4, 1, 4])

julia> @btime for i in $ifiltbig
       i * 4 # we really do more with i...
       end;
  18.252 ms (0 allocations: 0 bytes)

julia> @btime for i in filter(x->x==4, $big)
       i * 4 # really, do more...
       end;
  12.433 ms (3 allocations: 61.04 MiB)

So, my largest source array is 8.4 million rows so this is close. The iterator version is slower but has no allocations. So, certainly less RAM consumed though RAM isn’t a problem with an 8 million by 13 array of Int. The non-iterator version has to create a temporary array to hold the result of the filter which allocates a lot of RAM (though the amount doesn’t make sense…) and runs faster iterating over the storage of the filtered result. So, no allocations is not helping execution speed.

I would have expected the allocated memory to be 8 bytes * 8 million items * approximately .25 for the filtered size = approx. 1.6e7 bytes.

As Jefferson would say, “What did I miss?” (Hamilton musical…)

Looking at the definition of filter, it allocates a full-sized array and subsequently resizes it after it knows how many elements satisfy the condition. This might explain the allocations that you see.

function filter(f, a::Array{T, N}) where {T, N}
    j = 1
    b = Vector{T}(undef, length(a))
    for ai in a
        @inbounds b[j] = ai
        j = ifelse(f(ai), j+1, j)
    end
    resize!(b, j-1)
    sizehint!(b, length(b))
    b
end

If you ‘do more with i’, this may be more expensive than the filter aspect anyway.
Also, I think minimizing allocations is not always best for performance. I guess it often is a good approach, but there seem to be cases where allocations “are worth it”.

In you example I am wondering if the allocation is actually considerably less expensive than the comparisons

@btime filter(x->x==4, $big); #12ms
@btime zeros(Int,div(size($big,1),4)); #3.7ms

@bernhard

the operation is likely more expensive, but unavoidable…

@jishnum

for sure, the filter causes 2 big allocations–>that’s why I am trying to express an iterator to do one at a time…

I am afraid I won’t be able to express this loop using an interator. Iterator semantics are pretty challenging in Julia (compared to Python) and the builtin iterators return values, not indices.

What I really need is an iterator filter (Iterators.filter or Iterators.map) that returns each index of a match (a true outcome) rather than the value that results in true. Is this possible?

are pretty challenging in Julia (compared to Python) and the builtin iterators return values, not indices.

in python, for v in list v iterates through elements of a list, not the indices.

Are you aware of eachindex? You can do for i in eachindex(x) ... to iterate through the indexes. Unfortunately it doesn’t look like this works with Iterators.filter, which is annoying.

2 Likes

I think there are some things worth mentioning:

  1. Copying data is not always bad.
  2. I imagine you have no way to do the 180 filters first, right? (i.e., put the 180 conditions in a single function and filter a single time.)
  3. Iterators.filter(x->x==4, big) is a lazy construct. Often lazy constructs will have some overhead if you actually evaluate every single member of them. They are best if you want something behaving like a filtered list, but without needing to filter the whole list because it will not be necessary (e.g…, you are interested only in the fist X elements respecting some condition from a much larger array).
  4. Can’t you do:
@btime for i in $big
    i != 4 && continue
    i * 4
end

Or

@btime for i in (v for v in $big if v == 4)
    i * 4
end

Or

import Transducers

@btime for i in Transducers.Filter(isequal(4))
    i * 4
end

I do not have the time to test them right now. Maybe will give times later.

1 Like

The problem is that OP wants to work with the indices of the results of this filter, not just the values themselves.

skipmissing, for example, allows eachindex and lets you go through the indices of the non-missing elements. These lazy filters don’t do that.

Ok, but then, they can iterate the indexes with eachindex and use the index to access the value inside the block, or use enumerate and get both index and value. You can always iterate in the index space and use the index and a reference to the array to do the same as you were iterating the value space. Seems orthogonal to me, but maybe I am missing something.

You can’t use eachindex with objects returned from Iterators.filter or with generators. That’s the problem, I think.

Yes, another poster pointed out the obvious…

Just test iterator over rows and test the value in the loop body and do something with true outcomes.

Bunch of ways to accomplish this. Sometimes Julia seems to induce excess “cleverness” in us (me?, some of us?) when simple will do.

Yeah:

I kept trying to use eachindex, but couldn’t make it work.

Turns out other posters answered my simpler topic: just iterate the rows; do the test in the loop body; do what I want on the “true” branch of the test. Doh!

That’s right. (re: Henrique….)

I think we can close this. The generic answer is efficiently iterator rows or row indices (several equally efficient approaches are possible). Do the filter test on each row/col value in the loop body. Do the operations needed on rows that pass. Simple.

Too much tendency to attempt “sophisticated” approaches like iterators, map, etc. etc. when simple stuff is often best.