Not understanding iterators and lazy evaluation

lewis · October 13, 2020, 10:36pm

Let’s say I have a big array:

big = rand(1:4, 8_000_000)

I have a loop that will run 180 times and filter big and do something with the results. Each time I filter big, the result vector will be a different size (trivially in the example, but with much more variation in the real code) so it’s hard to pre-allocate and just update in place.

Instead, I want the filter to be an iterator so I can loop through the results one at a time–the filtered array is never created/allocated: I just get one scalar per iteration. At least, this is what I am assuming. But, I can’t get it work out that way. Or, maybe I am expecting the wrong thing.

julia> big = rand(1:4, 8_000_000);

julia> ifiltbig = Iterators.filter(x->x==4, big)
Base.Iterators.Filter{var"#119#120",Array{Int64,1}}(var"#119#120"(), [2, 4, 3, 4, 2, 2, 3, 4, 2, 4  …  3, 4, 4, 3, 1, 4, 4, 4, 1, 4])

julia> @btime for i in $ifiltbig
       i * 4 # we really do more with i...
       end;
  18.252 ms (0 allocations: 0 bytes)

julia> @btime for i in filter(x->x==4, $big)
       i * 4 # really, do more...
       end;
  12.433 ms (3 allocations: 61.04 MiB)

So, my largest source array is 8.4 million rows so this is close. The iterator version is slower but has no allocations. So, certainly less RAM consumed though RAM isn’t a problem with an 8 million by 13 array of Int. The non-iterator version has to create a temporary array to hold the result of the filter which allocates a lot of RAM (though the amount doesn’t make sense…) and runs faster iterating over the storage of the filtered result. So, no allocations is not helping execution speed.

I would have expected the allocated memory to be 8 bytes * 8 million items * approximately .25 for the filtered size = approx. 1.6e7 bytes.

As Jefferson would say, “What did I miss?” (Hamilton musical…)

jishnub · October 14, 2020, 6:45am

Looking at the definition of filter, it allocates a full-sized array and subsequently resizes it after it knows how many elements satisfy the condition. This might explain the allocations that you see.

function filter(f, a::Array{T, N}) where {T, N}
    j = 1
    b = Vector{T}(undef, length(a))
    for ai in a
        @inbounds b[j] = ai
        j = ifelse(f(ai), j+1, j)
    end
    resize!(b, j-1)
    sizehint!(b, length(b))
    b
end

bernhard · October 14, 2020, 8:19am

If you ‘do more with i’, this may be more expensive than the filter aspect anyway.
Also, I think minimizing allocations is not always best for performance. I guess it often is a good approach, but there seem to be cases where allocations “are worth it”.

In you example I am wondering if the allocation is actually considerably less expensive than the comparisons

@btime filter(x->x==4, $big); #12ms
@btime zeros(Int,div(size($big,1),4)); #3.7ms

lewis · October 14, 2020, 5:06pm

@bernhard

the operation is likely more expensive, but unavoidable…

@jishnum

for sure, the filter causes 2 big allocations–>that’s why I am trying to express an iterator to do one at a time…

I am afraid I won’t be able to express this loop using an interator. Iterator semantics are pretty challenging in Julia (compared to Python) and the builtin iterators return values, not indices.

What I really need is an iterator filter (Iterators.filter or Iterators.map) that returns each index of a match (a true outcome) rather than the value that results in true. Is this possible?

pdeffebach · October 14, 2020, 5:38pm

are pretty challenging in Julia (compared to Python) and the builtin iterators return values, not indices.

in python, for v in list v iterates through elements of a list, not the indices.

Are you aware of eachindex? You can do for i in eachindex(x) ... to iterate through the indexes. Unfortunately it doesn’t look like this works with Iterators.filter, which is annoying.

Henrique_Becker · October 14, 2020, 6:00pm

I think there are some things worth mentioning:

Copying data is not always bad.
I imagine you have no way to do the 180 filters first, right? (i.e., put the 180 conditions in a single function and filter a single time.)
Iterators.filter(x->x==4, big) is a lazy construct. Often lazy constructs will have some overhead if you actually evaluate every single member of them. They are best if you want something behaving like a filtered list, but without needing to filter the whole list because it will not be necessary (e.g…, you are interested only in the fist X elements respecting some condition from a much larger array).
Can’t you do:

@btime for i in $big
    i != 4 && continue
    i * 4
end

Or

@btime for i in (v for v in $big if v == 4)
    i * 4
end

Or

import Transducers

@btime for i in Transducers.Filter(isequal(4))
    i * 4
end

I do not have the time to test them right now. Maybe will give times later.

pdeffebach · October 14, 2020, 6:03pm

The problem is that OP wants to work with the indices of the results of this filter, not just the values themselves.

skipmissing, for example, allows eachindex and lets you go through the indices of the non-missing elements. These lazy filters don’t do that.

Henrique_Becker · October 14, 2020, 6:10pm

Ok, but then, they can iterate the indexes with eachindex and use the index to access the value inside the block, or use enumerate and get both index and value. You can always iterate in the index space and use the index and a reference to the array to do the same as you were iterating the value space. Seems orthogonal to me, but maybe I am missing something.

pdeffebach · October 14, 2020, 6:11pm

You can’t use eachindex with objects returned from Iterators.filter or with generators. That’s the problem, I think.

lewis · October 14, 2020, 6:13pm

Yes, another poster pointed out the obvious…

Just test iterator over rows and test the value in the loop body and do something with true outcomes.

Bunch of ways to accomplish this. Sometimes Julia seems to induce excess “cleverness” in us (me?, some of us?) when simple will do.

lewis · October 14, 2020, 6:14pm

Yeah:

I kept trying to use eachindex, but couldn’t make it work.

Turns out other posters answered my simpler topic: just iterate the rows; do the test in the loop body; do what I want on the “true” branch of the test. Doh!

lewis · October 14, 2020, 6:23pm

That’s right. (re: Henrique….)

I think we can close this. The generic answer is efficiently iterator rows or row indices (several equally efficient approaches are possible). Do the filter test on each row/col value in the loop body. Do the operations needed on rows that pass. Simple.

Too much tendency to attempt “sophisticated” approaches like iterators, map, etc. etc. when simple stuff is often best.

Topic		Replies	Views
Defining a function with Iterators.filter - slower performance? New to Julia question	4	785	August 18, 2021
Struggling to express iterator that returns an index General Usage	2	417	October 14, 2020
How to lazily filter an iterator while also applying a function to it New to Julia question	21	1557	March 18, 2020
Question on lazy map function New to Julia function , lazy-evaluation	10	3908	January 17, 2021
Lazy Iteration General Usage	4	878	April 8, 2019

Not understanding iterators and lazy evaluation

Related topics