Struggling to express iterator that returns an index

This is a follow-on to another post, with a more specific question.

For the array big = hcat(rand(1:4,2_000_000), rand(1:5, 2_000_000)) I want the indices of the rows where the first column passes a filter, so that I can use them to index the values in the second column. This is easy with findall, but that allocates an array for the result of the filter.

With findall:

julia> @btime @views for i in findall(==(4),big[:,1])  # use @views to avoid allocating the result of the slice
       big[i,2] # do something with this...
       end
  42.486 ms (1500748 allocations: 35.54 MiB)

From the allocations, we can tell that the result of findall was realized as an array of Ints. What I am trying to figure out is an iterator that will yield each matching index to the column, one at a time, so I can use the index in the loop body. I can’t figure out the combination of functions to express this.
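As a quick check of what findall hands back, here is a minimal sketch on a tiny matrix (the name small and its values are illustrative, not from the benchmark above). It only shows that the matching indices are materialized as a vector before the loop runs. Note also that big is not interpolated with $ in the timing above, so part of that allocation count probably comes from accessing a non-constant global rather than from findall itself.

```julia
# Hypothetical stand-in for `big`: column 1 is filtered, column 2 is read.
small = [4 10; 1 20; 4 30; 2 40]

# findall builds the complete vector of matching row indices up front.
idx = findall(==(4), view(small, :, 1))

# The indices can then be used to read the second column.
vals = [small[i, 2] for i in idx]

println(typeof(idx))
println(idx)
println(vals)
```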

Here is an iterator for the filter (works with map or filter) that doesn’t do the allocation, but trivially always returns 4, the filtered value.

julia> @btime @views for i in Iterators.filter(x->x==4, $big[:,1])
       i
       end
  4.838 ms (0 allocations: 0 bytes)

Here we see no allocations: Iterators.filter prevents us from creating the filtered array and @views prevents the slicing from creating an allocation. Pretty awesome except I want the index into the array big, not the resulting value. It would be awesome if there were an Iterators.findall!

(Note that in reality I use a named tuple of vectors so I don’t even have to do @views, but both are OK.)

How about just testing in the loop?

mat = rand(1:4, 100, 2)
for i in axes(mat,1)
    mat[i,1] == 4 || continue
    stuff(mat[i,2])
end

vecs = (a=rand(1:4, 100), b=randn(100));
for (i, a) in pairs(vecs.a)
    a == 4 || continue
    stuff(vecs.b[i])
end
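The same test can also be packaged as a lazy iterator over indices, which is roughly what an Iterators.findall would look like. This is only a sketch: lazy_findall and sum_matching are hypothetical names, not Base functions, and the data is illustrative. It should avoid building the index vector, but it is worth confirming with @btime on the real data.

```julia
# Hypothetical helper: lazily yields each index i of `col` where pred(col[i]) holds.
# Nothing is collected; indices are produced one at a time as the loop asks for them.
lazy_findall(pred, col) = (i for (i, x) in pairs(col) if pred(x))

# Wrapped in a function so the loop runs in local scope.
function sum_matching(mat)
    total = 0
    for i in lazy_findall(==(4), view(mat, :, 1))
        total += mat[i, 2]   # use the index into the second column
    end
    return total
end

mat = [4 10; 1 20; 4 30; 2 40]   # illustrative data
result = sum_matching(mat)
println(result)
```

An equivalent spelling without the helper is Iterators.filter(i -> mat[i, 1] == 4, axes(mat, 1)).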

@mcabbott: yup, you are onto the right thing

Here is a way to do it after converting to a Tables.rowtable:

julia> @btime for i in Tables.rows($big)
           if i.status == 4
               i.agegrp # do stuff...
           end
       end
  5.887 ms (0 allocations: 0 bytes)

So, anything that creates a “free” row iterator will work. And it’s obvious. And it’s general because the test on row values can be more complicated.

Doh!
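For a named tuple of vectors, a row iterator can also be sketched with plain Base, assuming the columns are called status and agegrp as in the loop above. The function name sum_agegrp, the data, and the extra agegrp > 2 condition are illustrative only; the point is that zip walks the columns in lockstep and the row test can be as complicated as needed.

```julia
# Illustrative column table: a named tuple of equal-length vectors.
vecs = (status = [4, 1, 4, 2], agegrp = [3, 5, 1, 2])

# Hypothetical example: add up agegrp for the rows that pass a compound test.
function sum_agegrp(cols)
    total = 0
    # zip pairs up the columns lazily, one "row" per iteration.
    for (status, agegrp) in zip(cols.status, cols.agegrp)
        if status == 4 && agegrp > 2
            total += agegrp   # do stuff with the row values
        end
    end
    return total
end

picked = sum_agegrp(vecs)
println(picked)
```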