Efficient way of turning iterator into a matrix

pdeffebach · April 27, 2020, 2:53pm

I have the following goal:

Take in an iterator whose eltype(itr) is not necessarily informative (it may just be Any).

If the iterator yields numbers, return a vector of those numbers.
If the iterator yields anything else, return a matrix where each row is a collected element of the iterator.

The most naive implementation is

_collect(x::Number) = x
_collect(x) = collect(x)

function mycollect(itr)
    mapreduce(transpose ∘ _collect, vcat, itr)
end

Does anyone have an idea for how to do this more efficiently?
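For reference, here is a small check of what the naive version above returns in both cases. The inputs are made up for illustration, and `using LinearAlgebra` is included only as a precaution for `transpose`.

```julia
using LinearAlgebra  # precaution; transpose on arrays lives here

_collect(x::Number) = x
_collect(x) = collect(x)

mycollect(itr) = mapreduce(transpose ∘ _collect, vcat, itr)

# Iterator of tuples: each element becomes one row of a matrix.
rows = mycollect((1, i, i^2) for i in 1:3)

# Iterator of numbers: transpose is a no-op on numbers, so vcat builds a vector.
nums = mycollect(i^2 for i in 1:3)
```

Each `vcat` call allocates a new, larger array, which is why this approach gets slow for long iterators.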

simeonschaub · April 27, 2020, 3:12pm

Do the iterator and the iterators inside it have a known length?

pdeffebach · April 27, 2020, 3:23pm

No, but we can assume the lengths are all equal.

mcabbott · April 27, 2020, 3:26pm

Then I have a package which will do this:

julia> tups = ((1,i,i^2) for i in Iterators.filter(_->rand(Bool), 1:1000));

julia> Base.haslength(tups)
false

julia> eltype(tups)
Any

julia> using LazyStack

julia> stack(_collect(t) for t in tups)
3×515 Array{Int64,2}:
 1  1   1   1    1    1    1    1    1    1  …       1       1       1       1
 2  3   8   9   10   11   13   14   15   16        988     989     991     999
 4  9  64  81  100  121  169  196  225  256     976144  978121  982081  998001

julia> stack(last(t) for t in tups)' # numbers
1×495 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 4  100  121  144  289  529  625  841  900  …  982081  990025  994009  996004

julia> @btime mycollect($tups);
  374.261 μs (1770 allocations: 2.38 MiB)

julia> @btime stack(_collect(t) for t in $tups);
  36.588 μs (445 allocations: 71.83 KiB)

The need for _collect here (with tuples) is a bug that only affects iterators of unknown length: stack((1,i,i^2) for i in 1:10) works, but stack(tups) currently does not.

pdeffebach · April 27, 2020, 5:25pm

Here is a solution

function _c(itr)
    n = length(first(itr))
    [xj[i] for xj in itr, i in 1:n]
end

EDIT: Oddly, this does not work with tups as defined above, and I am not sure why. It works with a plain range but not with a filtered one:

julia> t = ((1, i, i^2) for i in 1:4)
Base.Generator{UnitRange{Int64},var"#37#38"}(var"#37#38"(), 1:4)

julia> _c(t)
4×3 Array{Int64,2}:
 1  1   1
 1  2   4
 1  3   9
 1  4  16

julia> t = ((1, i, i^2) for i in Iterators.filter(isodd, 1:4))
Base.Generator{Base.Iterators.Filter{typeof(isodd),UnitRange{Int64}},var"#39#40"}(var"#39#40"(), Base.Iterators.Filter{typeof(isodd),UnitRange{Int64}}(isodd, 1:4))

julia> _c(t)
6-element Array{Int64,1}:
 1
 1
 1
 3
 1
 9

mcabbott · April 27, 2020, 6:22pm

You can just reshape it:

function c_cols(itr)
    n = length(first(itr))
    reshape([xj[i] for i in 1:n, xj in itr], n, :)
end

Applied to tuples, this turns out to be much quicker than what stack is doing (which I think is copyto!). But slower for vectors.
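A short usage sketch of c_cols on the filtered generator from the previous post. The result has one column per element; `permutedims` is added here as one way to get the one-row-per-element layout asked for in the original question.

```julia
function c_cols(itr)
    n = length(first(itr))
    # With an iterator of unknown size the comprehension yields a flat vector,
    # ordered with i varying fastest, so reshape restores the n × m layout.
    reshape([xj[i] for i in 1:n, xj in itr], n, :)
end

t = ((1, i, i^2) for i in Iterators.filter(isodd, 1:4))

cols = c_cols(t)           # 3×2: each column is one tuple
rows = permutedims(cols)   # 2×3: each row is one tuple
```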

pdeffebach · April 27, 2020, 6:34pm

I still don’t fully understand why adding an Iterators.Filter messes with the output. It appears to be splatting the tuples, filtering out all the elements of 1, i, i^2 and iterating through the values themselves…

pdeffebach · April 27, 2020, 6:47pm

Ah sorry. My implementation gives what I want. I mean that the first row of the matrix is collect(first(itr)).

mcabbott · April 27, 2020, 7:04pm

There is no splatting. Julia just doesn’t assign a shape to generators built from iterators whose size is unknown:

t1 = ((1,2,3) for i in 2:2:10)
t2 = ((1,2,3) for i in Iterators.Filter(iseven, 1:10))
t3 = ((1,2,3) for i in Iterators.Filter(iseven, hcat(1:5, 6:10)))
Base.IteratorSize(t1) # Base.HasShape{1}()
Base.IteratorSize(t2) # Base.SizeUnknown()
Base.IteratorSize(t3) # Base.SizeUnknown()

m1 = (t[i] for i in 1:3, t in t1) # Iterators.product
m2 = (t[i] for i in 1:3, t in t2)
m3 = (t[i] for i in 1:3, t in t3)
Base.IteratorSize(m1) # Base.HasShape{2}()
Base.IteratorSize(m2) # Base.SizeUnknown()
Base.IteratorSize(m3) # Base.SizeUnknown()

You are collecting something like m2.

But does your real problem contain tuples? These were just the first non-vector objects which came to mind.
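One way to sidestep the missing shape, sketched here as an assumption rather than anything proposed in the thread, is to collect the outer iterator first so that its length is known before building the comprehension. The name `c_rows_collected` is hypothetical.

```julia
# Hypothetical helper: materialize the outer iterator so the comprehension has a shape.
function c_rows_collected(itr)
    xs = collect(itr)            # a Vector, so its length is known
    n = length(first(xs))
    [x[i] for x in xs, i in 1:n]
end

t = ((1, i, i^2) for i in Iterators.filter(isodd, 1:4))

sz = Base.IteratorSize(t)    # Base.SizeUnknown()
A = c_rows_collected(t)      # 2×3 matrix, one row per tuple
```

This costs one intermediate vector of elements, which may or may not matter depending on how large the observations are.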

pdeffebach · April 27, 2020, 7:24pm

Iterators of tuples and named tuples would be useful. Just taking any collection of “observations” and putting it into a matrix where each row is an observation. So it’s helpful to be agnostic about what constitutes an observation.
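As a small illustration of being agnostic about the observation type: named tuples support `length` and integer indexing, so the comprehension approach from earlier in the thread applies unchanged. The field names `a` and `b` are made up for this sketch.

```julia
# Observations as named tuples; the outer iterator here has a known length.
obs = ((a = i, b = 2i) for i in 1:3)

n = length(first(obs))
M = [x[i] for x in obs, i in 1:n]   # 3×2 matrix, one row per observation
```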

mcabbott · April 27, 2020, 7:49pm

For things which aren’t tuples, notice that the wrong iteration order is fairly expensive:

@btime c_cols(collect(t) for t in $tups); # 31.931 μs (1016 allocations: 103.06 KiB)
@btime c_rows(collect(t) for t in $tups); # 65.194 μs (1517 allocations: 196.84 KiB)
@btime permutedims(c_cols(collect(t) for t in $tups)); # 32.206 μs (1018 allocations: 103.16 KiB)
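Note that c_rows is not defined anywhere in the thread. The sketch below assumes it is the row-wise counterpart of c_cols (the earlier _c with a reshape added) and checks that both routes give the same matrix.

```julia
function c_cols(itr)
    n = length(first(itr))
    reshape([xj[i] for i in 1:n, xj in itr], n, :)
end

# Assumed definition; the thread does not show c_rows.
function c_rows(itr)
    n = length(first(itr))
    reshape([xj[i] for xj in itr, i in 1:n], :, n)
end

tuples = ((1, i, i^2) for i in Iterators.filter(isodd, 1:6))
vecs = (collect(t) for t in tuples)

direct = c_rows(vecs)                   # rows built directly
via_cols = permutedims(c_cols(vecs))    # columns first, then transposed
```

In the row-wise comprehension the outer iterator is the fast index, so it is traversed once per value of i and `collect(t)` runs again each time. That is a plausible reason for the extra allocations in the timings above.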

pdeffebach · May 1, 2020, 7:36pm

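julia> function _flat(x)
           n = length(first(x))
           # flatten yields each element's n values consecutively, so fill an n×m array, then transpose
           permutedims(reshape(collect(Iterators.flatten(x)), n, :))
       end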
