Implement iterator over subsets of array

Suppose I have an array like

x = [ 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ]

I want to loop over the sets of sequences of common numbers, using a syntax like:

for set in eachset(x)
   ...
end

Where eachset(x) should behave as an array of sets.

If I define

eachset(x) = [ findall(isequal(i),x) for i in unique(x) ]

I get the indexes of the elements of each set and I could use that (although I only need the ranges really).

But I understand that I do not really need to allocate that array to iterate over its elements. What do I have to implement to get an iterator instead of an array?

Is x always sorted? If yes, then you only need to implement iterate. It is very simple; see this tutorial, for example: Writing Iterators in Julia 0.7
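As a minimal sketch of what "only implement iterate" could look like for the sorted vector in the question: the wrapper type `EachSet` is a hypothetical name (not from the thread), and the sketch assumes equal values are contiguous in `x`. Each iteration yields the index range of one run, and the state is the index where the next run starts.

```julia
# Hypothetical wrapper type; it only holds a reference to the data.
struct EachSet{T<:AbstractVector}
    x::T
end

eachset(x) = EachSet(x)

# The state is the index at which the next run of equal values starts.
function Base.iterate(it::EachSet, start = firstindex(it.x))
    start > lastindex(it.x) && return nothing
    stop = start
    # Extend the run while the next element equals the first one of the run.
    while stop < lastindex(it.x) && it.x[stop + 1] == it.x[start]
        stop += 1
    end
    return (start:stop, stop + 1)
end

x = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
for set in eachset(x)
    println(set)
end
```

For the `x` above this prints 1:3, 4:5, 6:8 and 9:12, without building any intermediate array of index vectors.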

It is. It is slightly more complicated than that, because the vector is not simply a vector of numbers but a vector of structs that contain the counter. The sets are, however, contiguous in the original vector.

Thanks for the link. If anything changed from 0.7 please let me know.

Not any changes that I know of. It works the same on 1.7, at least my last Iterator did.

Since the topic was named "implement", I forgot that there are other options (like using an existing package). Maybe this one will be useful: Introduction · IterTools

You can use a generator instead of array comprehension to avoid allocating the main array:

( findall(isequal(i),x) for i in unique(x) )

instead of

[ findall(isequal(i),x) for i in unique(x) ]

but that’s still quite inefficient compared to a hand-crafted iterator…
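To make the difference concrete, here is a small sketch using the `x` from the question. Note that `unique(x)` is still evaluated (and allocated) when the generator is created, and every `findall` call scans the whole of `x` and allocates its own index vector; only the outer array of results is avoided.

```julia
x = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

# Parentheses build a lazy Base.Generator: unique(x) is computed right away,
# but each findall call runs only when the loop asks for the next element.
lazy = (findall(isequal(i), x) for i in unique(x))

for set in lazy
    println(set)   # one freshly allocated index vector per distinct value
end

# Square brackets compute and store every index vector up front.
eager = [findall(isequal(i), x) for i in unique(x)]
```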

Maybe if I provide some more information on the problem it gets clearer:

I have a struct named Atom, which contains the information of the atoms of my system. To simplify, let us suppose that it has 2 fields, the name and the molecule to which it belongs, i. e.:

struct Atom
  name::String
  molecule::Int
end

Now I have a vector of “atoms”, for example with 2 water molecules:

julia> atoms = [ Atom("O",1), Atom("H",1), Atom("H",1), 
                 Atom("O",2), Atom("H",2), Atom("H",2) ]
6-element Vector{Atom}:
 Atom("O", 1)
 Atom("H", 1)
 Atom("H", 1)
 Atom("O", 2)
 Atom("H", 2)
 Atom("H", 2)

The atoms of each molecule are always contiguous in the vector (the molecule numbers are not necessarily consecutive, but each molecule has a unique number), and the molecules are identified by the molecule field. What I want is to iterate over the molecules, with:

for molecule in eachmolecule(atoms)
    ...
end

So I have to implement the eachmolecule function that generates the iterator.

One thing I have to decide is what one molecule is. It may be a vector of atoms, a view of a vector of atoms, or another struct with the range of atoms in the original vector of atoms (the option I am leaning towards).
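The three options could look roughly like this for the first water molecule. This is only a sketch: `MoleculeRange` is a hypothetical name, not something defined in the thread.

```julia
struct Atom
    name::String
    molecule::Int
end

atoms = [Atom("O", 1), Atom("H", 1), Atom("H", 1),
         Atom("O", 2), Atom("H", 2), Atom("H", 2)]

# Option 1: a new vector; indexing with a range copies the three atoms.
copied = atoms[1:3]

# Option 2: a view; no copy, it refers back to `atoms`.
viewed = @view atoms[1:3]

# Option 3: a small struct that stores only the index range.
# `MoleculeRange` is a hypothetical name used for illustration.
struct MoleculeRange
    range::UnitRange{Int}
end
ranged = MoleculeRange(1:3)

# The atoms are looked up in the original vector only when needed.
first_molecule = atoms[ranged.range]
```

The range-based option keeps the iterator itself free of allocations and leaves it to the caller to decide whether to copy or take a view.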

Working on it from the information provided. Thanks!

If you want to follow the example of Iterators.partition, it uses SubArray.

help?> Iterators.partition
  partition(collection, n)

  Iterate over a collection n elements at a time.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> collect(Iterators.partition([1,2,3,4,5], 2))
  3-element Array{SubArray{Int64,1,Array{Int64,1},Tuple{UnitRange{Int64}},true},1}:
   [1, 2]
   [3, 4]
   [5]

I think following the trail given by @edit Iterators.partition([1, 2, 3, 4], 2) may give you an example of how to implement the iterator you want, as Iterators.partition (or better, the iterate for PartitionIterator) is probably the code in Base that most closely resembles what you want.
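A quick way to see what that means in practice (a small sketch, assuming a plain `Vector` as input): the chunks returned by `Iterators.partition` are views, so they share memory with the original vector instead of copying it.

```julia
x = [1, 2, 3, 4, 5]
parts = collect(Iterators.partition(x, 2))

# For a Vector input, each chunk is a SubArray (a view), not a copy.
chunk = first(parts)
println(chunk isa SubArray)
println(parent(chunk) === x)

# Writing through the view changes the original vector.
chunk[1] = 10
println(x[1])
```

An iterator over molecules could return `view(atoms, r0:r1)` in the same way if a view is preferred over a bare range.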

How about something like this

struct Atom
  name::String
  molecule::Int
end

atoms = [ Atom("O",1), Atom("H",1), Atom("H",1), 
                 Atom("O",2), Atom("H",2), Atom("H",2) ]

struct EachMolecule 
    atoms::Vector{Atom}
end

eachmolecule(atoms) = EachMolecule(atoms)

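function Base.iterate(em::EachMolecule, state = 1)
    # state is the index of the first atom of the next molecule
    r0 = state
    r0 > length(em.atoms) && return nothing
    m0 = em.atoms[r0].molecule
    r1 = r0
    # advance r1 until the molecule number changes
    while r1 <= length(em.atoms)
        em.atoms[r1].molecule != m0 && return (r0:r1 - 1, r1)
        r1 += 1
    end

    # last molecule: it extends to the end of the vector
    return (r0:r1 - 1, r1)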
end

julia> for molecule in eachmolecule(atoms)
           println(molecule)
       end
1:3
4:6
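One possible follow-up, sketched here as an assumption rather than as part of the answer: with only `iterate` defined, a `for` loop works, but `collect(eachmolecule(atoms))` fails because Julia assumes by default that an iterator has a `length`. Declaring the size as unknown and giving an element type avoids that. The `iterate` method below is a compact rewrite of the same logic so that the block is self-contained.

```julia
struct Atom
    name::String
    molecule::Int
end

struct EachMolecule
    atoms::Vector{Atom}
end

eachmolecule(atoms) = EachMolecule(atoms)

# Compact variant of the iterate method: r0 is the first atom of the molecule.
function Base.iterate(em::EachMolecule, r0 = 1)
    r0 > length(em.atoms) && return nothing
    r1 = r0
    while r1 < length(em.atoms) && em.atoms[r1 + 1].molecule == em.atoms[r0].molecule
        r1 += 1
    end
    return (r0:r1, r1 + 1)
end

# The number of molecules is not known without scanning the vector.
Base.IteratorSize(::Type{EachMolecule}) = Base.SizeUnknown()
# Each iteration yields a range of atom indices.
Base.eltype(::Type{EachMolecule}) = UnitRange{Int}

atoms = [Atom("O", 1), Atom("H", 1), Atom("H", 1),
         Atom("O", 2), Atom("H", 2), Atom("H", 2)]

molecules = collect(eachmolecule(atoms))
println(molecules)
```

With these two extra methods, `collect` returns a `Vector{UnitRange{Int}}` holding `1:3` and `4:6`.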

That was the part over which I was beating my brains out :slight_smile:

This is the second large contribution you have made to that package in terms of how things are done! I promise that when it becomes something useful you will be duly acknowledged. Thank you very much!
