`collect` gives a different result than a `for` loop

I’ve somehow created an iterator function where calling collect on the Alignment struct returns a 5-element vector, while iterating produces only three elements.

Context: I’m working on a patch for BioAlignments (GitHub - MillironX/BioAlignments.jl at operations-fix), where I modified the Base.iterate function, but I don’t know how to debug this since stepping into the iterator doesn’t cause the input to become correct again. Does anyone know how this could happen?

Iterator code: BioAlignments.jl/alignment.jl at 5a8cb6bb3179e00adf7f750e26952257ca70f810 · MillironX/BioAlignments.jl · GitHub

MWE:

using BioAlignments

anchors = [
    AlignmentAnchor(0, 0, 0, OP_START),
    AlignmentAnchor(2, 2, 2, OP_SEQ_MATCH),
    AlignmentAnchor(2, 2, 3, OP_PAD),
    AlignmentAnchor(3, 3, 4, OP_SEQ_MATCH),
    AlignmentAnchor(3, 3, 5, OP_HARD_CLIP),
]
seq = AlignedSequence("ACG", anchors)
ref = "ACG"
aln = PairwiseAlignment(seq, ref)

@show collect(aln)

for (k, (i,j)) in enumerate(aln)
    @show k
    @show i
    @show j
    println(' ')
end

Output:

collect(aln) = [('A', 'A'), ('C', 'C'), ('G', 'G'), ('\0', '\0'), ('\0', '\0')]
k = 1
i = 'A'
j = 'A'
 
k = 2
i = 'C'
j = 'C'
 
k = 3
i = 'G'
j = 'G'

What is length(aln)? I’m guessing your discrepancy is there.


I’ve seen such behavior a few times. It usually happens when length claims one length, but actually iterating produces fewer elements. collect preallocates based on length, but doesn’t shrink to the number of elements actually produced (counting them may be slow).
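
To make that concrete, here is a minimal sketch using a made-up iterator (`Overcounted` is hypothetical and has nothing to do with the BioAlignments code) whose length claims 5 elements while iterate stops after 3. The trailing slots of the collected vector are simply never written, so their contents are arbitrary and should not be relied on.

```julia
# Hypothetical iterator for illustration: claims 5 elements but yields only 3.
struct Overcounted end

Base.length(::Overcounted) = 5            # what collect uses to size its result
Base.eltype(::Type{Overcounted}) = Int

# Stops after state 3, regardless of what length says.
Base.iterate(::Overcounted, state=1) = state > 3 ? nothing : (state, state + 1)

itr = Overcounted()

collected = collect(itr)                  # preallocated from length(itr)

looped = Int[]
for x in itr                              # driven only by iterate
    push!(looped, x)
end

@show length(collected)                   # follows length(itr)
@show length(looped)                      # number of elements iterate produced
@show collected[1:3] == looped            # the filled prefix matches the loop
```

Comparing length(itr) against the number of elements a plain loop produces is a quick way to check whether an iterator has this kind of mismatch.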

It would be interesting to add an opt-in LengthBounded() trait that allows shrinking the resulting collection if fewer than length(itr) elements were actually produced.
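
As a rough user-level approximation of that idea, something like the following could work. `collect_bounded` and `ShortIter` are hypothetical names invented for this sketch, not anything in Base. It assumes length(itr) is an upper bound and that eltype(itr) is accurate; an iterator that produces more than length(itr) elements would hit a BoundsError.

```julia
# Hypothetical iterator for the demo: length says 5, iterate yields 3 values.
struct ShortIter end
Base.length(::ShortIter) = 5
Base.eltype(::Type{ShortIter}) = Int
Base.iterate(::ShortIter, state=1) = state > 3 ? nothing : (state, state + 1)

# Hypothetical helper, not part of Base: treat length(itr) as an upper bound,
# fill the preallocated vector, then shrink it to the number of elements seen.
function collect_bounded(itr)
    out = Vector{eltype(itr)}(undef, length(itr))
    n = 0
    for x in itr
        n += 1
        out[n] = x            # throws BoundsError if itr exceeds length(itr)
    end
    return resize!(out, n)    # drop the slots that were never written
end

result = collect_bounded(ShortIter())
@show result
@show length(result)
```

This keeps the single up-front allocation and only pays for one resize! at the end, at the cost of trusting length as a bound.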

julia> length(aln)
5