Getting the last element of an iterator

Question 1: Is there a nice one-liner to get the last element of an iterator, without collecting in a temporary array? The obvious doesn’t work:

julia> eachline(`dir`) |> last
ERROR: MethodError: no method matching lastindex(::Base.EachLine{Base.PipeEndpoint})

Which leads to…

Question 2: Why is last not defined on iterators? The last documentation says:

Get the last element of an ordered collection, if it can be computed in O(1) time.

Why this limitation?

BTW I suspect the person who wrote the documentation for Base.Iterators.only also expected this to work because it refers to last

1 Like

The last element of an iterator is clearly not O(1). also note an iterator may not be finite (also probably because that iterator interface is not defined by some type, unlike Range which is also lazily represented but can be lasted easily).

5 Likes

But why implement last only when it’s O(1)?

As for finite/infinite… by this logic collect should not be supported either?

I think it is more that collect is clearly O(n), there is no surprise there, as there is no other way it could be. Someone using last may unconsciously expect it to be O(1). More than that, if last has this complexity guarantee, then anyone may write a generic algorithm using it and give the expected complexity for their algorithm.

4 Likes

Presumably though you could define an Iterators.last that does something trivial like:

function Iterators.last(itr)
    # should throw in a check for Base.IsInfinite() 
    # before doing this. Also potentially defer to 
    # `Base.last` when a method of `lastindex` exists
    x = nothing # or first(itr), if that makes more sense
    for y in itr
        x = y
    end
    return x
end
3 Likes

@Henrique_Becker this doesn’t seem like a very good deal…

We could have:

  • An algorithm that works both with arrays and iterators, O(1) in the first case, O(n) in the other case, which can be documented.
  • A generally useful last (including situations like mine).
  • People that really want this selective behavior can use lastindex with minimal differences in the code.

Instead we have;

  • An algorithm that doesn’t work at all in the second case.
  • Many situations where last would be natural but doesn’t work.
  • People that want last for iterators need to replace a function call with a for loop and temporary variable, a much more intrusive change.

(Note that in many cases the algorithm would not work with iterators anyway, since last would consume the elements.)

I disagree. I prefer code clearly broken than surprises. I have nothing against a "generic last", but I would prefer it to be separated. You can wrap your solution in a differently-named function, or even extend Base.last though it is not recommended.

2 Likes

Part of it is that the minimal interface only needs iterate, and doesn’t necessarily know anything about size. It’s optional to implement IteratorSize(IterType) and length(), let alone last. You could easily implement your own fallback last that works when an iterator HasLength() and errors otherwise. But right now you already get the error anyway.

It seems reasonable to leave it up to the implementer what optional methods to include with their iterator. Even if their iterator has an end, it could cause problems when a user expects last(iter) or iter[end] to give an answer quickly.

collect isn’t ‘supported’ for infinite iterators. If you work according to the interface, you should be calling Iterators.IteratorSize() before collect, an infinite iterator will return IsInfinite() and you should abort. Or instead collect the composition of your iterator with take, to guarantee termination.

For SizeUnknown() iterators, the generation phase is no less efficient than in the known size case, except the output array cannot be preallocated, and must be grown by pushing.

Checking for HasLength() is too weak, sinch HasShape() also guarantees finite size, and SizeUnknown() actually suggests finite size by virtue of not being IsInfinite(). It seems like the interface is lacking a way to distinguish between SizeUnknown() but definitely finite, when collect is safe, and SizeUnknown() but possibly infinite.

should just use Base.iterate and it will be fine for arrays.

Except if it needs the last value without performing an iteration, then it shouldn’t because Base.iterate is not a random access interface, while last implicitly requires that.

3 Likes

Thanks for all the answers! I’ve thought about it for a while and I have to say it still looks quite wrong to me. Please consider the following points:

Generic function names such as last should be generally useful. It’s weird and inconsistent to reserve a name for some data structures because of performance characteristics. If we want a constraint like this, it should be a different function with a hint in the name.

Do algorithm writers really want to prevent uses that don’t meet some performance guarantee? I doubt it. Much better to document the assumptions for the complexity guarantee!

Now on the specifics. The main idea seems to be:

No generic fallback for last. This way people can use last with a guarantee of O(1).

But that’s misleading: there is no such guarantee. For example SLinkedList (from LinkedLists.jl) has an O(n) implementation of last. Do we really want to say package authors should not include such methods, even when it’s very useful? Conclusion: it makes things more surprising, not less.

And there is the added surprise that reasonable code such as eachline(cmd) |> last will fail. It gives the API a bad feeling, like there are random holes that can catch you when writing generic code.

Regarding Iterators: Algorithmic complexity seems like a bad reason to put a method there. For example we need Iterators.filter and Iterators.map (Julia 1.6) because they have different semantics than the Base version (returning iterators rather than arrays).

I think a real effect of the current behavior is that many people will do the equivalent of collect |> last, which has terrible performance. It’s unfortunate that the API nudges people to do that.

1 Like

I’m arguing that this should be changed (for the reasons given in my previous comment).

lastiterate(itr) = foldl((_, y) -> y, itr)

or

lastiterate(itr) = applicable(last, itr) ? last(itr) :  foldl((_, y) -> y, itr)
2 Likes

The most general method for iterators that should be O(1) is

first(Iterators.reverse(iterator))

which works for any iterator type that has implemented reverse iteration.

However, currently eachline has no implementation of reverse-order iteration, so this throws a MethodError. In principle, it would be straightforward to implement reverse-order eachline for files, but it doesn’t seem possible to do efficiently for pipes.

(There is an Iterators.last(iterator, n) iterator that uses this reverse approach.)

9 Likes

For future reference, yes, this is what I want, specifically. The author should not be extending a Base method if the base documentation is generic and goes against the behavior they want to implement. At very least, if they do this then they should create a documentation for the specific method (over their data structure) with a warning.

I want package authors to not ignore performance guarantees set by the Base, so I can get a datastructure of some package and pass it to some algorithm that I know the complexity without needing to check if the author has violated any of the performances guaranteed. If it works then it is the right complexity. Simple. If I really do not care about the performance degradation I can carefully extend Base.last myself for the third-party data structure.

You are basically saying it is misleading the fact that package authors are not perfect programmers and may make mistakes that compromise API guarantees… there is no way to guarantee many things except by believing package authors will try to be faithful to the API, and disregarding packages that you discover to fail to do so.

I opened a PR in LinkedLists.jl to get this corrected, if possible.

And when the programmer hits that point, it will be clear to them what is the expected overhead. They may decide to not support iterators, or spend an extra minute to do a loop that avoids the extra allocation, or save that value when it was readily available before, or even rethink their datastructure. It seems to me a problem of bad programmers, not bad API.

2 Likes

The algorithmic complexity is more a reason why last hasn’t been implemented generically thus far, rather than a justification why it should never be implemented. As others have shown, there are some simple ways to do last when you know the iterator is finite. Maybe it’s time to incorporate a solution.

One simple approach would be to implement something for I/O cases alone. Relatively conservative would be lastline(io::IO...) perhaps borrowing from countlines, which already exists. And actually, would be nice to have something like Unix, e.g. tail(io::IO, n).

Another approach is a generic fallback last for finite iterators (more general than I/O). To sort out all the cases would take some thought, as @malacroi points out. But a conservative fallback could be implemented (e.g., only for HasLength()) with appropriate warnings and documentation. It may be fine if it’s O(>1) as long as not infinite.

A pull request might be entertained for either approach.

2 Likes

Are you really recommending type piracy as a general solution for this?

Yes and no. If you do not really care about breaking the API contract, then I do not see why you would care about a little type piracy (which, I say, is not a greater evil if you know what you are doing). However, as most problems solved by type piracy, you can just take the extra step and wrap the third-party type inside your own type, delegate all methods to the wrapped object, and then you do not have type piracy anymore.

I wish this were a bit easier than it actually is.