Why does DataFrame violate the iterator interface?

using DataFrames
df = DataFrame(:a=>[1,2,3])
Base.IteratorSize(DataFrame)   # Base.HasLength()
length(df)   # method length(::DataFrame) doesn't exist

I’m OK with DataFrames not having a length, but in that case why does the following hold?

Base.IteratorSize(DataFrame) == Base.HasLength()

To my understanding, DataFrame does not implement the iterator interface, i.e.,

julia> for x in df
           println(x)
       end
ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator

julia> Base.IteratorSize(typeof(eachrow(df)))
Base.HasShape{1}()

What you are seeing is just the default implementation of IteratorSize:

julia> @which Base.IteratorSize(DataFrame)
Base.IteratorSize(::Type) in Base at generator.jl:93

which is implemented as

IteratorSize(::Type) = HasLength()  # HasLength is the default

and simply returns HasLength() for any type.
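
For a type that does take part in the iterator interface, opting out of that default is a one-line method. A minimal sketch, assuming a hypothetical Countdown iterator (not from DataFrames or Base) that defines iterate but no length:

```julia
# Hypothetical iterator that counts down from n; it defines iterate but no length.
struct Countdown
    n::Int
end

Base.iterate(c::Countdown, state=c.n) = state <= 0 ? nothing : (state, state - 1)
Base.eltype(::Type{Countdown}) = Int

# Opt out of the HasLength() default so that generic code does not call length.
Base.IteratorSize(::Type{Countdown}) = Base.SizeUnknown()

# collect works even though no length method exists for Countdown.
result = collect(Countdown(3))
```

Without the IteratorSize method, generic code such as collect would trust the HasLength() default and try to call length on a Countdown.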

But since length is not defined for DataFrame, maybe it would be better to overload IteratorSize for it?

Maybe, but then this would need to be done for every type that does not define length:

julia> struct MyType end

julia> length(MyType())
ERROR: MethodError: no method matching length(::MyType)

julia> Base.IteratorSize(MyType)
Base.HasLength()

Thus, Base.IteratorSize currently cannot be used to check whether length will work: it is only meaningful for types that actually implement the iterator interface. In particular, types that do not depend on that interface will not bother overloading this method (why should they?).
Arguably, the best fix would be to remove the default implementation of IteratorSize. At this stage, though, that would probably be breaking … on the other hand, it would not matter much for code outside of iterators anyway (or why would you want to check IteratorSize on a type that is not iterable?).
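
The mismatch between the trait and the available methods can be made visible with Base's hasmethod. This is only a sketch of one way to look at it, reusing the MyType example from above:

```julia
# A type that does not take part in the iterator interface.
struct MyType end

# The trait falls back to the generic default ...
trait_says_length = Base.IteratorSize(MyType) isa Base.HasLength

# ... while a method lookup shows that length is not actually available.
mytype_has_length = hasmethod(length, Tuple{MyType})

# For a type that does implement the interface, the method is found.
vector_has_length = hasmethod(length, Tuple{Vector{Int}})
```

Here trait_says_length is true while mytype_has_length is false, which is exactly why the trait alone says nothing about whether length will work.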

In 1.10 (iirc) we will get a Tricks.jl-style compile-time hasmethod. We might be able to use that to give an implementation of IteratorSize that actually checks which methods are defined.
Then you would only need to overload it if you were doing something odd.

We could also have it return an error status if the thing does not define iterate, which iirc DataFrames do not. Right now there is no status for this, since we assume people only try to use it on iterators, as was stated.
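
A rough sketch of what such a method-checking trait could look like. It is written as a hypothetical helper, checked_iteratorsize, rather than as a change to Base; it uses the ordinary runtime hasmethod (not any compile-time variant), and returning nothing for "no iterate method" is just a stand-in for the error status mentioned above:

```julia
# Hypothetical helper, not part of Base: derive the size trait from the
# methods that exist for T instead of trusting the HasLength() default.
function checked_iteratorsize(::Type{T}) where {T}
    # No iterate method at all: report "not an iterator" (here: nothing).
    hasmethod(iterate, Tuple{T}) || return nothing
    trait = Base.IteratorSize(T)
    # Trait claims HasLength but no length method exists: downgrade.
    if trait isa Base.HasLength && !hasmethod(length, Tuple{T})
        return Base.SizeUnknown()
    end
    return trait
end

# Hypothetical example types.
struct Opaque end              # defines neither iterate nor length

struct Pulse                   # defines iterate but not length
    n::Int
end
Base.iterate(p::Pulse, state=1) = state > p.n ? nothing : (state, state + 1)

opaque_trait = checked_iteratorsize(Opaque)
pulse_trait  = checked_iteratorsize(Pulse)
vector_trait = checked_iteratorsize(Vector{Int})
```

With these definitions, Opaque yields nothing, Pulse yields Base.SizeUnknown(), and Vector{Int} keeps the trait that Base already reports for it.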

I do agree: calling IteratorSize on a completely generic object (i.e., one for which you do not even know whether it is iterable) is a code smell. Your code should either assume it is (or is not) an iterable, or pass this information explicitly along with the generic object.

DataFrames do define iterate; it’s the iterate method itself that throws:

julia> iterate(DataFrame())
ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] iterate(#unused#::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/abstractdataframe/iteration.jl:23
 [3] top-level scope
   @ REPL[4]:1
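
One consequence is that a method-existence check cannot detect this case. A small sketch with a hypothetical Table type (a stand-in, not the DataFrames implementation) whose iterate method exists only to raise an error:

```julia
# Hypothetical stand-in for a type that opts out of iteration.
struct Table end

# An iterate method exists, but its only job is to raise a helpful error.
Base.iterate(::Table) = error("Table is not iterable")

# The method lookup succeeds because the method is defined ...
table_has_iterate = hasmethod(iterate, Tuple{Table})

# ... yet calling it still fails.
table_iterate_threw = try
    iterate(Table())
    false
catch
    true
end
```

Both table_has_iterate and table_iterate_threw end up true, so a check based on hasmethod would still treat such a type as an iterator.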

This comment seems irrelevant to the question…