What is the agreed abstraction for the "size" of finite collections?

Suppose I have a type that is a collection of samples (e.g. DataFrame). Unlike a simple Vector for which the name length can be used to retrieve the number of elements, my type doesn’t have the function length, but instead a function npoints. Is there an API somewhere defining the generalized name for the number of elements in a finite collection? Maybe we could agree on something like nelms? This name is quite common in scientific work like in finite element methods, meshing algorithms, elements of a set…

As a side note, DataFrames.jl deprecated the use of length in favor of size(df, 2) but that doesn’t help much here because then we need to know the data layout of a frame and specify the second dimension explicitly. I have other types that behave like tables sometimes, but for which the name size is not appropriate.

Please let me know if you are aware of any generalized name in the Julia API, that would be helpful.

2 Likes

Generally, depending on which interface your type supports, you could define length (if it is an iterable) or size (for an abstract array, supporting getindex etc).

Otherwise, basically anything goes; you can define a method for length, or invent a new name for a function. The latter would make sense if the collection is something very specific and does not otherwise support either of the above interfaces. At the same time, I don’t think there is a generic name for these things, because each example is rather specific.

I am proposing a more general interface to represent finite collections, one that includes dataframes, vectors, arrays of finite size, and so on. Maybe this interface already exists under the name “indexable”, but many data structures like dataframes do not explicitly implement it. For these finite collections it would make sense to always be able to query the number of elements, however we don’t have a standard name for that currently as far as I know.

1 Like

I think length has that meaning. It seems useful to me to have a generic way to query this property. Perhaps DataFrames.jl could un-deprecate length(::DataFrame).

From the docs:

 length(collection) -> Integer

  Return the number of elements in the collection.
4 Likes

Can you clarify how this interface would be different from AbstractArray?

1 Like

If the type is iterable, I’d make the length method return a number consistent with the number of iterations that would be done over it.

To get the number of elements, one way would be making the type to be broadcastable for length. Thus you could just write sum(length.(x)) to obtain what you want.

2 Likes

I support this idea @yurivish, maybe DataFrames.jl could un-deprecate length or we could agree on a more intuitive name like nelms. Length has so many interpretations depending on the context…

I think arrays have a structure besides the fact that we have a finite set of elements. In the concept of an array there is some assumption about how these elements are layout. Finite collections do not have any layout, they are just indexable collections with a finite number of elements. So maybe the interface that is close to what I am seeking is the indexable interface. It lacks the length function in the Julia docs: https://docs.julialang.org/en/latest/manual/interfaces/#Indexing-1 We could either provide a default implementation for it like: length(c) = lastindex(c) - firstindex(c) + 1 or ask users to instead implement length for indexable collections, and then have the defaults firstindex(c) = 1 and lastindex(c) = length(c). My point is that indexable collections do not have a length in the API.

I think you misunderstood my concern @heliosdrm. The number of elements of [[1,2,3], [4,5,6]] is 2 for example since it has to elements, which happen to be arrays in this case. I am saying that we could have instead a dataframe df = DataFrame(a=[1,4], b=[2,5], c=[3,6]), and in that case it would make sense to call length(df) and get 2 as well. Right now we have to do size(df, 2). So writing algorithms that work on any finite collections is not easy.

Sorry, I don’t understand what this means. Array has a column major storage layout, but that is not part of the interface, <:AbstractArray types can store elements as they wish (eg sparse matrices), or calculate them on demand, etc.

1 Like

Tables.jl provide the abstraction for this. While the Tables interface doesn’t have a nrow function and not all objects that implement the Tables interface are guaranteed to have a defined number of rows, something like length(first(Tables.columns(t)) should be very robust.

The documented definition of length seems to fit the bill, but it’s a slightly odd name in some cases. I have a problem with seeing the “length” of a multi-dimensional array as the number of elements, or intuiting what it should mean for a table.

The word nelms seems very obscure to me. Matlab uses numel for this, which, imho, is clearer. Perhaps numele, numelements?

1 Like

Relatedly, without formalizing this information somehow you can’t build a general object or describe its sizing, which leads to this issue:

I am not sure of a good general solution, since there’s hard examples like BandedBlockBandedArray which is composed of bands and blockbandedness, or MultiScaleArray which is a full graphical structure.

2 Likes

To quote the important part from the link @Tamas_Papp gave
https://docs.julialang.org/en/latest/manual/interfaces/

Depends what IteratorSize trait you implement for your iterable.

Value returned by IteratorSize(IterType) Required Methods
HasLength() length(iter)
HasShape{N}() length(iter) and size(iter, [dim])
IsInfinite() (none)
SizeUnknown() (none)

length is the number of elements when handled linearly,
size is the shape.

It might have been nice to call these something else, but that ship has sailed.
Revisit before Julia 2.0.

1 Like

Yes, I agree that the name length is odd. As soon as we have a name that we can call to mean the number of elements, I don’t care much about what this name is. nelms seems short, like ndims that is why I think it is better than the alternatives.

So it seems like we miss a mandatory definition of length in the iterator interface, and that makes sense. You can iterate over collections or iterators of infinite length. However, we don’t have an interface for finite collections that requires the implementation of length and that is my concern. I would like to implement algorithms that work on any finite collection, a dataframe where the rows are the samples, an multi-dimensional array where the entries are ordered linearly to produce a final length, and other custom types for which the notion of length can be accessed. Right now it doesn’t like we have a concept for that. Developers cannot assume an indexable collection to have length defined. Makes sense? Is there anything we can do about it?

To make things more clear, I’ve implemented a bunch of methods for density ratio estimation here: https://github.com/xukai92/DensityRatioEstimation.jl They assume that we have a finite collection of samples x_nu and another finite collection of samples x_de. In the implementations, I am calling the function length to build for example the Kernel Gramian matrix of size length x length. But that doesn’t work with dataframes for example. We should be able to take a dataframe as a collection of samples (the rows), and build the Gramian matrix as if we had a vector of vectors.

Isn’t that HasLength()?

2 Likes

In this case, the old length method (and size(df,2)) would return 3, since the length of a DataFrame was the number of columns. This illustrates a problem with what you’re asking for, namely that for a generic table, are the “elements” rows, columns, or cells? I often switch between wanting to treat columns as units, rows as units, or individual cells as units. IIRC, this is why length was deprecated, because it wasn’t clear that there was a universal understanding of whether length should refer to the number of columns or the number of rows

But if you had a nelements method, I’d intuitively expect it to return the number of cells (that is, nrow x ncol).

Well, you could do ncol(df) or nrow(df) too. I understand your point that this is not generic, but I’m not sure there is a concept that generalizes across finite collections. I think it’s ok to expect that users might need to do some data transformation of they’re passing weird data structures.

I thought that in a dataframe the rows always referred to the samples? What is the dataframe where the samples are the columns?

I’m not a DataFrames.jl dev but I can imagine it is difficult to define what “element” is in DataFrame in consistent with other indexable collections like arrays and dicts. For example, how map should behave is not decided yet (https://github.com/JuliaData/DataFrames.jl/issues/1952 / https://github.com/JuliaData/DataFrames.jl/issues/2053) even after a long development effort.

It looks like your interest is actually finite collections of samples for statistics application (e.g., I don’t see interest in this discussion for other collection types like AbstractDict and AbstractSet although they are typically finite). In that case, using Tables.rows might be the best way to go. Something like

function finitesamples(x)
    if Tables.istable(x)
        x = Tables.rows(x)
    end
    if !(IteratorSize(x) isa Union{HasLength, HasShape})
        error("Some nice error message")
    end
    return x
end

This way, the return value of finitesamples is guaranteed to support length and each item corresponds to a “sample” if the input is a table. This handles any tables including DataFrame and also generic finite iterators. (Although you may need to wait until https://github.com/JuliaData/DataFrames.jl/pull/2051 to be released if you need zero-copy Tables.rows(df).)