What is the agreed abstraction for the "size" of finite collections?

While it may be convenient to think that way for some problems, it’s not implied by the interface for dataframes. In some contexts a collection of columns, or a shaped iterator of elements as a matrix, also makes sense.

As suggested by @tkf, you probably want Tables.rows. Use cases like this were actually one of the main motivations for that package.
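For example, a minimal sketch of what that looks like, assuming the source is a DataFrame:

using DataFrames, Tables

df = DataFrame(a = 1:3, b = [10.0, 20.0, 30.0])

# Tables.rows gives a row iterator regardless of how the table is stored;
# each row supports property access by column name.
for row in Tables.rows(df)
    println(row.a, " => ", row.b)
end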

Generally, unless it is clear that there should be only one way to “iterate” over some collection, it is usually better API design not to provide a method for Base.iterate directly, but to offer it via some shim that clarifies intent.

Yes, I was typing size(df, 2) incorrectly without thinking. Please replace it with size(df, 1) in all my comments above.

It is still strange to me that the concept of a dataframe is not the standard concept of a finite collection of samples (the rows) with identified variable names (the columns). I understand that some specific applications need to iterate over columns, or over the cells, but that doesn’t seem to be the purpose of the data structure for most applications (statistics). Why not fix the concept to mean what it usually means in statistics? Any iteration over columns or over cells seems to exist merely for internal implementations, not for statistical end-user applications.

I am not sure I get the point of trying to support multiple concepts of iteration in a single data structure. There should be a primary method to iterate, in my opinion, and any method that is not the intuitive one should have a specific wrapper type that does the right thing, like Tables.columns(df). What I am trying to say is that the concept of a dataframe should be naturally equipped with an iterator semantic, Tables.rows(df), so that we could assume df[i] iterates over rows, not cells or columns.

The question of what the behavior of map should be is another important one, but I think we don’t need to commit to answering it here now. I can think of at least two natural options, again for statistical applications: the first would be to simply map over the rows without any variable names (an iterator of tuples); the second would be to map over named tuples. Does what I am trying to say make sense?

I think we would benefit from a default iteration mode (the rows, the samples, the observations, however we call it), and then make alternative iteration modes available through the Tables.jl API for example by calling Tables.columns or Tables.entries or something.
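For instance, with what exists today (Tables.entries above is just a name I made up):

using DataFrames, Tables

df = DataFrame(a = 1:3, b = 4:6)

# explicit column-wise access through the Tables.jl API;
# for a DataFrame, Tables.columns just returns the data frame itself
cols = Tables.columns(df)
sum(cols.a)   # 6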

The main problem with this approach of introducing a bunch of type-dependent branches in the code is that it does not scale, as you know. It is annoying to go over the many algorithms defined for finite collections and always introduce an if-else branch for dataframes, arrays, etc. Finite collections of samples are everywhere in statistics and other fields.

A related question that someone just asked here on Discourse, which I think is very connected to this discussion: How to sample a Data frame. Observe how most people interpret a dataframe as I described above. Pandas has methods to sample “observations” of a dataframe that are stored as rows: pandas.DataFrame.sample — pandas 1.4.4 documentation

I am not sure about this, traits should scale just fine.

Because there are lots of applications for DataFrames that are not statistics, and even when doing stats, samples-by-row is not the only way data can be laid out. I regularly have data where samples are columns. Plus, Julia is column-major, so (I think… I’m not actually positive) iterating by row is actually more expensive.

I don’t actually agree that by row is the most intuitive. Neither do the DataFrames devs apparently, which is why it hasn’t been ironed out.

It makes sense, but you don’t seem to acknowledge that there are other ways people might want to do things. Tables.jl exists precisely because there are a lot of table implementations, and there are folks who might want to iterate by columns or by rows or both. I am one of those people. Locking in the “preferred” way doesn’t make a lot of sense to me.

Isn’t this what multiple dispatch is for? Or traits as @Tamas_Papp mentioned. Or use Tables.jl. It seems to me there are plenty of workable solutions that acknowledge the diversity of use-cases.

I don’t think a couple of replies to a statistical question captures the concept of “how most people interpret” it.

1 Like

Your comment seems to imply that I said I would like to remove all other iteration patterns. All I am saying is that every data structure has a natural first-class iteration mode; other modes can be accessed with the Tables interface, for example.

Really? Iterating over columns is the most intuitive mode for you? Can you please share the comments from the devs stating this intuition?

You are again misinterpreting my comment. I only said that df[i] could mean the i-th sample of the dataframe (a row). I never said that I would like to remove the other iteration modes accessed with Tables.columns(df) etc.

You are touching on very disconnected features here. Multiple dispatch and traits do not solve design problems by themselves; they are features you can use to build very nice abstract concepts, like finite collections of samples. Having a set of traits doesn’t mean your code will look generic. Ideally, I would like the code below to work with dataframes, vectors of vectors, or any type that implements finite collections of samples:

# generic Gramian matrix; assumes `samples` supports length and 1-based getindex
N = length(samples)
K = zeros(N, N)
for i in 1:N, j in 1:N
  K[i,j] = kernel(samples[i], samples[j])
end

Fair enough. We can do a Google search for “dataframe” and gather evidence for the interpretation I suggested. I guarantee you that it will be the most popular interpretation, where “popular” here includes non-statisticians.

Seems we’re getting a bit far afield from your original question about the abstraction of size, but it’s your thread :slight_smile:

You’re asserting this, but I haven’t seen any evidence that row-based iteration is somehow natural. Your proposal to do some google searching could be informative, I’d be interested to see that analysis.

In many cases, yes. There are a bunch of discussions that have occurred on github, discourse, and slack. Here’s a place to start, but it’s not the whole story by any means. Perhaps @bkamins can weigh in, since he was a major driver for the new syntax.

Incidentally, before indexing with a single value was deprecated in DataFrames.jl, df[i] returned the i-th column, not row. So clearly at least one developer had a different notion than you w/r/t what is natural.

Perhaps I’m just misunderstanding what you’re trying to achieve. You were talking about a bunch of if-else branches depending on type. But one can do eg

my_func(x) = # some generic fallback
my_func(x::AbstractArray) = # whatever is unique to arrays
my_func(x::MyType) = # whatever is unique to MyType

rather than

function my_func(x)
    if x isa AbstractArray
        # whatever is unique to arrays
        # etc...
    end
end

Traits still aren’t very intuitive for me, so I always have to look them up, but I know you could dispatch, for example, on whether Tables.rowaccess(my_thing) == true. No need for if-else branching.
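For example, a sketch of that kind of trait-based dispatch (nobs is a hypothetical helper name, not part of any package):

using Tables

# lift the trait into the type domain with Val, then dispatch on it
nobs(x) = _nobs(Val(Tables.rowaccess(x)), x)

# row-accessible tables: count the rows through the Tables.jl row iterator
_nobs(::Val{true}, x) = count(_ -> true, Tables.rows(x))

# fallback for plain collections such as vectors of samples
_nobs(::Val{false}, x) = length(x)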

Wouldn’t N = size(samples, 1) work in this case? This assumes of course that the samples are always along the first dimension, but you seem to be assuming that anyway, right?

1 Like

Perhaps @bkamins can weigh in, since he was a major driver for the new syntax.

I implemented the syntax change, but as you probably know, @nalimilan and I mainly try not to force our own opinions, but rather to listen to what users want and weigh it against possible design issues.

A good place to start thinking about this issue is not indexing but any function like filter, unique or sort: ask yourself whether you expect it to work on rows or on columns. Most people conclude that the row-oriented approach is more natural.

Of course for some cases, like select, a column-oriented approach is an obvious choice, but even there, if you allow column transformations, there is a debate whether such functions should receive whole columns or rows (see Allow rename when selecting by innerlee · Pull Request #1975 · JuliaData/DataFrames.jl · GitHub for the discussion and conclusions).

Next, in broadcasting a data frame is most naturally viewed as a 2-dimensional object, so that e.g. ismissing.(df) returns a data frame with Boolean columns that indicate where missing values in df were present.
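For example (a minimal sketch, assuming a current DataFrames release):

using DataFrames

df = DataFrame(a = [1, missing, 3], b = [missing, 2, 3])

# broadcasting treats df as a 2-dimensional container; the result is a
# DataFrame of Bools marking where the missing values were
ismissing.(df)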

Given these considerations we did not find a single best option. In general a DataFrame is treated as row-oriented for the majority of functions (this is a decision @nalimilan fixed some time ago), but we currently deliberately disallow iterating over a data frame. This is also the reason why df[i] is disallowed: a data frame is a 2-dimensional object and requires two indices. In theory (and in the past) df[i] could be and was allowed, but disallowing it makes the mental model of a data frame explicit to users: a 2-dimensional object. We also believe this makes users more aware of what happens when they write df[:, i] vs df[!, i]; in the past the df[i] syntax led to many bugs in user code simply because casual users of DataFrames.jl did not fully understand what it was doing.
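To illustrate the df[:, i] vs df[!, i] distinction (a minimal sketch, assuming current DataFrames semantics where : copies the column and ! does not):

using DataFrames

df = DataFrame(a = 1:3)

copied  = df[:, 1]    # a copy of column :a
aliased = df[!, 1]    # the underlying column itself, no copy

aliased[1] = 100
df.a[1]       # 100: mutated through the alias
copied[1]     # 1: the copy is unaffected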

However, being able to iterate over rows and over columns of a data frame is useful. That is why the eachrow and eachcol functions are provided (exactly like in Base for arrays). Note that these wrappers are not only <:AbstractArray but also provide selected functionalities of a data frame (they support getproperty, they will conform to the Tables.jl interface in the 0.20 release, and they print like a data frame).
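For example (with a current DataFrames release):

using DataFrames

df = DataFrame(a = 1:2, b = ["x", "y"])

for r in eachrow(df)
    println(r.a, " ", r.b)    # rows support getproperty
end

for c in eachcol(df)
    println(eltype(c))        # each c is a column vector
end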

4 Likes

Thank you @bkamins for joining the discussion.

Fully agree.

Let me see if I understand: broadcasting needs to be implemented in terms of the getindex method that we are discussing here?

I don’t quite follow the conclusion. I like the fact that in general a DataFrame is treated as row-oriented for the majority of functions, and that is why I thought we could add getindex to this family. Disallowing this default iteration mode doesn’t seem to help make algorithms generic, as discussed above.

I disagree with this point of view. A dataframe is most useful when we have a collection of observations with identified variable names. The entities involved are entire observations, not pieces of an observation to be indexed with two indices. Nevertheless, my suggestion doesn’t invalidate the two-index view, as both can co-exist.

I understand that these are equivalent to the Tables.jl effort with Tables.rows and Tables.columns? I like that they exist, but my point is that writing generic algorithms for finite collections of samples is cumbersome without an agreed interface for such collections. As suggested above, this interface would encompass the most natural view of dataframes as collections of row-observations, as well as vectors of vectors and custom types. If we could agree on such an interface, it would be very useful to package developers.

So coming back to the original question: 1) is there an official interface for finite collections of objects in the language, with a function for the number of objects in the collection (e.g. length)? If the answer is yes, what is the name of the interface? If the answer is no, can we consider one?

Take my motivating code snippet as an example to consider the interface:

# generic kernel (Gramian) matrix between two collections of samples
function kernel_matrix(xs, ys)
  m = length(xs)
  n = length(ys)
  K = zeros(m, n)
  for i in 1:m, j in 1:n
    K[i,j] = kernel(xs[i], ys[j])
  end
  K
end
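For example, hypothetical usage (kernel here is just an illustrative Gaussian kernel, not part of any package):

# stand-in kernel used by kernel_matrix above
kernel(x, y) = exp(-sum(abs2, x .- y))

xs = [rand(3) for _ in 1:5]
ys = [rand(3) for _ in 1:4]
K = kernel_matrix(xs, ys)    # 5×4 matrix of pairwise kernel values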

And expand it to various similar algorithms operating on finite collections of objects.

This approach turned out to be problematic for a lot of types, because once there are multiple options, “natural” starts to become very subjective, leading to bugs, misleading expectations, and endless arguments about them, with people trying to convince others that their approach is “natural”, like you are doing here.

Eg consider associative collections (eg Dict): do you want to iterate over keys, values, or pairs of them?

The right solution is not to pick a “natural” one, but to allow the user to be explicit about intent. Moreover, Julia allows making this abstraction costless, so it is the preferred approach for API design these days.
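For example, with a Dict the user states intent explicitly:

d = Dict(:a => 1, :b => 2)

collect(pairs(d))     # explicit: key => value pairs (also the default iteration)
collect(keys(d))      # explicit: keys only
collect(values(d))    # explicit: values only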

4 Likes

As others (including me) have pointed out a few times, the answer is “yes, it is the iterator interface, especially the IteratorSize trait”.
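For reference, the trait can be queried directly:

julia> Base.IteratorSize(typeof([1, 2, 3]))
Base.HasShape{1}()

julia> Base.IteratorSize(typeof(Iterators.repeated(0)))
Base.IsInfinite()

julia> Base.IteratorSize(typeof(eachline(IOBuffer("a\nb"))))
Base.SizeUnknown()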

But my guess is that it would be much more productive to stick with the concrete question: “should DataFrame be a row iterator?”

My hunch is that this is the source of the confusion. I think the common expectation is that x[i, j] in Julia is not a synonym for x[i][j] (unlike Python/NumPy). So it is impossible to support what you suggest without deviating from the common Julia interface.
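For example:

julia> M = [1 2; 3 4];

julia> M[1, 2]    # two-dimensional indexing
2

julia> M[1]       # linear (column-major) indexing, not the first row
1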

1 Like

No, broadcasting requires other methods (the topic is complex; see https://github.com/JuliaData/DataFrames.jl/blob/master/src/other/broadcasting.jl for details; an implementation detail is that we process data column-wise in broadcasting, but this is only for performance reasons). However, the key thing broadcasting requires is for a data frame to have 2 axes, which implies that a data frame must be a two-dimensional object (in particular, broadcasting does not require an object to be iterable).

The problem is that in the past df[i] returned a column, not a row, in DataFrames.jl for many years, and it does so also in other ecosystems users come from.

We have not made a decision whether df[i] should be allowed in the future. If we allow it, then returning a row is the natural thing to do. But until this happens:

  1. df[i] is now deprecated and calling it returns a column. This deprecation will be kept around for a while (the exact duration is a rough number; the point is that in old user code we need to give users information on how to update it).
  2. Then, for some time, df[i] should be strictly disallowed.

We (i.e. at least @nalimilan and I) feel that deprecation of old functionality should be graceful and slow (otherwise we would get a lot of disappointed users), so this process will take at least a year from now. After we are done, the discussion of what to do with df[i] can be reopened. Until then, in order not to confuse users, the official statement is: a data frame is 2-dimensional, so it requires 2 indices for indexing.

Tables.rows will be the same as eachrow (this change is not released yet).

Tables.columns just returns a data frame. This function does not guarantee to return an iterable, while eachcol returns an iterable (and even more - it returns an AbstractVector).

The crucial point in the discussion is that for the time being we have not decided to make data frames iterable, as it is a very committing decision. I agree that it would be nice to have a nicely workable iterable approach to tabular data, but that is why we provide eachrow, and @tkf has recently improved its usability.

The problem with making a data frame iterable is that in generic code people often assume that length(A) == prod(size(A)) (which is satisfied for arrays), and that if a collection is indexable it should support lastindex, as in A[lastindex(A)]. Defining lastindex was problematic for a data frame: if we wanted it to be the last row, that is currently df[end, :], but that is not one index, it is two.
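For example, the assumptions generic array code tends to make:

A = rand(3, 4)

length(A) == prod(size(A))    # true: generic code often relies on this
A[lastindex(A)] == A[end]     # true: one linear index addresses the last element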

3 Likes

Thank you @bkamins, I will see what I can do in the meantime to write these algorithms. If the discussion comes up again on GitHub in the future, could you please share it in this thread? I really think we would benefit from a default row-oriented iteration mode.

I disagree with your viewpoint, @Tamas_Papp. “Natural” here is not subjective as you said; it is what people expect when they choose a data structure. Dataframes were designed with statistical applications in mind: specifically, they were created to represent collections of observations with identified variable names. If something deviates from this concept, it shouldn’t be called a dataframe. Your argument relies on the assumption that choosing a default iteration mode compromises the other iteration modes. This assumption does not hold: eachrow and eachcol will exist just the same, and users will be free to pick what they want.

1 Like

Somewhat ironically, this very discussion is a demonstration that people can easily disagree about what is “natural” :wink:

You misunderstand my point — I am not claiming that picking a default iteration precludes others, just that when there are multiple options, it is better not to pick a “natural” one without a compelling reason, and the user should have to ask for iterables explicitly.

Signaling intent is usually better style anyway: someone reading

for r in obj

would have to recall how obj iterates (again, when there are multiple possibilities), while

for r in eachrow(obj)

is clear.

Nevertheless, iterating over “rows” of the dataframe (or, more generally, of anything supporting the excellent Tables.jl interface) is not the only valid way of iterating. Users also frequently iterate over “columns” (eg to obtain a summary); neither is much more “natural” than the other.
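For example, a column-wise summary in Julia, the analogue of the R sapply example below:

using DataFrames, Statistics

df = DataFrame(a = 1:10, b = 11:20)

[mean(c) for c in eachcol(df)]    # [5.5, 15.5]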

Also, please recall that in R (arguably the mother of all data frame use cases) default iteration is by columns:

> df = data.frame(a = 1:10, b = 11:20)
> sapply(df, mean)
   a    b 
 5.5 15.5 
2 Likes

Yes, it is clear. It is also not generic. Now the user passes a vector of vectors, and eachrow is undefined. Now the user passes a custom type, and one has to figure out the name of the function that iterates over samples.

The conclusion is that R has a terrible design, a very common conclusion in various threads where the language is cited.

1 Like

A small note: eachrow is defined in Base, so it is actually defined for vectors too:

julia> x = [[1,2], [3,4]]
2-element Array{Array{Int64,1},1}:
 [1, 2]
 [3, 4]

julia> for v in eachrow(x)
       println(v)
       end
Array{Int64,1}[[1, 2]]
Array{Int64,1}[[3, 4]]

As for the discussion of row-orientation in DataFrames.jl I think that crucial decisions will be made in https://github.com/JuliaData/DataFrames.jl/issues/2053. I have CCed you there so you will get notifications if something moves forward there.

2 Likes

Many people working in statistics come from R. In that language data frames are implemented as lists of columns, so for those people the structure provided by DataFrames.jl may seem quite natural.

3 Likes