DataFrames: why is `df[2,]` the same as `df[2]`?

Curious about the rationale for this. Seems like df[2,] should either be the second row (same as df[2,:]) or an error.

Thanks

In general, Julia allows optional trailing commas in parens, brackets, etc. I personally find languages that don’t allow optional trailing commas quite fussy and annoying. The only case where this is significant is to distinguish parenthesized expressions like (x+1) from a one-tuple like (x+1,).
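For anyone who wants to check this at the REPL, here is a minimal sketch, assuming a recent Julia release and using only Base. The variable names are made up for illustration.

```julia
# Trailing commas are accepted in array literals and call argument lists.
v1 = [1, 2, 3]
v2 = [1, 2, 3,]
same_vector = v1 == v2

m1 = max(1, 2)
m2 = max(1, 2,)
same_call = m1 == m2

# Parentheses are the one place where the trailing comma changes the meaning.
x = 1
grouped = (x + 1)      # an Int: plain parenthesized expression
one_tuple = (x + 1,)   # a Tuple{Int}: one-element tuple
```

Here `same_vector` and `same_call` should both be `true`, while `grouped` and `one_tuple` have different types.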


This is not DataFrame-specific, rather a feature of parsing:

julia> a = reshape(1:9, 3, :)
3×3 Base.ReshapedArray{Int64,2,UnitRange{Int64},Tuple{}}:
 1  4  7
 2  5  8
 3  6  9

julia> a[2,:]
3-element Array{Int64,1}:
 2
 5
 8

julia> a[2,]
2

julia> a[2]
2

julia> expand(:(a[2,]))
:(getindex(a, 2))

julia> expand(:(a[2,:]))
:(getindex(a, 2, :))

Given that it is easy to omit the : if you come from a language that doesn’t need it (e.g. R), this can lead to silly mistakes where you silently get a completely different object (an element instead of a slice).
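The difference can be checked without the REPL output as well. This is a small sketch assuming a recent Julia release, so it uses `Meta.parse` instead of the older `expand`. Only Base is needed.

```julia
a = reshape(1:9, 3, :)

# Both spellings parse to the same expression, so they end up as the same getindex call.
same_parse = Meta.parse("a[2,]") == Meta.parse("a[2]")

elem = a[2,]     # linear index 2: a single element
row = a[2, :]    # second row: a 3-element vector
```

So the trailing comma is dropped by the parser before any array or data frame code runs, and only the explicit `:` produces a slice.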

When teaching Julia to students with a background in another language, I saw them running into this frequently. Please consider opening an issue.

In addition to the indexing behavior common to arrays and data frames, the fact that df[2] returns the second column of a data frame (both in R and in Julia) is not totally logical and might change. This could reduce the confusion. See discussions in this issue and this one.


I’m not sure if it’s relevant, but I really like the Columns type from IndexedTables, where you can easily extract a column but also iterate on rows, and both operations are efficient:

julia> using IndexedTables

julia> df = Columns(x = ["a", "b"], y = [1, 2])
2-element IndexedTables.Columns{NamedTuples._NT_x_y{String,Int64},NamedTuples._NT_x_y{Array{String,1},Array{Int64,1}}}:
 (x = "a", y = 1)
 (x = "b", y = 2)

julia> columns(df,:x)
2-element Array{String,1}:
 "a"
 "b"

julia> for i in df
           println(i)
       end
(x = "a", y = 1)
(x = "b", y = 2)

In particular the row iterator (of named tuples) is useful because it makes it very easy to select rows, which at the moment is a bit clumsy on a DataFrame without extra packages. I think it’d be really cool to have a unified interface with similar features in terms of column extraction and row iteration across DataFrames and Columns.
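To make the idea concrete, here is a rough sketch of "columns plus a row iterator of named tuples" using only Base named tuples. This is not how IndexedTables or DataFrames implement it, and the names `cols`, `rows` and `selected` are hypothetical. It assumes a Julia version where named tuples are part of Base.

```julia
# A "column table": a named tuple holding one vector per column.
cols = (x = ["a", "b"], y = [1, 2])

# Column extraction is just field access.
xcol = cols.x

# Row iteration: build one named tuple per row.
rows = [(x = cols.x[i], y = cols.y[i]) for i in eachindex(cols.x)]

# Selecting rows is then an ordinary filter over named tuples.
selected = filter(r -> r.y > 1, rows)
```

Because each row is a named tuple, row selection needs nothing beyond `filter` and field access.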


We basically have this interface in DataFrames already:

julia> using DataFrames

julia> df = DataFrame(x = ["a", "b"], y = [1, 2])
2×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1   │ a │ 1 │
│ 2   │ b │ 2 │

julia> df[:x]
2-element Array{String,1}:
 "a"
 "b"

julia> for row in eachrow(df)
           println(row)
       end
DataFrameRow (row 1)
x  a
y  1

DataFrameRow (row 2)
x  b
y  2

The two differences are:

  1. We require eachrow, which has the advantage of being explicit (you could also want to iterate over columns).
  2. We return a DataFrameRow object, but this will likely change once NamedTuples are added to Base. That will also improve performance by storing type information (which currently does not happen with DataFrameRow).

Though I agree a unified interface would be nice.

I see. Now I remember that there is a row iteration but it has performance issues due to missing type information. Still, in the longer term, if there are no conceptual reasons why column/row extraction/iteration should behave differently in Columns and DataFrame, maybe the Columns type could be moved to a separate package, the interface could be changed to something everybody agrees on and, at least in my mind, this interface would be the minimum requirement for a “table-like structure”.

I can see the issue for people coming from R, but this is just going to get closed as “won’t fix”. Julia allows trailing commas wherever possible – we’re not going to change that.

An argument in favor of trailing commas is that it makes it much easier to insert/remove lines of code without worrying about the comma at the end of the previous line. For example, the following is valid Julia code even though the third element of the vector is commented out:

a = [
  1,
  2,
# 3 
  ]

My understanding is that Julia allows a single trailing comma, eg

julia> ones(3,3)[1,]
1.0

julia> ones(3,3)[1,,]
ERROR: syntax: unexpected ,

julia> (1,)
(1,)

julia> (1,,)
ERROR: syntax: unexpected ,

julia> getindex(ones(3,3), 1, )
1.0

julia> getindex(ones(3,3), 1, ,)
ERROR: syntax: unexpected ,

Can you explain the rationale behind this decision? I can only see the downside, because I would assume that a dangling comma must be the result of a variable I missed accidentally, but since you are sure of wontfix, there must be a benefit I am not seeing.

Commas are delimiters for elements in list-like things. Trailing commas are nice to support for exactly the reason that Per mentions. I know when I am working with large SQL queries or JSON objects I frequently have to go back and add/remove those trailing commas after re-arranging the order of the SELECT or some such. It means that you can have an exact 1-1 correspondence with the item and its delimiter.

Two commas without an intervening item are definitely evidence of an error. A final one? It’s just handy.
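The single versus double comma rule can be probed programmatically. The helper `parses` below is hypothetical, written only for this illustration. It assumes that `Meta.parse` throws on a syntax error, which is its default behavior in recent Julia releases.

```julia
# Hypothetical helper: true if `src` parses without a syntax error.
function parses(src::AbstractString)
    try
        Meta.parse(src)
        return true
    catch
        return false
    end
end

single = parses("(1,)")     # one trailing comma
double = parses("(1,,)")    # two commas with nothing in between
```

With this, `single` should come out `true` and `double` should come out `false`, matching the REPL errors shown earlier in the thread.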


Still, you had to introduce extra linebreaks (at least for the [ and the ], I assume). I find

[1, 2#=, 3=#]

somewhat more convenient.

The reason I used linebreaks is that I wanted to give an example of a multi-line expression…

I’ve been letting the responses to this thread roll around in my head for a couple of days, and FWIW, a few thoughts:

@StefanKarpinski

I personally find languages that don’t allow optional trailing commas quite fussy and annoying

This strikes me as an odd reason to allow a problem that may potentially lead to (silent) errors in user code. I see how it can make it easier to comment out code (as noted by @Per), but this seems like one context where the many scientific computing users coming from R are likely to really get in trouble and, say, aggregate across columns when they mean to aggregate across rows.

I find this tradeoff particularly problematic because Julia is a scientific programming language, a domain where I think defensive programming is especially important. If an error sneaks into an iPhone app, it (a) is likely to be caught since it will probably result in unexpected behavior, and (b) even if it gets deployed, it can be fixed in the next release. But scientific computing is fundamentally about computing unknown quantities, which means it’s not always possible to look at the output of an analysis and know if the result is real or the result of a bug – one that could easily make it into a paper and lead to “bad knowledge” or, if later caught, negative career consequences for the scholar. Thus I would hope Julia would put a big emphasis on preventing silent errors, especially when there aren’t performance tradeoffs.

In a similar vein, I recently heard an interview with Guido van Rossum where he said one guiding principle of Python development was “never blame the user”. If classrooms of users are making a mistake here (as suggested by @Tamas_Papp, and as I would certainly expect to see among my students), maybe that’s a problem with the language and not the users?

The only case where this is significant is to distinguish parenthesized expressions like (x+1) from a one-tuple like (x+1,).

It sounds like an exception has been made for tuples – maybe array indexing can be another exception?

Finally, a note on:

@StefanKarpinski

I can see the issue for people coming from R, but this is just going to get closed as “won’t fix”. Julia allows trailing commas wherever possible – we’re not going to change that.

In general, I have found the Julia community to be wonderful and exceedingly welcoming, but this feels like a bit of an excessive shutdown. If this has been subject to debate in the past, might I suggest a link to the past discussion? And if not, might I suggest declaring the outcome of a discussion pre-emptively is a little excessive?

The key is this (emphasis mine):

I think this just needs to be called out in the notable differences from other languages section (if it’s not already). Julia is a new and different language. Given the varied history of programming languages and similarly varied backgrounds of folks coming to Julia, there’s often not a solution that is equally obvious to all people. As a non-R person, R’s behavior here strikes me as terrifyingly subtle and surprising.

It’s also worth noting that tuples are also used to represent variable-length argument lists… and indexing is just a very simple syntax transformation for calling getindex with a variable-length argument list. These decisions aren’t really uncoupled special cases.
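A small sketch of that coupling, using a throwaway type (`IndexProbe` is hypothetical, defined here only to echo back whatever indices `getindex` receives). It assumes a recent Julia release and only Base.

```julia
# Hypothetical type, used only to record which indices getindex receives.
struct IndexProbe end

# Variable-length argument list: whatever appears between the brackets arrives as a tuple.
Base.getindex(::IndexProbe, idx...) = idx

p = IndexProbe()
plain = p[2]        # one index
trailing = p[2,]    # still one index: the trailing comma adds nothing
sliced = p[2, :]    # two indices: 2 and Colon()
```

`plain` and `trailing` are the same one-element tuple, while `sliced` carries a second argument, which is why an "empty" slot after the comma cannot mean anything different from no slot at all.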

It always hurts when you burn lots of time on a behavior you find surprising. We really do try to minimize those as much as possible. In some cases, though, there isn’t a choice that’s equally obvious to all folks. I’m reminded of this Stack Overflow question about different interpretations of the a:b syntax.


This wasn’t intended as a shut down – it’s to prevent you from wasting your time opening an issue for something that isn’t going to happen. To put this in perspective:

  1. This would be a very disruptive change – a lot of code in the wild would need to be fixed, much of which is not on GitHub and therefore cannot be automatically updated.
  2. This is a very R-specific feature that no other mainstream language I’m aware of has.
  3. In the five years that Julia has been public, I don’t think this has ever come up before.
  4. This is very much the 11th hour. Julia has been under development for eight years and Julia 1.0 is imminent – we are not considering basic syntax changes at this point.

The argument that this is a dangerous feature seems pretty overstated. Expecting that passing nothing has a distinct meaning from not passing anything actually strikes me as the more dangerous behavior. It should, however, certainly be documented as a difference from R.

It sounds like an exception has been made for tuples – maybe array indexing can be another exception?

The exception in that case exists because parenthesizing expressions has a fairly well-established meaning (especially in mathematical expressions), which is different from constructing a 1-tuple. If there were more ASCII bracket pairs, we could use a different one for constructing tuples, but alas, there are only three and so parens ends up somewhat overloaded.


OK, I appreciate that, and recognize you’re trying to deal quickly and briefly with a lot of open issues. I just wanted to flag how it read. The summary of concerns given here seems very reasonable, and suggests your conclusion was right. It just wasn’t clear what was motivating the statement.

That I can certainly understand, and likely makes the conversation moot. Thank you for your clarification on that.

I guess I’m less clear this is a case of that. Yes, it’s R-specific, but it seems like lots of Julia converts come from R, and this seems like a situation where an informative error (analogous to the delightful message that comes up when one tries to exponentiate with **) would solve the problem without really introducing new ones (though again, I’m open to the possibility there’s no way to carve out an exception for array indexing – just want to have the discussion).

I don’t think it is – will work on a pr.

Put in a PR for docs (#23907)