I might have missed this but why is `head(df::AbstractDataFrame)` deprecated?

I do not think it is a major blocker :smile:. I think of it as one of those things that sit on the border of personal preference, and I acknowledge that people might have diverging opinions on it (that is why I was OK with deprecating it when you proposed it: you had a “big pro deprecation” opinion while I had a “very small con deprecation” opinion).

This issue is really minor (but I am OK to discuss it if people want). Compare this to the conflicting opinions on how plotting should be implemented :smile:.


No, it’s just a little annoying in day-to-day use.

I don’t understand why one would prefer legacy over generality. If there are existing functions that are used all over Base or other packages and that make perfect sense in the context of a dataframe, why prefer different names? Just because they are legacy from pandas or R? I could see an argument for the DataFrames use of first not being the same as in Base, but, for example, why not prefer size(df, 1) over nrow(df) when size has a perfectly universal meaning in Julia? Is it really that much more typing? I feel like nobody would ask for nrow for an array.


I think your example shows well why people might have different opinions about those (and it is IMO OK to differ).

Consider this comparison:

julia> using DataFrames

julia> x = rand(2,2)
2×2 Array{Float64,2}:
 0.127077  0.660575
 0.915064  0.8524

julia> df = DataFrame(x)
2×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.127077 │ 0.660575 │
│ 2   │ 0.915064 │ 0.8524   │

julia> size(x)
(2, 2)

julia> size(df)
(2, 2)

julia> first(x)
0.12707672347662347

julia> first(df)
DataFrameRow
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.127077 │ 0.660575 │

That’s a really good argument for first, it doesn’t seem to be a good argument for nrow or ncol.
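For what it’s worth, here is a minimal sketch of the nrow/ncol versus size comparison being discussed. The column names and values are made up for illustration; the calls themselves (size, nrow, ncol) are the ones named in the posts above.

```julia
using DataFrames

# Small illustrative table (values are arbitrary)
df = DataFrame(x1 = [0.1, 0.9], x2 = [0.6, 0.8])

# Generic Base accessor, same call as for a matrix
rows_generic = size(df, 1)
cols_generic = size(df, 2)

# DataFrames-specific shorthands give the same numbers
@assert nrow(df) == rows_generic
@assert ncol(df) == cols_generic
```

So the two spellings are interchangeable here; the disagreement is only about which name people should have to learn.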

FWIW, I think that

  1. first is a bad pun here: DataFrames per se are not iterable (one has to go through the Tables.jl interface to get a row iterator),

  2. it is not really a good idea to combine the extraction of the first row (a DataFrameRow) and the first n rows (a DataFrame) in a single interface function; that’s like a function sometimes returning a matrix, sometimes a vector,

  3. one should not necessarily introduce a function for either of these, but think about the purpose: the user usually just wants a preview of the dataframe without looking at the whole thing; so just make Base.show print the first and last 10-10 rows or so (currently I think it is 20-20). For the rest, there are the shiny new Base.getindex accessors.
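To make point 2 concrete, here is a small sketch of the two return types. The data frame is invented for illustration, and this only shows the behaviour of first as described in this thread.

```julia
using DataFrames

df = DataFrame(x = 1:100, y = 101:200)

row  = first(df)      # one row
top3 = first(df, 3)   # a table with three rows

# Same function name, two different kinds of result
@assert row isa DataFrameRow
@assert top3 isa DataFrame
@assert size(top3) == (3, 2)
```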

That said, I think these things are the decision of the package maintainers, and the best place to discuss them is in issues. It is great to solicit input from users, but interfaces need a single architect, or possibly a small team of 2-3 individuals, but not more.


Exactly!

head, nrow and other single-argument functions are more convenient for interactive use:
DataFrame(...) |> ... some processing ... |> head
vs
DataFrame(...) |> ... some processing ... |> x->first(x, 5)

You can write the latter a bit more concisely with Query.jl:

DataFrame(...) |> ... some processing ... |> @take(5)
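If you would rather not pull in another package, a rough sketch of the same pipeline using only Base and DataFrames could look like this. The name take5 is just something I made up; Base.Fix2 fixes the second argument of a two-argument function.

```julia
using DataFrames

df = DataFrame(x = 1:100, y = 101:200)

# Anonymous function, as in the pipeline above
a = df |> (d -> first(d, 5))

# Reusable single-argument version (take5 is a made-up name)
take5 = Base.Fix2(first, 5)
b = df |> take5

@assert a == b
@assert nrow(a) == 5
```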

In terms of “conceptual purity”, I agree that, if one needs eachrow to iterate rows, then getting the first row should be first(eachrow(df)) or df[1, :] (esp. since it seems DataFrames are a 2D container in that you need two indices to get a value, unlike for example IndexedTables). Why not simply df[1:6, :] to get the first few rows?
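A quick sketch of those alternatives side by side, on an invented data frame. I use copy to turn each DataFrameRow into a NamedTuple so the comparison is on plain values.

```julia
using DataFrames

df = DataFrame(x = 1:100, y = 101:200)

# Two ways to get the first row without first(df)
r1 = first(eachrow(df))
r2 = df[1, :]
@assert copy(r1) == copy(r2)
@assert copy(r1) == (x = 1, y = 101)

# Plain indexing for the first few rows
top6 = df[1:6, :]
@assert top6 == first(df, 6)
```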


I agree, but I still wonder what the use case is. In R, assignments do not print, and values print in full, so

> df <- data.frame(x = 1:100, y = 101:200)
> head(df)
  x   y
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105
6 6 106
> df # and this prints 100 lines, not included

and head makes sense. But in Julia we can just make Base.show show a couple of rows from the beginning and the end (this is what is currently done, I would show fewer rows).

What’s the use case for head (or first(..., n)) when we have this?


Good point. I admit my use case can be reduced to too many rows being printed in Jupyter.
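One possible workaround in the meantime, as a sketch only: a tiny helper that returns the first and last few rows. The name preview and the default of 3 are my own invention and not part of DataFrames.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames function): first n and last n rows
preview(df::AbstractDataFrame, n::Integer = 3) =
    nrow(df) <= 2n ? copy(df) : vcat(first(df, n), last(df, n))

df = DataFrame(x = 1:100, y = 101:200)
p = preview(df)

@assert nrow(p) == 6
@assert p.x == [1, 2, 3, 98, 99, 100]
```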


PrettyTables.jl is awesome here (requires one line of setup, but then @pt df)


I do agree that this is a printing issue rather than a function issue. R just spits out a lot of data when printing a df, and you do need head and tail to make sense of large dataframes. Those functions can be replaced with nicer output that does not overwhelm the REPL and allows a quick glance at the df to make sure things worked as expected. I understand a lot of us have a lot of “tics” and muscle-memory patterns coming from R, but when I thought about some of the design choices in R, they really don’t make much sense, while most of them do in Julia.


OK, I’m starting to change my mind about this. I now agree with @Tamas_Papp here (post #26 in this thread).


Have you considered making a small package DataFramesExtra in which you would dump things like ncol, nrow, head, tail and all these convenience utilities that people from R or Pandas expect to find. Instead of endless discussions, you could point people to the package, while “not recommending it”.
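Just to show how little code is involved, the convenience functions could be as small as the sketch below. No such package is implied to exist; the definitions and the default of 6 rows (borrowed from R) are my assumptions.

```julia
using DataFrames

# Hypothetical convenience aliases, defined in user code only
head(df::AbstractDataFrame, n::Integer = 6) = first(df, n)
tail(df::AbstractDataFrame, n::Integer = 6) = last(df, n)

df = DataFrame(x = 1:100, y = 101:200)

@assert head(df) == df[1:6, :]
@assert tail(df, 2).x == [99, 100]
```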

No. Have you?

This was a comment intended for Bogumił.


In fact, I have not seen these discussions for many months now. However, incidentally, this week in my blog post I planned to comment on why we support nrow and ncol but not head and tail. The core of the issue is:

  • number of verbs the user has to learn
  • how likely a name clash is when we introduce some exported name
  • is there already a name in Julia Base that does a similar thing

I completely understand and just as Michael said, it’s just a matter of getting used to it. DataFrames is a fantastic package.

Personally I use R, Python, Julia, Maple, Matlab (and a few more) on and off as the need arises, have become skilled in none of them, and find it difficult to remember right away whether it’s head(df), df.head, first(df), or whatever.

The first step for me is usually:

julia> ?head

When that fails I google for “Julia DataFrames head”. This is how I got to this thread and no doubt will visit again in the future.

Might be useful to have ?head return something like “Couldn’t find head. Perhaps you are looking for first” rather than the current “Couldn’t find head. Perhaps you meant read, …”