Release announcements for DataFrames.jl

There is already a PR for an empty data frame case here https://github.com/JuliaData/DataFrames.jl/pull/1890.

In consequence writing:

df = DataFrame()
df[!, :col] .= 1

will be possible but will create a 0-element vector. The reasoning behind it is that df has 0 rows, so adding a column to it should keep it this way.

Do you think a data frame with zero columns should have a different behavior than an empty data frame with at least one column here?

Also - is this case really a significant one? (I am aware that it might be popular when learning the package, but does it happen in real cases?)

Also note that you can do the following push!(df, (a=1,)), where df was created using DataFrame().
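For reference, the two operations discussed so far can be tried side by side. This is a minimal sketch that assumes a DataFrames.jl release in which broadcasting into a 0-row data frame is allowed (the behavior proposed in the PR above); the names df and df2 are only for illustration.

```julia
using DataFrames

# Broadcasting a scalar into a data frame with no rows:
# the new column is created, but it stays empty.
df = DataFrame()
df[!, :col] .= 1
size(df)    # zero rows, one column under the behavior described above

# Adding a row instead of a column: push! with a NamedTuple.
df2 = DataFrame()
push!(df2, (a=1,))
size(df2)   # one row, one column
```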


Just for reference. You can assign to a new column in general like:

df = DataFrame(a=[1,2,3])
df[!, :b] .= 1

works OK. The only problematic cases were zero column data frame and an empty data frame with more than zero columns (they are currently handled in the same way in the PR I have mentioned).

Thanks for the explanation. In your example, my naive expectation would be that df[!,:col] .= 1 creates a column of Int64[] called col with a single element of 1:

df
1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

It is somewhat of an edge case. Although I would expect it to create a new column and assign 1, that may not be consistent with the current behavior, where it would create a column of size R where R is the number of rows. For example:

using DataFrames
df = DataFrame()
R = 10
df[!,:a] = rand( R)
df[!,:b] .= .4

df
10×2 DataFrame
│ Row │ a        │ b       │
│     │ Float64  │ Float64 │
├─────┼──────────┼─────────┤
│ 1   │ 0.859254 │ 0.4     │
│ 2   │ 0.931467 │ 0.4     │
│ 3   │ 0.276189 │ 0.4     │
│ 4   │ 0.695778 │ 0.4     │
│ 5   │ 0.436747 │ 0.4     │
│ 6   │ 0.478671 │ 0.4     │
│ 7   │ 0.220407 │ 0.4     │
│ 8   │ 0.778795 │ 0.4     │
│ 9   │ 0.614729 │ 0.4     │
│ 10  │ 0.825508 │ 0.4     │

I’m not entirely certain what should be done when R = 0 or R is undefined because the data frame is empty. On one hand, it seems odd to create an empty column when a specific value is being supplied, but assigning the value does not seem entirely consistent with the behavior above either. I guess, between the two options, I might choose to have it create the column and assign the value, and use something like df[!,:a] .= Int64[ ] to create an empty column. Of course, that is my somewhat under-informed assessment, and I have not thought about all of the ramifications.

The current behavior is slightly inconvenient because I have to wrap scalars in an array, at least when creating the initial column, e.g. df = DataFrame(); df[!,:a] = [.3]. After that, as you pointed out in your second comment to me, it is possible to assign values subsequently. That’s not a big deal if I always know where the first column will be created. What is more problematic is if v could be a scalar or vector, and the DataFrame is empty. As far as I know, this would require two sets of syntax to handle both cases, whereas before, it handled the assignment of both scalars and vectors.
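One possible workaround for the scalar-or-vector case is a small helper that picks the right form of assignment. This is only a sketch: setcol! is a hypothetical name, not part of DataFrames.jl, and it assumes that plain assignment with df[!, name] = v accepts a vector when the data frame has no columns yet.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): add column `name`
# holding `v`, where `v` may be either a scalar or a vector.
function setcol!(df::DataFrame, name::Symbol, v)
    if v isa AbstractVector
        df[!, name] = v        # vector: plain assignment
    elseif ncol(df) == 0
        df[!, name] = [v]      # scalar, no columns yet: wrap it in a 1-element vector
    else
        df[!, name] .= v       # scalar, existing rows: broadcast over them
    end
    return df
end

df = DataFrame()
setcol!(df, :a, 0.3)    # first column, scalar
setcol!(df, :b, 2)      # later column, scalar

df2 = DataFrame()
setcol!(df2, :a, [1, 2])    # first column, vector
```

The helper treats a scalar assigned to a data frame with no columns as a single row, which is the expectation described above rather than what the package does by default.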

I agree with you. I’m not sure how often this problem arises in actual code. I have encountered it from time to time in my work. This occurs when I need to extract data from various sources and integrate it with some container iteratively and flexibly (i.e. I may not know what the columns will be or what their sizes will ultimately be).

P.S. I couldn’t adapt your example with push to my problem:

Case 1: scalar

df = DataFrame()
a = 1
push!(df,(a=a,))
df
1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

Case 2: vector

df = DataFrame()
a= [1,2]
push!(df,(a=a,))
df
1×1 DataFrame
│ Row │ a      │
│     │ Array… │
├─────┼────────┤
│ 1   │ [1, 2] │

I tried various placements of the "..." operator, but it would not accommodate both cases.

We could treat DataFrame() as having undefined number of rows (as opposed to 0 rows).
Thinking about it we then should also allow:

df = DataFrame()
df[!, :c] .= [1,2,3]

(and create a 3 element vector - now it throws an error; I will update the PR I have mentioned above and we can discuss it further there)

PS. push! is only for scalars
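To illustrate the distinction: push! adds exactly one row, so a vector passed to it becomes a single cell, while append! adds one row per element. A minimal sketch, assuming a DataFrames.jl release in which append! accepts a data frame with no columns as its target (this is an assumption and is not confirmed in this thread for 0.19):

```julia
using DataFrames

# Scalar: push! adds one row.
df1 = DataFrame()
push!(df1, (a=1,))

# Vector: append! adds one row per element.
df2 = DataFrame()
append!(df2, DataFrame(a=[1, 2]))
```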


Thanks for your time. I found the discussion helpful for understanding DataFrames.

I have implemented the requested changes in https://github.com/JuliaData/DataFrames.jl/pull/1890. The functionality is tricky so if someone can have a look at it it would be appreciated.


Thank you. I will take a look.

Based on the feedback after the 0.19.0 release, we have a 0.19.1 patch release that introduces the following changes:

  • we drop StatsBase.jl dependency, which improves DataFrames.jl load time
  • push! and append! now make sure they do not produce an output data frame with unequal number of rows in columns (this was a common problem causing hard-to-catch bugs)
  • join, groupby and show-related functions now check if the data frame they work on is not internally corrupted (most often the problem is unequal number of rows in columns); this allows users to track possible column aliasing related bugs in their code easier
  • broadcasting now allows broadcasting into 0-row (empty) data frame and correctly handles broadcasting into a single cell of a data frame.

Thanks to all who reported problems and contributed!

You can see the detailed release notes here.


We have just released a 0.19.2 patch release. The full log of changes can be found here.

The most relevant changes are:

  • disallowmissing, allowmissing and categorical functions were added
  • unstack now accepts a renamecols keyword argument providing a flexible mechanism to generate column names of an unstacked data frame
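
A short sketch of these additions follows. The data and the val_ prefix are made up for illustration, and the call forms are the ones I would expect to work in recent DataFrames.jl releases; details may differ in 0.19.2.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

# allowmissing returns a new data frame whose columns accept missing values.
dfm = allowmissing(df)
eltype(dfm.a)    # Union{Missing, Int64}

# disallowmissing converts back; it throws if a missing value is present.
dfs = disallowmissing(dfm)
eltype(dfs.a)    # Int64

# unstack with renamecols: the function receives each unique value of the
# key column and returns the name of the column to create.
long = DataFrame(id=[1, 1, 2, 2],
                 key=["x", "y", "x", "y"],
                 value=[10, 20, 30, 40])
wide = unstack(long, :id, :key, :value, renamecols = k -> Symbol("val_", k))
names(wide)
```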

I found some weird syntax in DataFrames 0.19.2… am I doing something wrong (i.e. “is there a better way to do it”) or is it supposed to be like this?

Let’s assume the following df:

df = DataFrame(a=[1,2,3],b=[4,5,6],c=[7,8,9])

# Change column order
df = df[[:b, :a, :c]]    # before
df = df[:,[:b, :a, :c]]  # now

# Operate on a column whose name is memorised in a variable
C = :c; eltype(df[C])   # before 
C = :c; eltype(df[!,C]) # now

# Delete cols
deletecols!(df,:b)   # before
select!(df,Not(:b))  # now

What does the new ! symbol stand for? How is df[!,:a] different from df[:,:a]?

! roughly stands for “a direct access to a column stored in a data frame”, as opposed to : which stands for “access to a copy of a column stored in a data frame”.

The detailed rules are laid out here.

The reason for this change is described above in this thread (around post 51), but in short, direct access to a column led to many nasty bugs when using the DataFrames.jl package (people accessed the column, then mutated it, and they were not aware that they had also mutated the contents of the source data frame). The ! symbol adds a bit of verbosity, but is meant to warn that the operation is potentially mutating (so this is similar to the meaning of ! in function names in Base).

Note that instead of df[!, :col] you can just write df.col.
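
The difference is easy to check with ===, which tests whether two names refer to the same object. A minimal sketch (the variable names direct and copied are only for illustration):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

direct = df[!, :a]    # the vector stored in df, no copy
copied = df[:, :a]    # a fresh copy of that vector

direct === df.a       # true: same object as df.a
copied === df.a       # false: a different object

copied[1] = 100       # changes only the copy; df is untouched
direct[1] = -1        # changes df as well
df.a                  # [-1, 2, 3]
```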

Also note that:

  • to change the order of columns you can use the permutecols! function which works in place; if you want to avoid copying of columns and create a new data frame use df[!, [:b, :a, :c]] (this is potentially unsafe though - as explained above)
  • deletecols! was deprecated because select! just does the job with Not indexing, so there would be a duplication of functionality (and select-family of functions is something that many people already know from other data processing ecosystems).

I am writing to say that, in my opinion, the new syntax is nice. The symbol ! inside the selection is a bit strange at first, but after you get used to it, it makes a lot of sense. The other changes are also good in my opinion.


I have a small announcement. The DataFrames Tutorial has been in sync with the DataFrames.jl 0.19 release for some time. But I have just added examples of integration with JSONTables.jl to the 04_loadsave.ipynb notebook.

@quinnj has done a fantastic job with JSONTables.jl and with the release 0.1.2 of this package you can safely read and write JSON data (both row- and column- oriented) to a DataFrame.


FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.


Good point - I will add this (I started doing these tutorials before Project.toml was an option).


If you have a few minutes to spare, @bkamins, could you cast your expert eye over the wiki books chapter to see if I’ve made any howlers? It dates back a long way (2015) but I’m sentimental and don’t want to just delete the whole section and point to your (much better) material. (Besides, I’d have to redo all the links. :scream:)

(Just mention any problems, don’t waste time learning the Wikibooks markup language . :joy:)


Hi All,

After almost half a year from 0.19 release we made it to 0.20. It is really a big release (79 PRs merged since 0.19).

Firstly, I would like to thank all that contributed to it. A chief person to mention is @nalimilan who constantly has been curating the package. The number of contributors in the period was so large that when I thought about listing all who were involved in the transition from 0.19 to 0.20 it was really hard. I have managed to pull out two groups of logins from GitHub (and apologies if I missed someone - I will gladly correct the list; I dropped @ in front of the logins as Discourse disallows me to mention them all directly in this post, as there are too many of them :smile:):

  • people who opened a PR that was merged for 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
  • people who opened an issue that was discussed for 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh

Both lists were impressive to me (I did not expect such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.

Now, as this release is so big I am summarizing here only the major changes from 0.19 to 0.20:

  • some functions (join, groupby and show-related) now check
    if data frames passed to them are internally consistent;
    this should help users in catching bugs early
  • Indexing changes
    • it is now allowed to create new columns using : as row selector
    • broadcasting into an element of a DataFrame is now correct
    • when creating a column using broadcasting the new column always has
      the number of rows equal to number of rows of a data frame before
      the operation
    • broadcasting over GroupedDataFrame is now reserved
    • df[!, cols] is now allowed in setindex! and broadcasting assignment
    • generators are now not allowed as left hand side of assignment operation
      to DataFrameRow
  • describe now shows actual eltype of the column
  • added allowmissing, disallowmissing and categorical functions
    for data frame objects
  • cleaning up code to avoid throwing more specific error types
  • unstack now allows for renaming of the columns
  • Tuple as a value of on keyword argument in join is now deprecated
    (use Pair instead)
  • Between, All and Not are allowed in indexing
  • DataFrameRows and DataFrameColumns now have custom show methods
  • DataFrameRows and DataFrameColumns support getproperty now
  • columnindex from Tables.jl is now exported
  • categorical! now allows types as cols argument
  • we no longer use makeunique=true for grouping keys in combine and map
  • by now has skipmissing keyword argument
  • redesign push!, append! and vcat to make them more consistent
  • mapcols now never reuses source columns without copying
  • significantly improved sort performance
  • disallowmissing! and disallowmissing now accept error keyword argument
  • select and select! now allow passing multiple columns arguments
  • permutecols! is deprecated (use select! instead)
  • fixed a bug in hash handling in row_group_slots
  • fully switched to Travis only CI
  • join now accepts more than two data frames to be joined
  • rename! now allows permutation of column names in renaming
  • names! is deprecated (use rename!)
  • aggregate now accepts skipmissing
  • groupby now accepts cols argument to be an empty vector
  • copycols in DataFrame constructors is now more permissive
    (it does not error when it is not possible not to copy when copycols=false)
  • join now allows mixing Symbol and Pair{Symbol, Symbol} in on keyword argument
  • added flatten function
  • views now disallow duplicate columns
  • describe now has cols keyword argument
  • add conversion to Array for data frame and data frame row
  • rename! and rename now accept strings and integers (apart from Symbols) for renaming
  • io support in describe is deprecated
  • melt is now deprecated (use stack with Not selector instead)
  • stackdf and meltdf are now deprecated (use view=true in stack instead)
  • GroupedDataFrame now supports keys with information on values of grouping columns;
    this is also allowed in GroupedDataFrame indexing (also Tuple or NamedTuple are allowed)
  • switch to strict upper bounds of package compatibility
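
A few of the items above can be sketched briefly. The data below is made up, and the calls are written as I would expect them to work in recent DataFrames.jl releases; the exact output format and some details may differ in 0.20.

```julia
using DataFrames

df = DataFrame(id=[1, 2], x=[10, 20], y=[30, 40], z=[[1, 2], [3]])

# Not and Between as column selectors in indexing
no_z = df[:, Not(:z)]            # every column except z
x_to_y = df[:, Between(:x, :y)]  # columns from x through y

# flatten: one row per element of the vectors stored in column z
flat = flatten(df, :z)

# stack with a Not selector (the suggested replacement for melt)
long = stack(no_z, Not(:id))

# keys of a GroupedDataFrame carry the values of the grouping columns
gdf = groupby(flat, :id)
ks = keys(gdf)
```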

Finally, we are fairly close to 1.0 release. Here is a list of current to-do things for 1.0, so as you can see it is not very long (probably it will grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.

EDIT
The DataFrames Tutorial is now updated and reflects the changes in 0.20.0 release.



Thank you to @bkamins, @quinnj, @nalimilan and other contributors for the continued updates. Over the last few months, updates to DataFrames, CSV and 1.2->1.3 have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.
