Release announcements for DataFrames.jl

FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.
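For readers unfamiliar with the mechanism, here is a minimal sketch of how such a Project + Manifest pair could be created with the standard `Pkg` API. The directory name `repo_dir` is a hypothetical stand-in for the tutorial repository, and `Pkg.add` needs network access.

```julia
using Pkg

# Hypothetical directory standing in for the tutorial repository.
repo_dir = mktempdir()

# Activate a project environment rooted in that directory.
Pkg.activate(repo_dir)

# Adding a package writes Project.toml (direct dependencies) and
# Manifest.toml (exact versions of the whole dependency tree).
Pkg.add("DataFrames")

# Someone who later clones the repository could then recreate the
# recorded environment with:
#   Pkg.activate(repo_dir); Pkg.instantiate()
```

Committing both files alongside the notebooks is what lets others resolve the same package versions.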

Good point - I will add this (I started doing these tutorials before Project.toml was an option).

If you have a few minutes to spare, @bkamins, could you cast your expert eye over the Wikibooks chapter to see if I’ve made any howlers? It dates back a long way (2015), but I’m sentimental and don’t want to just delete the whole section and point to your (much better) material. (Besides, I’d have to redo all the links. :scream:)

(Just mention any problems; don’t waste time learning the Wikibooks markup language. :joy:)

Hi All,

Almost half a year after the 0.19 release, we have made it to 0.20. It is a really big release (79 PRs merged since 0.19).

First, I would like to thank everyone who contributed to it. A chief person to mention is @nalimilan, who has been constantly curating the package. So many people contributed in this period that listing everyone involved in the transition from 0.19 to 0.20 was really hard. I have managed to pull two groups of logins from GitHub (apologies if I missed someone, I will gladly correct the list; I dropped the @ in front of the logins because Discourse does not allow me to mention them all directly in this post, as there are too many of them :smile:):

  • people who opened a PR that was merged for the 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
  • people who opened an issue that was discussed for the 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh

Both lists impressed me (I had not expected such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.

Now, as this release is so big, I am summarizing here only the major changes from 0.19 to 0.20:

  • some functions (join, groupby and show-related) now check
    if data frames passed to them are internally consistent;
    this should help users in catching bugs early
  • Indexing changes
    • it is now allowed to create new columns using : as row selector
    • broadcasting into an element of a DataFrame is now correct
    • when creating a column using broadcasting the new column always has
      the number of rows equal to number of rows of a data frame before
      the operation
    • broadcasting over GroupedDataFrame is now reserved
    • df[!, cols] is now allowed in setindex! and broadcasting assignment
    • generators are now not allowed as the right hand side of an assignment
      operation to DataFrameRow
  • describe now shows actual eltype of the column
  • added allowmissing, disallowmissing and categorical functions
    for data frame objects
  • cleaning up code to avoid throwing more specific error types
  • unstack now allows for renaming of the columns
  • Tuple as a value of on keyword argument in join is now deprecated
    (use Pair instead)
  • Between, All and Not are allowed in indexing
  • DataFrameRows and DataFrameColumns now have custom show methods
  • DataFrameRows and DataFrameColumns support getproperty now
  • columnindex from Tables.jl is now exported
  • categorical! now allows types as cols argument
  • we no longer use makeunique=true for grouping keys in combine and map
  • by now has skipmissing keyword argument
  • redesign push!, append! and vcat to make them more consistent
  • mapcols now never reuses source columns without copying
  • significantly improved sort performance
  • disallowmissing! and disallowmissing now accept error keyword argument
  • select and select! now allow passing multiple columns arguments
  • permutecols! is deprecated (use select! instead)
  • fixed a bug in hash handling in row_group_slots
  • fully switched to Travis only CI
  • join now accepts more than two data frames to be joined
  • rename! now allows permutation of column names in renaming
  • names! is deprecated (use rename!)
  • aggregate now accepts skipmissing
  • groupby now accepts cols argument to be an empty vector
  • copycols in DataFrame constructors is now more permissive
    (it no longer errors when copycols=false is passed but copying cannot be avoided)
  • join now allows mixing Symbol and Pair{Symbol, Symbol} in on keyword argument
  • added flatten function
  • views now disallow duplicate columns
  • describe now has cols keyword argument
  • add conversion to Array for data frame and data frame row
  • rename! and rename now accept strings and integers (apart from Symbols) for renaming
  • io support in describe is deprecated
  • melt is now deprecated (use stack with Not selector instead)
  • stackdf and meltdf are now deprecated (use view=true in stack instead)
  • GroupedDataFrame now supports keys with information on values of grouping columns;
    this is also allowed in GroupedDataFrame indexing (also Tuple or NamedTuple are allowed)
  • switch to strict upper bounds of package compatibility
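
A few of the items above are easier to see in code. The following is a minimal sketch with made-up data (all table and column names are illustrative), written against the public DataFrames.jl API; details such as the column order returned by `stack` may differ between versions.

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3], a = [10, 20, 30], b = ["x", "y", "z"])

# New columns: `!` stores the vector as-is, `:` with a new name copies it.
df[!, :c] = [1.0, 2.0, 3.0]
df[:, :d] = [true, false, true]

# Broadcasting a scalar creates a column with nrow(df) rows.
df[!, :e] .= 0

# Not and Between selectors in indexing.
without_b = df[:, Not(:b)]
a_to_c = df[:, Between(:a, :c)]

# select with several column arguments.
small = select(df, :id, :e)

# rename! accepts a permutation of names (here :a and :c are swapped).
swapped = rename!(copy(df), :a => :c, :c => :a)

# flatten expands a column of collections into one row per element.
nested = DataFrame(id = [1, 2], vals = [[1, 2], [3]])
flat = flatten(nested, :vals)

# stack with a Not selector (the suggested replacement for melt).
wide = DataFrame(id = [1, 2], x = [1.0, 2.0], y = [3.0, 4.0])
long = stack(wide, Not(:id))

# allowmissing / disallowmissing on a whole data frame.
with_missing = allowmissing(wide)
strict = disallowmissing(with_missing)

# keys of a GroupedDataFrame, and indexing by key, NamedTuple or Tuple.
gdf = groupby(DataFrame(g = [1, 1, 2], v = [1, 2, 3]), :g)
first_group = gdf[first(keys(gdf))]
second_group = gdf[(g = 2,)]
same_group = gdf[(2,)]
```

The deprecated functions mentioned in the list (`melt`, `names!`, `permutecols!` and so on) are deliberately left out of the sketch.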

Finally, we are fairly close to the 1.0 release. Here is a list of the current to-do items for 1.0; as you can see, it is not very long (it will probably grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.

EDIT
The DataFrames Tutorial is now updated and reflects the changes in the 0.20.0 release.

Thank you to @bkamins, @quinnj, @nalimilan and the other contributors for the continued updates. Over the last few months, updates to DataFrames and CSV, together with the move from Julia 1.2 to 1.3, have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.

Whoa… eye-popping improvement! I compared pandas and DataFrames/CSV just now. Reading a 1 GB CSV file took pandas 13 seconds, while DataFrames/CSV took only 1.8 seconds. Many thanks to the contributors. You are geniuses.
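
For anyone who wants to try a similar measurement, here is a minimal timing sketch. It is not the benchmark quoted above: the file is a small, randomly generated stand-in written to a temporary directory, and the names `path` and `elapsed` are illustrative.

```julia
using CSV, DataFrames

# Hypothetical stand-in data written to a temporary file.
path = joinpath(mktempdir(), "sample.csv")
CSV.write(path, DataFrame(a = 1:100_000, b = rand(100_000)))

# The first read includes compilation time, so run it once untimed.
CSV.File(path) |> DataFrame

# Time the second read.
t0 = time()
df = CSV.File(path) |> DataFrame
elapsed = time() - t0

println("read ", nrow(df), " rows in ", elapsed, " s")
```

Timings depend heavily on the machine, the file contents and the package versions, so numbers from a sketch like this are only comparable to each other.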
