Release announcements for DataFrames.jl

FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.

3 Likes

Good point - I will add this (I started doing these tutorials before Project.toml was an option).

2 Likes

If you have a few minutes to spare, @bkamins, could you cast your expert eye over the Wikibooks chapter to see if I’ve made any howlers? It dates back a long way (2015) but I’m sentimental and don’t want to just delete the whole section and point to your (much better) material. (Besides, I’d have to redo all the links. :scream:)

(Just mention any problems, don’t waste time learning the Wikibooks markup language. :joy:)

3 Likes

Hi All,

Almost half a year after the 0.19 release, we have made it to 0.20. It is a really big release (79 PRs merged since 0.19).

Firstly, I would like to thank everyone who contributed to it. The chief person to mention is @nalimilan, who has been constantly curating the package. The number of contributors in this period was so large that listing everyone involved in the transition from 0.19 to 0.20 turned out to be really hard. I have managed to pull out two groups of logins from GitHub (apologies if I missed someone - I will gladly correct the list; I dropped the @ in front of the logins because Discourse does not let me mention so many users directly in one post :smile:):

  • people who opened a PR that was merged for 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
  • people who opened an issue that was discussed for 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh

Both lists were impressive to me (I had not expected such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.

Now, as this release is so big I am summarizing here only the major changes from 0.19 to 0.20:

  • some functions (join, groupby and show-related) now check
    if data frames passed to them are internally consistent;
    this should help users in catching bugs early
  • Indexing changes
    • it is now allowed to create new columns using : as row selector
    • broadcasting into an element of a DataFrame is now correct
    • when creating a column using broadcasting the new column always has
      the number of rows equal to number of rows of a data frame before
      the operation
    • broadcasting over GroupedDataFrame is now reserved
    • df[!, cols] is now allowed in setindex! and broadcasting assignment
    • generators are now not allowed as left hand side of assignment operation
      to DataFrameRow
  • describe now shows actual eltype of the column
  • added allowmissing, disallowmissing and categorical functions
    for data frame objects
  • cleaning up code to avoid throwing more specific error types
  • unstack now allows for renaming of the columns
  • Tuple as a value of on keyword argument in join is now deprecated
    (use Pair instead)
  • Between, All and Not are allowed in indexing
  • DataFrameRows and DataFrameColumns now have custom show methods
  • DataFrameRows and DataFrameColumns support getproperty now
  • columnindex from Tables.jl is now exported
  • categorical! now allows types as cols argument
  • we no longer use makeunique=true for grouping keys in combine and map
  • by now has skipmissing keyword argument
  • redesign push!, append! and vcat to make them more consistent
  • mapcols now never reuses source columns without copying
  • significantly improved sort performance
  • disallowmissing! and disallowmissing now accept error keyword argument
  • select and select! now allow passing multiple columns arguments
  • permutecols! is deprecated (use select! instead)
  • fixed a bug in hash handling in row_group_slots
  • fully switched to Travis only CI
  • join now accepts more than two data frames to be joined
  • rename! now allows permutation of column names in renaming
  • names! is deprecated (use rename!)
  • aggregate now accepts skipmissing
  • groupby now accepts cols argument to be an empty vector
  • copycols in DataFrame constructors is now more permissive
    (it does not error when copying cannot be avoided even though copycols=false)
  • join now allows mixing Symbol and Pair{Symbol, Symbol} in on keyword argument
  • added flatten function
  • views now disallow duplicate columns
  • describe now has cols keyword argument
  • add conversion to Array for data frame and data frame row
  • rename! and rename now accept strings and integers (apart from Symbols) for renaming
  • io support in describe is deprecated
  • melt is now deprecated (use stack with Not selector instead)
  • stackdf and meltdf are now deprecated (use view=true in stack instead)
  • GroupedDataFrame now supports keys with information on values of grouping columns;
    this is also allowed in GroupedDataFrame indexing (also Tuple or NamedTuple are allowed)
  • switch to strict upper bounds of package compatibility
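
A few of the items above can be sketched as follows. This is only a minimal illustration, assuming a recent DataFrames.jl is installed; the data frames and column names (`id`, `a`, `b`, `c`, `vals`, `g`, `v`) are made up for the example and are not taken from the release notes.

```julia
using DataFrames

# Made-up example data (hypothetical column names).
df = DataFrame(id = 1:3, a = [10, 20, 30], b = [1.5, 2.5, 3.5], c = ["x", "y", "z"])

# Column selectors: Not drops columns, Between picks a contiguous range.
without_id = df[:, Not(:id)]
a_to_b = df[:, Between(:a, :b)]

# select accepts several column arguments at once.
picked = select(df, :id, :c)

# flatten expands a column holding collections into one row per element.
nested = DataFrame(id = [1, 2], vals = [[1, 2], [3]])
flat = flatten(nested, :vals)

# keys on a GroupedDataFrame exposes the values of the grouping columns,
# and a NamedTuple (or Tuple) of those values can be used to index a group.
gdf = groupby(DataFrame(g = ["p", "q", "p"], v = [1, 2, 3]), :g)
group_keys = keys(gdf)
group_p = gdf[(g = "p",)]
```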

Finally, we are fairly close to 1.0 release. Here is a list of current to-do things for 1.0, so as you can see it is not very long (probably it will grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.

EDIT
The DataFrames Tutorial is now updated and reflects the changes in 0.20.0 release.

45 Likes

6 posts were split to a new topic: DataFrame join error

Thank you to @bkamins, @quinnj, @nalimilan and other contributors for the continued updates. Over the last few months, updates to DataFrames, CSV and 1.2->1.3 have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.

24 Likes

Whoa… eye-popping improvement! I compared pandas and DataFrames/CSV just now. For reading a 1GB csv file, pandas took 13 seconds while DataFrames/CSV took only 1.8 seconds. Many thanks to the contributors. You are geniuses.

13 Likes

We have DataFrames.jl release 0.21. This is a very big release with 102 PRs merged. Thanks to all who worked on it (the issues and the PRs). Due to the large number of contributors I list here only the people who opened a merged PR since the 0.20 release: anandijain, DilumAluthge, jlumpe, jonas-schulze, nalimilan, nickeubank, non-Jedi, omus, oxinabox, pdeffebach, pearlzli, prosoitos, quinnj, ssikdar1, tkf, vonDonnerstein (I had to remove @ as Discourse disallows mentioning so many users in a single post :smile:).

The detailed release notes (with all issues and PRs closed) are here: Release v0.21.0 · JuliaData/DataFrames.jl · GitHub.

Here are the main highlights:

Breaking:

  • complete redesign of select, select!, transform, transform! and combine (now we roughly match dplyr functionality in a single consistent system; the list of changes is too long to give here - please read the docstrings of select and combine)
  • deprecate by, map and aggregate
  • deprecate join in favor of innerjoin, outerjoin, etc.
  • columns can be indexed using strings, all functions are updated accordingly
  • all types consistently support names which produces Vector{String} and propertynames which produces Vector{Symbol}
  • Tables.rows iterates DataFrameRows to avoid compilation for very wide tables
  • remove lastindex without a dimension
  • deprecate names=true in eachcol
  • change ArgumentError to DimensionMismatch in several methods (where it was more suitable)
  • give ErrorException when trying to iterate AbstractDataFrame
  • change ⍰ to ? when showing a DataFrame and type display improvements
  • make id_vars go first in stack
  • add groupcols and valuecols functions; deprecate groupvars
  • deprecate passing tuple of columns to sort
  • rename deleterows! to delete!
  • change eltype of NamedTuple from DataFrameRow
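
As a rough illustration of the redesigned `select`/`transform`/`combine` system and the dedicated join functions, here is a minimal sketch. It assumes DataFrames.jl 0.21 or later; the data and the column names (`g`, `v`, `v_doubled`, `v_sum`, `label`) are invented for the example.

```julia
using DataFrames

# Made-up example data (hypothetical column names).
df = DataFrame(g = ["p", "q", "p"], v = [1, 2, 3])

# select keeps only the listed or newly computed columns;
# ByRow applies the function to each element rather than to the whole column.
only_v = select(df, :v, :v => ByRow(x -> 2x) => :v_doubled)

# transform works like select but also keeps every existing column.
with_doubled = transform(df, :v => ByRow(x -> 2x) => :v_doubled)

# combine over a GroupedDataFrame covers the use case of the deprecated by.
totals = combine(groupby(df, :g), :v => sum => :v_sum)

# innerjoin (and outerjoin, leftjoin, ...) replace the generic join.
labels = DataFrame(g = ["p", "q"], label = ["first", "second"])
joined = innerjoin(df, labels, on = :g)
```

The `source => function => target` form is the same in all three functions, which is what makes the system consistent.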

New features:

  • allow :union as cols kwarg in push! and append!; also allow autopromotion of column eltypes
  • DataFrameRows and DataFrameColumns support Tables.jl interface
  • names allows column selector as a second positional argument
  • variable_eltype kwarg added to stack
  • improve performance of unstack
  • add convert and merge to DataFrameRow
  • define summary for GroupedDataFrame
  • returning an empty table in combine drops a group
  • insertcols! now allows passing multiple columns
  • improve indexing of GroupedDataFrame with keys; make such lookup fast (in consequence DataFrames.jl now provides a fast lookup!)
  • define consistent rules of pseudo-broadcasting in DataFrames.jl (in particular unwrap Ref and 0-dimensional arrays)
  • re-export Tables.jl
  • allow Pair argument in filter and filter!
  • improve flatten
  • add haskey to GroupedDataFrame and GroupKey
  • add eltypes kwarg to show
  • add mapcols! and repeat!, fix corner cases of repeat
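
Some of the new features can be tried out as below. This is a sketch only, assuming DataFrames.jl 0.21 or later; the column names and values are hypothetical.

```julia
using DataFrames

# Made-up example data (hypothetical column names).
df = DataFrame(a = [1, 2], b = ["x", "y"])

# cols = :union adds columns that are missing on either side
# and fills the gaps with missing, promoting column eltypes as needed.
push!(df, (a = 3, c = 1.5); cols = :union)

# filter accepts a column => predicate pair, so the predicate
# receives the value of that column for each row.
big = filter(:a => x -> x > 1, df)

# insertcols! can add several columns in one call (here at position 1).
insertcols!(big, 1, :d => [true, false], :e => ["m", "n"])

# GroupedDataFrame supports haskey and lookup by the grouping values.
gdf = groupby(df, :a)
has_two = haskey(gdf, (a = 2,))
group_two = gdf[(a = 2,)]

# mapcols! transforms every column in place.
nums = DataFrame(x = [1, 2], y = [3, 4])
mapcols!(col -> col .* 10, nums)
```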

Bugfixes:

  • fix grouped maximum, minimum, var and std with only missing values
  • fix combine when different functions return groups of different lengths
  • fix combine when DataFrameRow was returned
  • fix the groups field values when GroupedDataFrame is returned by combine (previously map)
  • respect IOContext of io when printing
  • fix eltype in stack with view=true
  • fix circular ref bug in show; improve showing of special types

Other:

  • many documentation improvements
  • improve organization of codebase
  • fix BoundsError messages
  • remove readtable and writetable from deprecated
  • update up to Julia 1.5 nightly

The plans for the future are the following. Ideally the next release is 1.0 and we do not include any breaking changes (the reality might turn out to be different though).

The key objectives between the 0.21 release and the 1.0 release are:

  • documentation improvements
  • decouple DataFramesBase.jl as a lightweight low-level API package
  • adding requested non-breaking functionality
  • find as many bugs as possible before 1.0 release

If this goes as planned, we will make the 1.0 release in 3 to 6 months from now (depending on how things progress and on user feedback).

I will also update https://github.com/bkamins/Julia-DataFrames-Tutorial soon (we need other packages to sync with DataFrames.jl release 0.21 before this). I will post when this is done.

40 Likes

This will probably break my code, but I’m glad things are moving towards 1.0, as far as I know. Thanks everyone for the hard work you put into DataFrames.

1 Like

I tried to add deprecations wherever possible. But for example names now returns Strings which is a hard breakage. Still - we now allow strings for column indexing so hopefully in most cases it should “just work”.
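
To make the names change concrete, here is a small sketch of where old code may be affected and where it keeps working. It assumes DataFrames.jl 0.21 or later; the data frame and the `col_` prefix are made up for illustration.

```julia
using DataFrames

# Made-up example data (hypothetical column names).
df = DataFrame(a = 1:3, b = 4:6)

# names yields strings, so membership tests written against symbols no longer match.
symbol_in_names = :a in names(df)          # false
string_in_names = "a" in names(df)         # true
symbol_in_props = :a in propertynames(df)  # true

# Both spellings select the same column, which is why most indexing code keeps working.
same_column = df[!, "a"] === df[!, :a]

# String names compose with ordinary string operations.
renamed = rename(df, [n => "col_" * n for n in names(df)])
```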

1 Like

Same… It’s a bit of an inconvenient time for me, but it was never going to be otherwise. This is huge! Thanks to everyone who worked on this! I’m off to the docs to figure out all the ways I need to change my habits - looks like a bunch of my common patterns are deprecated! :sweat_smile:

1 Like

This was a hard decision, but we had to make the changes at some point - the objective was to make it in one-shot so that people need to update their code now, and hopefully nothing major will change in the near future.

2 Likes

I know - I watched some of the issues where breakages were being proposed. I really appreciate all of the thoughtfulness that went into the decisions, and in the end, I think that the current pain will be transient, and the benefits long-lasting.

4 Likes

Thanks! Looks like a great release!

Looks like there are some packages such as ODBC that still aren’t compatible but don’t restrict the DataFrames version or force a downgrade. I set up an environment to try this release out but hit a roadblock because of this deprecation:

┌ Warning: `T` is deprecated, use `nonmissingtype` instead.
│   caller = (::DataStreams.Data.var"#7#8")(::Type{T} where T) at DataStreams.jl:68
└ @ DataStreams.Data C:\Users\jsutherland\.julia\packages\DataStreams\mEqAy\src\DataStreams.jl:68

I’ll keep trying going forward.

I think it is best if you open issues in packages you see that have a problem with the new release. You can CC me in these issues so that I can have a look at it. Thank you!

1 Like

I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial with the new functionality. There are still some external packages that need updating (chiefly DataFramesMeta.jl) but I think it is good to have a look now if you are interested. If you find any bugs/things that could be improved please open a PR.

12 Likes

Thanks a lot for the tutorial! Maybe I missed it in the tutorial but I think applying a transformation to each column is an interesting case as well. Something like:

using DataFrames
import Random
Random.seed!(1);
df = DataFrame(id = 'a':'e', a = 1:5, b = 6:10, c = 11:15, d = rand(5), e = -rand(5).+0.5)

to_transform = map(x -> eltype(x) <: Number && all(x.>0), eachcol(df))

out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> Symbol.("log_",names(df)[to_transform]))

Edit:

Just saw we can index with strings now

out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> "log_" .* names(df)[to_transform])
2 Likes

wow. that’s really excellent. thanks for putting the time and effort to create the tutorial.

Great release, especially the data shaping overhaul.

Edit- Deleted inquiry about id_vars going first in stack- realized you were talking about column order, not argument order.

1 Like