Release announcements for DataFrames.jl

kristoffer.carlsson · August 16, 2019, 7:23am

FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.
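
For readers who have not used project environments yet, here is a minimal sketch of how a checked-in Project.toml and Manifest.toml pair is typically used. The helper name `use_pinned_environment` and its `dir` argument are hypothetical and only there for illustration; `Pkg.activate` and `Pkg.instantiate` are the standard Pkg calls.

```julia
using Pkg

# Hypothetical helper: `dir` is assumed to be a folder (for example a clone of
# the tutorial repository) that contains both Project.toml and Manifest.toml.
function use_pinned_environment(dir::AbstractString)
    Pkg.activate(dir)    # make the project in `dir` the active environment
    Pkg.instantiate()    # install the exact versions recorded in Manifest.toml
end
```

Calling `use_pinned_environment(".")` from the repository root before opening the notebooks should then give the same package versions that were used when they were written.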

bkamins · August 16, 2019, 8:13am

Good point - I will add this (I started doing these tutorials before Project.toml was an option).

cormullion · August 16, 2019, 9:11am

If you have a few minutes to spare, @bkamins, could you cast your expert eye over the wiki books chapter to see if I’ve made any howlers? It dates back a long way (2015) but I’m sentimental and don’t want to just delete the whole section and point to your (much better) material. (Besides, I’d have to redo all the links.)

(Just mention any problems, don’t waste time learning the Wikibooks markup language.)

bkamins · December 8, 2019, 6:20am

Hi All,

Almost half a year after the 0.19 release, we have made it to 0.20. It is a really big release (79 PRs merged since 0.19).

Firstly, I would like to thank everyone who contributed to it. A chief person to mention is @nalimilan, who has constantly been curating the package. The number of contributors in this period was so large that listing everyone involved in the transition from 0.19 to 0.20 was really hard. I have managed to pull out two groups of logins from GitHub (apologies if I missed someone - I will gladly correct the list; I dropped the @ in front of the logins because Discourse does not let me mention them all directly in this post, as there are too many of them):

people who opened a PR that was merged for 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
people who opened an issue that was discussed for 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh

Both lists impressed me (I had not expected such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.

Now, as this release is so big I am summarizing here only the major changes from 0.19 to 0.20:

some functions (join, groupby and show-related) now check if data frames passed to them are internally consistent; this should help users in catching bugs early
Indexing changes:
- it is now allowed to create new columns using : as row selector
- broadcasting into an element of a DataFrame is now correct
- when creating a column using broadcasting the new column always has the number of rows equal to the number of rows of the data frame before the operation
- broadcasting over GroupedDataFrame is now reserved
- df[!, cols] is now allowed in setindex! and broadcasting assignment
- generators are now not allowed as left hand side of assignment operation to DataFrameRow
describe now shows actual eltype of the column
added allowmissing, disallowmissing and categorical functions for data frame objects
cleaning up code to avoid throwing more specific error types
unstack now allows for renaming of the columns
Tuple as a value of on keyword argument in join is now deprecated (use Pair instead)
Between, All and Not are allowed in indexing
DataFrameRows and DataFrameColumns now have custom show methods
DataFrameRows and DataFrameColumns support getproperty now
columnindex from Tables.jl is now exported
categorical! now allows types as cols argument
we no longer use makeunique=true for grouping keys in combine and map
by now has skipmissing keyword argument
redesign push!, append! and vcat to make them more consistent
mapcols now never reuses source columns without copying
significantly improved sort performance
disallowmissing! and disallowmissing now accept error keyword argument
select and select! now allow passing multiple columns arguments
permutecols! is deprecated (use select! instead)
fixed a bug in hash handling in row_group_slots
fully switched to Travis only CI
join now accepts more than two data frames to be joined
rename! now allows permutation of column names in renaming
names! is deprecated (use rename!)
aggregate now accepts skipmissing
groupby now accepts cols argument to be an empty vector
copycols in DataFrame constructors is now more permissive (it does not error when it is not possible not to copy when copycols=false)
join now allows mixing Symbol and Pair{Symbol, Symbol} in on keyword argument
added flatten function
views now disallow duplicate columns
describe now has cols keyword argument
add conversion to Array for data frame and data frame row
rename! and rename now accept strings and integers (apart from Symbols) for renaming
io support in describe is deprecated
melt is now deprecated (use stack with Not selector instead)
stackdf and meltdf are now deprecated (use view=true in stack instead)
GroupedDataFrame now supports keys with information on values of grouping columns; this is also allowed in GroupedDataFrame indexing (also Tuple or NamedTuple are allowed)
switch to strict upper bounds of package compatibility
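
A few of the 0.20 items above are easier to see in code. The following is only a small sketch on a made-up data frame (the column names `id`, `x`, `y`, `z` and `w` are assumptions for illustration); it touches the `:` row selector for new columns, `df[!, col]` in broadcasting assignment, the `Not` and `Between` selectors, `flatten`, and `keys` on a GroupedDataFrame.

```julia
using DataFrames

# Made-up example data; `y` holds vectors so that flatten has something to expand
df = DataFrame(id = [1, 2, 3], x = [10, 20, 30], y = [[1, 2], [3], [4, 5]])

# Create a new column using `:` as the row selector
df[:, :z] = [0.1, 0.2, 0.3]

# `df[!, col]` in broadcasting assignment creates (or replaces) a column
df[!, :w] .= 1

# Column selectors: everything except `y`, and the contiguous range id..x
no_y = df[:, Not(:y)]
id_to_x = df[:, Between(:id, :x)]

# flatten turns every element of the vector-valued column `y` into its own row
flat = flatten(df, :y)

# keys of a GroupedDataFrame carry the values of the grouping columns,
# and a NamedTuple (or Tuple) can be used to look up a group
gdf = groupby(flat, :id)
ks = keys(gdf)
group_two = gdf[(id = 2,)]
```

Here `ks[1].id` gives the grouping value of the first group, and `group_two` is the sub data frame holding the rows with `id == 2`.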

Finally, we are fairly close to 1.0 release. Here is a list of current to-do things for 1.0, so as you can see it is not very long (probably it will grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.

EDIT
The DataFrames Tutorial is now updated and reflects the changes in 0.20.0 release.

nalimilan · December 11, 2019, 9:42am

6 posts were split to a new topic: DataFrame join error

clinton · December 11, 2019, 8:20pm

Thank you to @bkamins, @quinnj, @nalimilan and other contributors for the continued updates. Over the last few months, updates to DataFrames and CSV, together with the move from Julia 1.2 to 1.3, have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.

Sijun · December 17, 2019, 6:00am

Whoa… eye-popping improvement! I compared pandas and DataFrames/CSV just now. Reading a 1GB csv file took pandas 13 seconds, while DataFrames/CSV took only 1.8 seconds. Many thanks to the contributors. You are geniuses.

bkamins · May 5, 2020, 4:16pm

We have DataFrames.jl release 0.21. This is a very big release with 102 PRs merged. Thanks to all who worked on it (the issues and the PRs). Due to the large number of contributors I list here only the people who opened a merged PR since the 0.20 release: anandijain, DilumAluthge, jlumpe, jonas-schulze, nalimilan, nickeubank, non-Jedi, omus, oxinabox, pdeffebach, pearlzli, prosoitos, quinnj, ssikdar1, tkf, vonDonnerstein (I had to remove the @ as Discourse disallows mentioning so many users in a single post).

The detailed release notes (with all issues and PRs closed) are here: Release v0.21.0 · JuliaData/DataFrames.jl · GitHub.

Here are the main highlights:

Breaking:

complete redesign of select, select!, transform, transform! and combine (now we roughly match dplyr functionality in a single consistent system; the list of changes is too long to list them here - please read the docstrings of select and combine)
deprecate by, map and aggregate
deprecate join in favor of innerjoin, outerjoin, etc.
columns can be indexed using strings, all functions are updated accordingly
all types consistently support names which produces Vector{String} and propertynames which produces Vector{Symbol}
Tables.rows iterates DataFrameRows to avoid compilation for very wide tables
remove lastindex without a dimension
deprecate names=true in eachcol
change ArgumentError to DimensionMismatch in several methods (where it was more suitable)
give ErrorException when trying to iterate AbstractDataFrame
change ⍰ to ? when showing a DataFrame and type display improvements
make id_vars go first in stack
add groupcols and valuecols functions; deprecate groupvars
deprecate passing tuple of columns to sort
rename deleterows! to delete!
change eltype of NamedTuple from DataFrameRow
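
To make the breaking changes above a bit more concrete, here is a small sketch of the new style on made-up data (the tables `people` and `teams` and their columns are assumptions for illustration, not taken from the release notes). It shows the `source => function => target` form used by `select`, `transform` and `combine`, `combine(groupby(...))` in place of the deprecated `by`, and `innerjoin` in place of `join`.

```julia
using DataFrames
using Statistics: mean

# Made-up example tables
people = DataFrame(id = [1, 2, 3], team = ["a", "b", "a"], score = [10.0, 20.0, 30.0])
teams = DataFrame(team = ["a", "b"], city = ["Oslo", "Rome"])

# transform keeps all existing columns and adds the new one;
# ByRow applies the function to each element instead of the whole column
doubled = transform(people, :score => ByRow(x -> 2x) => :double_score)

# select keeps only the requested columns
only_scores = select(people, :id, :score)

# combine on a GroupedDataFrame covers what `by` and `aggregate` used to do
per_team = combine(groupby(people, :team), :score => mean => :mean_score)

# innerjoin replaces the deprecated generic `join`
joined = innerjoin(people, teams, on = :team)
```

The docstrings of `select` and `combine` remain the authoritative description; this only covers the simplest case of each function.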

New features:

allow :union as cols kwarg in push! and append!; also allow autopromotion of column eltypes
DataFrameRows and DataFrameColumns support Tables.jl interface
names allows column selector as a second positional argument
variable_eltype kwarg added to stack
improve performance of unstack
add convert and merge to DataFrameRow
define summary for GroupedDataFrame
returning an empty table in combine drops a group
insertcols! now allows passing multiple columns
improve indexing of GroupedDataFrame with keys; make such lookup fast (in consequence DataFrames.jl now provides a fast lookup!)
define consistent rules of pseudo-broadcasting in DataFrames.jl (in particular unwrap Ref and 0-dimensional arrays)
re-export Tables.jl
allow Pair argument in filter and filter!
improve flatten
add haskey to GroupedDataFrame and GroupKey
add eltypes kwarg to show
add mapcols! and repeat!, fix corner cases of repeat
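
Some of the new features listed above can be tried out as follows. This is a rough sketch on a made-up data frame (all column names are assumptions for illustration), covering `cols = :union` in `push!`, a `Pair` argument in `filter`, and passing several columns to `insertcols!`.

```julia
using DataFrames

# Made-up example data
df = DataFrame(a = [1, 2], b = ["x", "y"])

# cols = :union accepts a row with a different set of columns:
# the missing column `b` gets `missing`, the new column `c` is added
# (with `missing` in the earlier rows) and eltypes are promoted as needed
push!(df, (a = 3, c = 1.5); cols = :union)

# filter with a `column => predicate` Pair
big = filter(:a => x -> x > 1, df)

# insertcols! with several columns at once, inserted starting at position 1
insertcols!(df, 1, :d => [10, 20, 30], :e => [4, 5, 6])
```

After these calls `df` has the columns `d`, `e`, `a`, `b` and `c`, and `big` holds the two rows where `a` is greater than 1.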

Bugfixes:

fix grouped maximum, minimum, var and std with only missing values
fix combine when different functions return groups of different lengths
fix combine when DataFrameRow was returned
fix the groups field values when GroupedDataFrame is returned by combine (previously map)
respect IOContext of io when printing
fix eltype in stack with view=true
fix circular ref bug in show; improve showing of special types

Other:

many documentation improvements
improve organization of codebase
fix BoundsError messages
remove readtable and writetable from deprecated
update up to Julia 1.5 nightly

The plans for the future are the following. Ideally the next release is 1.0 and we do not include any breaking changes (the reality might turn out to be different though).

The key objectives between the 0.21 release and the 1.0 release are:

documentation improvements
decouple DataFramesBase.jl as a lightweight low-level API package
adding requested non-breaking functionality
find as many bugs as possible before 1.0 release

If this goes as planned we shall make the 1.0 release in 3 to 6 months from now (depending on how things progress and on user feedback).

I will also update https://github.com/bkamins/Julia-DataFrames-Tutorial soon (we need other packages to sync with DataFrames.jl release 0.21 before this). I will post when this is done.

alejandromerchan · May 5, 2020, 4:21pm

This will probably break my code, but I’m glad things are moving towards 1.0, as far as I know. Thanks everyone for the hard work you put into DataFrames.

bkamins · May 5, 2020, 4:31pm

I tried to add deprecations wherever possible. But for example names now returns Strings which is a hard breakage. Still - we now allow strings for column indexing so hopefully in most cases it should “just work”.
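
As a small illustration of this particular breakage, here is a sketch on a made-up data frame (the column names are assumptions for illustration). It assumes DataFrames.jl 0.21 or later, where `names` returns strings and `propertynames` returns symbols.

```julia
using DataFrames

# Made-up example data
df = DataFrame(a = 1:3, b = 4:6)

col_strings = names(df)          # Vector{String} from 0.21 on
col_symbols = propertynames(df)  # Vector{Symbol}, for code that needs Symbols

# String and Symbol indexing select the same column
same_column = df[:, "a"] == df[:, :a]

# Old code that relied on Symbols can convert explicitly
converted = Symbol.(names(df))
```

So code that only passes the result of `names` back into indexing should keep working, while code that compares the names against Symbols needs `propertynames` or an explicit conversion.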

kevbonham · May 5, 2020, 4:43pm

Same… It’s a bit of an inconvenient time for me, but it was never going to be otherwise. This is huge! Thanks to everyone that worked on this! I’m off to the docs to figure out all the ways I need to change my habits - looks like a bunch of my common patterns are deprecated!

bkamins · May 5, 2020, 4:48pm

This was a hard decision, but we had to make the changes at some point - the objective was to make it in one-shot so that people need to update their code now, and hopefully nothing major will change in the near future.

kevbonham · May 5, 2020, 4:53pm

I know - I watched some of the issues where breakages were being proposed. I really appreciate all of the thoughtfulness that went into the decisions, and in the end, I think that the current pain will be transient, and the benefits long-lasting.

matthieu · May 5, 2020, 7:19pm

Thanks! Looks like a great release!

js135005 · May 5, 2020, 7:35pm

Looks like there are some packages such as ODBC that still aren’t compatible but don’t restrict the DataFrames version or force a downgrade. I set up an environment to try this release out but hit a roadblock because of this deprecation:

┌ Warning: `T` is deprecated, use `nonmissingtype` instead.
│   caller = (::DataStreams.Data.var"#7#8")(::Type{T} where T) at DataStreams.jl:68
└ @ DataStreams.Data C:\Users\jsutherland\.julia\packages\DataStreams\mEqAy\src\DataStreams.jl:68

I’ll keep trying going forward.

bkamins · May 5, 2020, 9:03pm

I think it is best if you open issues in packages you see that have a problem with the new release. You can CC me in these issues so that I can have a look at it. Thank you!

bkamins · May 6, 2020, 3:13pm

I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial with the new functionality. There are still some external packages that need updating (chiefly DataFramesMeta.jl) but I think it is good to have a look now if you are interested. If you find any bugs/things that could be improved please open a PR.

danielw2904 · May 6, 2020, 5:30pm

Thanks a lot for the tutorial! Maybe I missed it in the tutorial but I think applying a transformation to each column is an interesting case as well. Something like:

using DataFrames
import Random
Random.seed!(1);
df = DataFrame(id = 'a':'e', a = 1:5, b = 6:10, c = 11:15, d = rand(5), e = -rand(5).+0.5)

# flag the numeric columns whose values are all positive (safe to pass to log)
to_transform = map(x -> eltype(x) <: Number && all(x.>0), eachcol(df))

# add a log_<name> column for every flagged column
out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> Symbol.("log_",names(df)[to_transform]))

Edit:

Just saw we can index with strings now

out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> "log_" .* names(df)[to_transform])

purplishrock · May 6, 2020, 5:38pm

wow. that’s really excellent. thanks for putting the time and effort to create the tutorial.

clinton · May 7, 2020, 5:33am

Great release, especially the data shaping overhaul.

Edit- Deleted inquiry about id_vars going first in stack- realized you were talking about column order, not argument order.
