FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.
Good point - I will add this (I started doing these tutorials before Project.toml was an option).
If you have a few minutes to spare, @bkamins, could you cast your expert eye over the Wikibooks chapter to see if I've made any howlers? It dates back a long way (2015) but I'm sentimental and don't want to just delete the whole section and point to your (much better) material. (Besides, I'd have to redo all the links.)
(Just mention any problems, don't waste time learning the Wikibooks markup language.)
Hi All,
After almost half a year from 0.19 release we made it to 0.20. It is really a big release (79 PRs merged since 0.19).
Firstly, I would like to thank all who contributed to it. A chief person to mention is @nalimilan, who has constantly been curating the package. The number of contributors in the period was so large that when I thought about listing all who were involved in the transition from 0.19 to 0.20 it was really hard. I have managed to pull out two groups of logins from GitHub (and apologies if I missed someone - I will gladly correct the list; I dropped @ in front of the logins as Discourse does not allow me to mention them all directly in this post, as there are too many of them):
- people who opened a PR that was merged for 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
- people who opened an issue that was discussed for 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh
Both lists were impressive to me (I had not expected such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.
Now, as this release is so big I am summarizing here only the major changes from 0.19 to 0.20:
- some functions (join, groupby and show-related) now check if data frames passed to them are internally consistent; this should help users in catching bugs early
- Indexing changes
  - it is now allowed to create new columns using `:` as row selector
  - broadcasting into an element of a `DataFrame` is now correct
  - when creating a column using broadcasting the new column always has the number of rows equal to number of rows of a data frame before the operation
  - broadcasting over `GroupedDataFrame` is now reserved
  - `df[!, cols]` is now allowed in `setindex!` and broadcasting assignment
  - generators are now not allowed as left hand side of assignment operation to `DataFrameRow`
- `describe` now shows actual `eltype` of the column
- added `allowmissing`, `disallowmissing` and `categorical` functions for data frame objects
- cleaning up code to avoid throwing more specific error types
- `unstack` now allows for renaming of the columns
- `Tuple` as a value of `on` keyword argument in `join` is now deprecated (use `Pair` instead)
- `Between`, `All` and `Not` are allowed in indexing
- `DataFrameRows` and `DataFrameColumns` now have custom `show` methods
- `DataFrameRows` and `DataFrameColumns` support `getproperty` now
- `columnindex` from Tables.jl is now exported
- `categorical!` now allows types as `cols` argument
- we no longer use `makeunique=true` for grouping keys in `combine` and `map`
- `by` now has `skipmissing` keyword argument
- redesign `push!`, `append!` and `vcat` to make them more consistent
- `mapcols` now never reuses source columns without copying
- significantly improved `sort` performance
- `disallowmissing!` and `disallowmissing` now accept `error` keyword argument
- `select` and `select!` now allow passing multiple columns arguments
- `permutecols!` is deprecated (use `select!` instead)
- fixed a bug in hash handling in `row_group_slots`
- fully switched to Travis only CI
- `join` now accepts more than two data frames to be joined
- `rename!` now allows permutation of column names in renaming
- `names!` is deprecated (use `rename!`)
- `aggregate` now accepts `skipmissing`
- `groupby` now accepts `cols` argument to be an empty vector
- `copycols` in `DataFrame` constructors is now more permissive (it does not error when it is not possible not to copy when `copycols=false`)
- `join` now allows mixing `Symbol` and `Pair{Symbol, Symbol}` in `on` keyword argument
- added `flatten` function
- views now disallow duplicate columns
- `describe` now has `cols` keyword argument
- add conversion to `Array` for data frame and data frame row
- `rename!` and `rename` now accept strings and integers (apart from `Symbol`s) for renaming
- `io` support in `describe` is deprecated
- `melt` is now deprecated (use `stack` with `Not` selector instead)
- `stackdf` and `meltdf` are now deprecated (use `view=true` in `stack` instead)
- `GroupedDataFrame` now supports `keys` with information on values of grouping columns; this is also allowed in `GroupedDataFrame` indexing (also `Tuple` or `NamedTuple` are allowed)
- switch to strict upper bounds of package compatibility
Finally, we are fairly close to 1.0 release. Here is a list of current to-do things for 1.0, so as you can see it is not very long (probably it will grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.
EDIT
The DataFrames Tutorial is now updated and reflects the changes in 0.20.0 release.
Thank you to @bkamins, @quinnj, @nalimilan and other contributors for the continued updates. Over the last few months, updates to DataFrames, CSV and 1.2->1.3 have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.
Whoa… eye-popping improvement! I compared pandas and DataFrames/CSV just now. Reading a 1 GB CSV file took pandas 13 seconds, while DataFrames/CSV took only 1.8 seconds. Many thanks to the contributors. You are geniuses.
We have DataFrames.jl release 0.21. This is a very big release with 102 PRs merged. Thanks to all who worked on it (the issues and the PRs). Due to the large number of contributors I list here only the people who opened a merged PR since the 0.20 release: anandijain, DilumAluthge, jlumpe, jonas-schulze, nalimilan, nickeubank, non-Jedi, omus, oxinabox, pdeffebach, pearlzli, prosoitos, quinnj, ssikdar1, tkf, vonDonnerstein (I had to remove @ as Discourse disallows mentioning so many users in a single post).
The detailed release notes (with all issues and PRs closed) are here: Release v0.21.0 · JuliaData/DataFrames.jl · GitHub.
Here are the main highlights:
Breaking:
- complete redesign of `select`, `select!`, `transform`, `transform!` and `combine` (now we roughly match dplyr functionality in a single consistent system; the list of changes is too long to list them here - please read the docstrings of `select` and `combine`)
- deprecate `by`, `map` and `aggregate`
- deprecate `join` in favor of `innerjoin`, `outerjoin`, etc.
- columns can be indexed using strings, all functions are updated accordingly
- all types consistently support `names` which produces `Vector{String}` and `propertynames` which produces `Vector{Symbol}`
- Tables.rows iterates `DataFrameRows` to avoid compilation for very wide tables
- remove `lastindex` without a dimension
- deprecate `names=true` in `eachcol`
- change `ArgumentError` to `DimensionMismatch` in several methods (where it was more suitable)
- give `ErrorException` when trying to iterate `AbstractDataFrame`
- change `⍰` to `?` when showing a DataFrame and type display improvements
- make `id_vars` go first in `stack`
- add `groupcols` and `valuecols` functions; deprecate `groupvars`
- deprecate passing tuple of columns to `sort`
- rename `deleterows!` to `delete!`
- change `eltype` of `NamedTuple` from `DataFrameRow`
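As a rough illustration of the redesigned `select`/`transform`/`combine` system and the new join functions, here is a minimal sketch. The data and the output column names (`sqrt_x`, `x_centered`, `x_mean`, `label`) are invented for the example, and the docstrings of `select` and `combine` remain the authoritative description of the rules.

```julia
using DataFrames
using Statistics

# Hypothetical example data
df = DataFrame(g = ["a", "a", "b"], x = [1.0, 2.0, 4.0])

# `select` keeps only the listed or computed columns;
# ByRow applies the function to each element rather than to the whole column
sel = select(df, :x => ByRow(sqrt) => :sqrt_x)

# `transform` keeps all source columns and appends the new one
tr = transform(df, :x => (v -> v .- mean(v)) => :x_centered)

# `combine` on a GroupedDataFrame covers what `by` and `aggregate` did
agg = combine(groupby(df, :g), :x => mean => :x_mean)

# explicit join functions take the place of the deprecated `join`
labels = DataFrame(g = ["a", "b"], label = ["first", "second"])
joined = innerjoin(df, labels, on = :g)
```

All three functions share the same `source => function => destination` form, which is what makes the new system consistent across `select`, `transform` and `combine`.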
New features:
- allow `:union` as `cols` kwarg in `push!` and `append!`; also allow autopromotion of column eltypes
- `DataFrameRows` and `DataFrameColumns` support Tables.jl interface
- `names` allows column selector as a second positional argument
- `variable_eltype` kwarg added to `stack`
- improve performance of `unstack`
- add `convert` and `merge` to `DataFrameRow`
- define `summary` for `GroupedDataFrame`
- returning an empty table in `combine` drops a group
- `insertcols!` now allows passing multiple columns
- improve indexing of `GroupedDataFrame` with keys; make such lookup fast (in consequence DataFrames.jl now provides a fast lookup!)
- define consistent rules of pseudo-broadcasting in DataFrames.jl (in particular unwrap `Ref` and `0`-dimensional arrays)
- re-export Tables.jl
- allow `Pair` argument in `filter` and `filter!`
- improve `flatten`
- add `haskey` to `GroupedDataFrame` and `GroupKey`
- add `eltypes` kwarg to `show`
- add `mapcols!` and `repeat!`, fix corner cases of `repeat`
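Some of these additions can be tried out in a few lines. This is only a sketch with made-up data and column names (`a`, `b`, `c`), written against a recent DataFrames.jl release and Julia 1.2 or later (for the `>(1)` shorthand).

```julia
using DataFrames

# Hypothetical example data
df = DataFrame(a = [1, 2], b = ["x", "y"])

# cols = :union adds columns missing from `df` and fills the gaps with
# `missing`, promoting column eltypes where needed
push!(df, (a = 3, c = 1.5); cols = :union)

# a `column => predicate` Pair passes only that column to the predicate
big_a = filter(:a => >(1), df)

# a column selector as the second positional argument of `names`
not_a = names(df, Not(:a))

# `haskey` on a GroupedDataFrame checks whether a group exists
gdf = groupby(df, :a)
has_three = haskey(gdf, (3,))
has_ten = haskey(gdf, (10,))
```

After the `push!`, the new row has `missing` in column `b` and the two original rows have `missing` in the new column `c`.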
Bugfixes:
- fix grouped maximum, minimum, var and std with only missing values
- fix `combine` when different functions return groups of different lengths
- fix `combine` when `DataFrameRow` was returned
- fix the `groups` field values when `GroupedDataFrame` is returned by `combine` (previously `map`)
- respect `IOContext` of `io` when printing
- fix eltype in `stack` with `view=true`
- fix circular ref bug in `show`; improve showing of special types
Other:
- many documentation improvements
- improve organization of codebase
- fix `BoundsError` messages
- remove `readtable` and `writetable` from deprecated
- update up to Julia 1.5 nightly
The plans for the future are the following. Ideally the next release is 1.0 and we do not include any breaking changes (the reality might turn out to be different though).
The key objectives between the 0.21 release and the 1.0 release are:
- documentation improvements
- decouple DataFramesBase.jl as a lightweight low-level API package
- adding requested non-breaking functionality
- find as many bugs as possible before 1.0 release
If this goes as planned we shall make 1.0 release in 3 to 6 months from now (depending how the things progress and the user feedback).
I will also update https://github.com/bkamins/Julia-DataFrames-Tutorial soon (we need other packages to sync with DataFrames.jl release 0.21 before this). I will post when this is done.
This probably will break my code, but I'm glad things are moving towards 1.0, as far as I know. Thanks everyone for the hard work you put into DataFrames.
I tried to add deprecations wherever possible. But for example `names` now returns `String`s which is a hard breakage. Still - we now allow strings for column indexing so hopefully in most cases it should "just work".
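To make the `names` change concrete, here is a small sketch (the data frame is hypothetical) of how code that relied on `Symbol` column names might be adapted:

```julia
using DataFrames

# Hypothetical example data
df = DataFrame(a = 1:2, b = 3:4)

string_names = names(df)          # Vector{String} since 0.21
symbol_names = propertynames(df)  # Vector{Symbol}

# code that compared names against Symbols can switch to propertynames
has_a = :a in propertynames(df)

# indexing accepts either form and returns the same stored column
same_column = df[!, "a"] === df[!, :a]
```

Code that only passes the result of `names(df)` back into indexing or into `select` should keep working unchanged, since strings are now accepted there.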
Same… It's a bit of an inconvenient time for me, but it was never going to be otherwise. This is huge! Thanks to everyone that worked on this! I'm off to the docs to figure out all the ways I need to change my habits - looks like a bunch of my common patterns are deprecated!
This was a hard decision, but we had to make the changes at some point - the objective was to make it in one-shot so that people need to update their code now, and hopefully nothing major will change in the near future.
I know - I watched some of the issues where breakages were being proposed. I really appreciate all of the thoughtfulness that went into the decisions, and in the end, I think that the current pain will be transient, and the benefits long-lasting.
Thanks! Looks like a great release!
Looks like there are some packages such as ODBC that still aren't compatible but don't restrict the DataFrames version or force a downgrade. I set up an environment to try this release out but hit a roadblock because of this deprecation:
┌ Warning: `T` is deprecated, use `nonmissingtype` instead.
│   caller = (::DataStreams.Data.var"#7#8")(::Type{T} where T) at DataStreams.jl:68
└ @ DataStreams.Data C:\Users\jsutherland\.julia\packages\DataStreams\mEqAy\src\DataStreams.jl:68
I'll keep trying going forward.
I think it is best if you open issues in packages you see that have a problem with the new release. You can CC me in these issues so that I can have a look at it. Thank you!
I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial with the new functionality. There are still some external packages that need updating (chiefly DataFramesMeta.jl) but I think it is good to have a look now if you are interested. If you find any bugs/things that could be improved please open a PR.
Thanks a lot for the tutorial! Maybe I missed it in the tutorial but I think applying a transformation to each column is an interesting case as well. Something like:
using DataFrames
import Random
Random.seed!(1);
df = DataFrame(id = 'a':'e', a = 1:5, b = 6:10, c = 11:15, d = rand(5), e = -rand(5).+0.5)
to_transform = map(x -> eltype(x) <: Number && all(x.>0), eachcol(df))
out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> Symbol.("log_",names(df)[to_transform]))
Edit:
Just saw we can index with strings now
out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> "log_" .* names(df)[to_transform])
wow. that's really excellent. thanks for putting the time and effort to create the tutorial.
Great release, especially the data shaping overhaul.
Edit - Deleted inquiry about `id_vars` going first in `stack` - realized you were talking about column order, not argument order.