FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.
Good point - I will add this (I started doing these tutorials before Project.toml was an option).
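For reference, here is a minimal sketch of how a reader could reproduce the notebook environment once both files are in the repo. The `repo_dir` path is a hypothetical placeholder for wherever the repo was cloned, and it is assumed to contain both Project.toml and Manifest.toml.

```julia
using Pkg

# Hypothetical location: the root of the cloned tutorial repository,
# assumed to contain both Project.toml and Manifest.toml.
repo_dir = "."

Pkg.activate(repo_dir)  # make that directory's environment the active one
Pkg.instantiate()       # install the package versions recorded in Manifest.toml
```

With a Manifest.toml present, `Pkg.instantiate()` installs the recorded versions rather than resolving fresh ones, which is what makes the notebooks reproducible.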
If you have a few minutes to spare, @bkamins, could you cast your expert eye over the wiki books chapter to see if I've made any howlers? It dates back a long way (2015) but I'm sentimental and don't want to just delete the whole section and point to your (much better) material. (Besides, I'd have to redo all the links.)
(Just mention any problems, don't waste time learning the Wikibooks markup language.)
Hi All,
Almost half a year after the 0.19 release, we have made it to 0.20. It is a really big release (79 PRs merged since 0.19).
Firstly, I would like to thank everyone who contributed to it. A chief person to mention is @nalimilan, who has been constantly curating the package. So many people contributed in this period that listing everyone involved in the transition from 0.19 to 0.20 was really hard. I have managed to pull two groups of logins from GitHub (apologies if I missed someone - I will gladly correct the list; I dropped the @ in front of the logins because Discourse does not allow me to mention them all directly in this post, as there are too many of them):
- people who opened a PR that was merged for 0.20 release: aminya, ararslan, asinghvi17, dmolina, Ellipse0934, jlumpe, kojix2, laborg, nalimilan, nilshg, pdeffebach, petershintech, quinnj
- people who opened an issue that was discussed for 0.20 release: anandijain, ChrisRackauckas, clintonTE, clynamen, Codsilla, daisy12321, davidanthoff, del2z, Drvi, eperim, evveric, ExpandingMan, felluksch, grahamgill, ianshmean, jablauvelt, juliohm, kescobo, mattBrzezinski, nicoleepp, oschulz, oxinabox, PharmCat, pmarg, pmcvay, proudindiv, pstaabp, rapus95, ronisbr, scls19fr, SimonEnsemble, stakaz, tlienart, ufechner7, waweruk2001, xiaodaigh
Both lists impressed me (I did not expect such a big contributing community), so it seems that DataFrames.jl is going strong. Thank you all for working on it.
Now, as this release is so big I am summarizing here only the major changes from 0.19 to 0.20:
- some functions (`join`, `groupby` and `show`-related) now check if data frames passed to them are internally consistent; this should help users in catching bugs early
- Indexing changes
  - it is now allowed to create new columns using `:` as row selector
  - broadcasting into an element of a `DataFrame` is now correct
  - when creating a column using broadcasting the new column always has the number of rows equal to number of rows of a data frame before the operation
  - broadcasting over `GroupedDataFrame` is now reserved
  - `df[!, cols]` is now allowed in `setindex!` and broadcasting assignment
  - generators are now not allowed as left hand side of assignment operation to `DataFrameRow`
- `describe` now shows actual `eltype` of the column
- added `allowmissing`, `disallowmissing` and `categorical` functions for data frame objects
- cleaning up code to avoid throwing more specific error types
- `unstack` now allows for renaming of the columns
- `Tuple` as a value of `on` keyword argument in `join` is now deprecated (use `Pair` instead)
- `Between`, `All` and `Not` are allowed in indexing
- `DataFrameRows` and `DataFrameColumns` now have custom `show` methods
- `DataFrameRows` and `DataFrameColumns` support `getproperty` now
- `columnindex` from Tables.jl is now exported
- `categorical!` now allows types as `cols` argument
- we no longer use `makeunique=true` for grouping keys in `combine` and `map`
- `by` now has `skipmissing` keyword argument
- redesign `push!`, `append!` and `vcat` to make them more consistent
- `mapcols` now never reuses source columns without copying
- significantly improved `sort` performance
- `disallowmissing!` and `disallowmissing` now accept `error` keyword argument
- `select` and `select!` now allow passing multiple columns arguments
- `permutecols!` is deprecated (use `select!` instead)
- fixed a bug in hash handling in `row_group_slots`
- fully switched to Travis only CI
- `join` now accepts more than two data frames to be joined
- `rename!` now allows permutation of column names in renaming
- `names!` is deprecated (use `rename!`)
- `aggregate` now accepts `skipmissing`
- `groupby` now accepts `cols` argument to be an empty vector
- `copycols` in `DataFrame` constructors is now more permissive (it does not error when it is not possible not to copy when `copycols=false`)
- `join` now allows mixing `Symbol` and `Pair{Symbol, Symbol}` in `on` keyword argument
- added `flatten` function
- views now disallow duplicate columns
- `describe` now has `cols` keyword argument
- add conversion to `Array` for data frame and data frame row
- `rename!` and `rename` now accept strings and integers (apart from `Symbol`s) for renaming
- `io` support in `describe` is deprecated
- `melt` is now deprecated (use `stack` with `Not` selector instead)
- `stackdf` and `meltdf` are now deprecated (use `view=true` in `stack` instead)
- `GroupedDataFrame` now supports `keys` with information on values of grouping columns; this is also allowed in `GroupedDataFrame` indexing (also `Tuple` or `NamedTuple` are allowed)
- switch to strict upper bounds of package compatibility
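A few of the items above can be tried out with a small sketch like the one below. The data frame and all variable names are made up for illustration, and the code is written on the assumption of a recent DataFrames.jl 1.x release, so details may differ slightly from 0.20 itself.

```julia
using DataFrames

df = DataFrame(id = 1:3, x = [10, 20, 30], y = [[1, 2], [3], [4, 5]])

# `:` as a row selector may create a new column (the vector is copied)
df[:, :z] = ["a", "b", "c"]

# `Not` and `Between` work as column selectors in indexing
without_id = df[:, Not(:id)]
x_through_z = df[:, Between(:x, :z)]

# `flatten` gives every element of a vector-valued column its own row
flat = flatten(df, :y)

# `keys` on a GroupedDataFrame exposes the grouping values,
# and a Tuple or NamedTuple of those values can index the groups
gdf = groupby(flat, :id)
first_key = first(keys(gdf))
group_one = gdf[(id = 1,)]
```

Here `flat` has five rows (one per element of `y`), and `group_one` is the view of the two rows that came from `id == 1`.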
Finally, we are fairly close to 1.0 release. Here is a list of current to-do things for 1.0, so as you can see it is not very long (probably it will grow a bit on the way). I plan to have a 0.21 release as a beta before 1.0.
EDIT
The DataFrames Tutorial is now updated and reflects the changes in 0.20.0 release.
Thank you to @bkamins, @quinnj, @nalimilan and other contributors for the continued updates. Over the last few months, updates to DataFrames, CSV and 1.2->1.3 have led to about an 80% speed increase in my data-intensive code, with gains spread fairly evenly across different types of IO operations, computations, and data manipulations.
Whoa… eye-popping improvement! I compared pandas and DataFrames/CSV just now. Reading a 1GB CSV file took pandas 13 seconds, while DataFrames/CSV took only 1.8 seconds. Thank you very much, contributors. You are geniuses.
We have a DataFrames.jl 0.21 release. This is a very big release with 102 PRs merged. Thanks to all who worked on it (the issues and the PRs). Due to the large number of contributors I list here only the people who opened a merged PR since the 0.20 release: anandijain, DilumAluthge, jlumpe, jonas-schulze, nalimilan, nickeubank, non-Jedi, omus, oxinabox, pdeffebach, pearlzli, prosoitos, quinnj, ssikdar1, tkf, vonDonnerstein (I had to remove the @ as Discourse does not allow mentioning so many users in a single post).
The detailed release notes (with all issues and PRs closed) are here: Release v0.21.0 · JuliaData/DataFrames.jl · GitHub.
Here are the main highlights:
Breaking:
- complete redesign of `select`, `select!`, `transform`, `transform!` and `combine` (now we roughly match dplyr functionality in a single consistent system; the list of changes is too long to list them here - please read the docstrings of `select` and `combine`)
- deprecate `by`, `map` and `aggregate`
- deprecate `join` in favor of `innerjoin`, `outerjoin`, etc.
- columns can be indexed using strings, all functions are updated accordingly
- all types consistently support `names` which produces `Vector{String}` and `propertynames` which produces `Vector{Symbol}`
- Tables.rows iterates `DataFrameRows` to avoid compilation for very wide tables
- remove `lastindex` without a dimension
- deprecate `names=true` in `eachcol`
- change `ArgumentError` to `DimensionMismatch` in several methods (where it was more suitable)
- give `ErrorException` when trying to iterate `AbstractDataFrame`
- change `⍰` to `?` when showing a DataFrame and type display improvements
- make `id_vars` go first in `stack`
- add `groupcols` and `valuecols` functions; deprecate `groupvars`
- deprecate passing tuple of columns to `sort`
- rename `deleterows!` to `delete!`
- change `eltype` of `NamedTuple` from `DataFrameRow`
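To give a feel for the redesigned `select`/`transform`/`combine` system and the new join functions, here is a minimal sketch. The data and variable names are made up for illustration, and it assumes a recent DataFrames.jl 1.x release; the docstrings remain the authoritative description.

```julia
using DataFrames, Statistics

df = DataFrame(g = ["a", "a", "b"], v = [1.0, 3.0, 5.0])

# Grouped summary; this covers the ground of the deprecated `by`/`aggregate`
means = combine(groupby(df, :g), :v => mean => :v_mean)

# `transform` keeps existing columns and adds the new one;
# `ByRow` applies the function to each element instead of the whole column
with_sqrt = transform(df, :v => ByRow(sqrt) => :v_sqrt)

# `select` keeps only what is listed; a plain function receives the whole column
doubled = select(df, :g, :v => (x -> 2 .* x) => :v2)

# Explicit join functions replace the generic `join`
labels = DataFrame(g = ["a", "c"], label = ["first", "third"])
inner = innerjoin(df, labels, on = :g)
outer = outerjoin(df, labels, on = :g)
```

Every operation follows the same `source => function => target` pattern, so a habit learned with `select` carries over to `transform` and `combine`.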
New features:
- allow `:union` as `cols` kwarg in `push!` and `append!`; also allow autopromotion of column eltypes
- `DataFrameRows` and `DataFrameColumns` support Tables.jl interface
- `names` allows column selector as a second positional argument
- `variable_eltype` kwarg added to `stack`
- improve performance of `unstack`
- add `convert` and `merge` to `DataFrameRow`
- define `summary` for `GroupedDataFrame`
- returning an empty table in `combine` drops a group
- `insertcols!` now allows passing multiple columns
- improve indexing of `GroupedDataFrame` with keys; make such lookup fast (in consequence DataFrames.jl now provides a fast lookup!)
- define consistent rules of pseudo-broadcasting in DataFrames.jl (in particular unwrap `Ref` and 0-dimensional arrays)
- re-export Tables.jl
- allow `Pair` argument in `filter` and `filter!`
- improve `flatten`
- add `haskey` to `GroupedDataFrame` and `GroupKey`
- add `eltypes` kwarg to `show`
- add `mapcols!` and `repeat!`, fix corner cases of `repeat`
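Several of these features can be seen together in a short sketch. The data and variable names are invented for illustration, and the code assumes a recent DataFrames.jl 1.x release rather than 0.21 exactly.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# `cols = :union`: the pushed row may lack existing columns and bring new ones;
# gaps are filled with `missing` and column eltypes are promoted as needed
push!(df, (a = 3, c = 1.5); cols = :union)

# `names` with a column selector as the second positional argument
non_a = names(df, Not(:a))

# `insertcols!` with several columns at once
insertcols!(df, :d => [10, 20, 30], :e => ["p", "q", "r"])

# `filter` with a `Pair`: the predicate sees only the values of column `a`
big_a = filter(:a => >(1), df)

# `haskey` on a GroupedDataFrame, here with a Tuple of grouping values
gdf = groupby(df, :a)
has_two = haskey(gdf, (2,))
has_nine = haskey(gdf, (9,))
```

After the `push!`, column `b` holds `missing` in the third row and column `c` holds `missing` in the first two, which is the autopromotion mentioned in the first item.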
Bugfixes:
- fix grouped maximum, minimum, var and std with only missing values
- fix `combine` when different functions return groups of different lengths
- fix `combine` when `DataFrameRow` was returned
- fix the `groups` field values when `GroupedDataFrame` is returned by `combine` (previously `map`)
- respect `IOContext` of `io` when printing
- fix eltype in `stack` with `view=true`
- fix circular ref bug in `show`; improve showing of special types
Other:
- many documentation improvements
- improve organization of codebase
- fix `BoundsError` messages
- remove `readtable` and `writetable` from deprecated
- update up to Julia 1.5 nightly
The plans for the future are the following. Ideally the next release is 1.0 and we do not include any breaking changes (the reality might turn out to be different though).
The key objectives between the 0.21 release and the 1.0 release are:
- documentation improvements
- decouple DataFramesBase.jl as a lightweight low-level API package
- adding requested non-breaking functionality
- find as many bugs as possible before 1.0 release
If this goes as planned we shall make 1.0 release in 3 to 6 months from now (depending how the things progress and the user feedback).
I will also update https://github.com/bkamins/Julia-DataFrames-Tutorial soon (we need other packages to sync with DataFrames.jl release 0.21 before this). I will post when this is done.
This will probably break my code, but I'm glad things are moving towards 1.0, as far as I know. Thanks everyone for the hard work you put into DataFrames.
I tried to add deprecations wherever possible. But, for example, `names` now returns strings, which is a hard breakage. Still - we now allow strings for column indexing, so hopefully in most cases it should "just work".
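A small sketch of what this means in practice (the data frame and variable names are made up for illustration, assuming a recent DataFrames.jl 1.x release):

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

col_strings = names(df)          # Vector{String}
col_symbols = propertynames(df)  # Vector{Symbol}

# Strings and Symbols are interchangeable for column indexing
same_column = df[:, "a"] == df[:, :a]
picked = select(df, "b")

# Code that relied on `names` returning Symbols can convert explicitly
legacy_names = Symbol.(names(df))
```

Code that only passes the result of `names` back into indexing or `select` should keep working unchanged; code that compares it against Symbols is where the conversion (or `propertynames`) is needed.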
Same… It's a bit of an inconvenient time for me, but it was never going to be otherwise. This is huge! Thanks to everyone that worked on this! I'm off to the docs to figure out all the ways I need to change my habits - looks like a bunch of my common patterns are deprecated!
This was a hard decision, but we had to make the changes at some point - the objective was to make it in one-shot so that people need to update their code now, and hopefully nothing major will change in the near future.
I know - I watched some of the issues where breakages were being proposed. I really appreciate all of the thoughtfulness that went into the decisions, and in the end, I think that the current pain will be transient, and the benefits long-lasting.
Thanks! Looks like a great release!
Looks like there are some packages such as ODBC that still aren't compatible but don't restrict the DataFrames version or force a downgrade. I set up an environment to try this release out but hit a roadblock because of this deprecation:
┌ Warning: `T` is deprecated, use `nonmissingtype` instead.
│   caller = (::DataStreams.Data.var"#7#8")(::Type{T} where T) at DataStreams.jl:68
└ @ DataStreams.Data C:\Users\jsutherland\.julia\packages\DataStreams\mEqAy\src\DataStreams.jl:68
I'll keep trying going forward.
I think it is best if you open issues in packages you see that have a problem with the new release. You can CC me in these issues so that I can have a look at it. Thank you!
I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial with the new functionality. There are still some external packages that need updating (chiefly DataFramesMeta.jl) but I think it is good to have a look now if you are interested. If you find any bugs/things that could be improved please open a PR.
Thanks a lot for the tutorial! Maybe I missed it in the tutorial but I think applying a transformation to each column is an interesting case as well. Something like:
using DataFrames
import Random
Random.seed!(1);
df = DataFrame(id = 'a':'e', a = 1:5, b = 6:10, c = 11:15, d = rand(5), e = -rand(5).+0.5)
to_transform = map(x -> eltype(x) <: Number && all(x.>0), eachcol(df))
out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> Symbol.("log_",names(df)[to_transform]))
Edit:
Just saw we can index with strings now
out = transform(df, names(df)[to_transform] .=> ByRow(log) .=> "log_" .* names(df)[to_transform])
Wow, that's really excellent. Thanks for putting in the time and effort to create the tutorial.
Great release, especially the data shaping overhaul.
Edit- Deleted inquiry about id_vars going first in stack- realized you were talking about column order, not argument order.