Release announcements for DataFrames.jl

announcement

#21

There is also an issue for a fast path that checks whether the data frame is already sorted. What we actually need is contiguous blocks, not necessarily sorted ones, but that is harder to detect. This is a common case, and https://github.com/h2oai/db-benchmark has specific tests for it.
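To make the difference between "sorted" and "contiguous blocks" concrete, here is a minimal sketch in plain Julia. The helper name has_contiguous_groups is hypothetical and not part of DataFrames.jl; it only illustrates why the contiguity check needs extra bookkeeping (a set of keys already seen), whereas Base's issorted needs none.

```julia
# Hypothetical helper (not a DataFrames.jl function): returns true when
# every distinct key occupies exactly one contiguous run of positions.
function has_contiguous_groups(groupkeys::AbstractVector)
    isempty(groupkeys) && return true
    seen = Set{eltype(groupkeys)}()
    prev = first(groupkeys)
    push!(seen, prev)
    for k in Iterators.drop(groupkeys, 1)
        if !isequal(k, prev)
            # A new run starts here; the key must not have closed a run earlier.
            k in seen && return false
            push!(seen, k)
            prev = k
        end
    end
    return true
end

blocks = [2, 2, 1, 1, 3]    # contiguous blocks, but not sorted
scattered = [1, 2, 1]       # key 1 appears in two separate runs

println(issorted(blocks), " ", has_contiguous_groups(blocks))
println(issorted(scattered), " ", has_contiguous_groups(scattered))
```

Any sorted vector also passes the contiguity check, so the sorted test is the cheaper special case of the more general condition.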


#22

I have updated the DataFrames package to version 0.15. When I run the following code in Juno/Jupyter, it throws a warning like the one below.

Can someone explain this warning and tell me how to correct the code?

using MLDataUtils  # MLDataPattern
@views begin
    X, y = undersample((X_train, y_train), shuffle=false)
    # X_resamp, y_resamp = oversample((X_train, y_train), fraction=0.2)
    X = convert(Array{Float64}, X)
    y = convert(Array{Int64}, y)
    X_tst = convert(Array{Float64}, X_test)
    y_tst = convert(Array{Int64}, y_test)
    X_train0 = convert(Array{Float64}, X_train)
    y_train0 = convert(Array{Int64}, y_train)
end

┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838
┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838

#23

You do not have to correct this. The warnings will go away in DataFrames 0.16.0 which should be released relatively soon.

The reason is an optimization we want to introduce: when you access a column of a SubDataFrame, the column will no longer be copied. A view is returned instead, which should be more efficient (especially for large data sets).

This change is breaking, however. For example, when you mutate the returned value you will also mutate the source. This is actually what should have happened all along (the contract DataFrames.jl promised is that reading a column does not copy it), but the implementation was different.

In 99% of use-cases this should not matter, so you can ignore the warnings.
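The copy-versus-view distinction described above can be illustrated with plain Julia arrays, without depending on any particular DataFrames.jl version. This is only a sketch: parent_col and rows are made-up names standing in for a parent column and the rows a SubDataFrame selects.

```julia
parent_col = [10, 20, 30, 40]   # stands in for a column of the parent data frame
rows = 2:3                      # stands in for the rows a SubDataFrame selects

copied = parent_col[rows]       # indexing with a range allocates a new vector
copied[1] = -1                  # only the copy changes

viewed = view(parent_col, rows) # a view shares memory with the parent
viewed[1] = 99                  # writes through to parent_col[2]

println(parent_col)
println(copied)
```

If you rely on mutating the extracted column without touching the source, taking an explicit copy(...) of the view keeps that behavior.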


#24

Thanks. One more question: using first(df,n) instead of head(df,6) gives this warning.

first(data,6)

┌ Warning: In the future eachcol will have names argument set to false by default
│   caller = getmaxwidths(::DataFrame, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Symbol) at show.jl:105
└ @ DataFrames C:\Users\chatura\.julia\packages\DataFrames\5Rg4Y\src\abstractdataframe\show.jl:105

#25

This deprecation is already fixed in 0.15.1, as the fix was non-breaking, so please update to the latest release of the DataFrames.jl package.


#26

Thanks for your quick responses.


#27

A lot of these ideas are already implemented in FastGroupBy.jl. I just need to update it for Julia v1 and we can start benchmarking! I think InternedString.jl will yield HUGE improvements for string grouping, because once the strings are interned, we can group by performing a radix sort on the pointers (which are just UInt64), which is super fast! In fact, I have created a faster radix sort than data.table's in SortingLab.jl; again, I need to update it to Julia v1!

I can't wait to release my updated benchmarks on Julia v1, but before that there is Advent of Code, work, setting up a personal website, disk.frame … I wish I could land a full-time job to focus on the data ecosystem in Julia. Willing to take a pay cut :slight_smile:.
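For readers unfamiliar with the idea, here is a minimal sketch of a least-significant-digit radix sort on UInt64 keys, one byte per pass. It is not the SortingLab.jl or FastGroupBy.jl implementation; the function name radixsort_u64 is made up for illustration, and the keys are ordinary integers rather than real string pointers.

```julia
# LSD radix sort for UInt64 keys: 8 stable counting-sort passes, one per byte.
function radixsort_u64(v::Vector{UInt64})
    src = copy(v)
    dst = similar(src)
    for shift in 0:8:56
        counts = zeros(Int, 256)
        for x in src
            counts[Int((x >> shift) & 0xff) + 1] += 1
        end
        # Turn per-bucket counts into starting positions (exclusive prefix sum).
        offset = 1
        for b in 1:256
            c = counts[b]
            counts[b] = offset
            offset += c
        end
        # Scatter elements into their buckets, preserving relative order.
        for x in src
            b = Int((x >> shift) & 0xff) + 1
            dst[counts[b]] = x
            counts[b] += 1
        end
        src, dst = dst, src
    end
    return src
end

sample_keys = UInt64[0xffffffffffffffff, 0, 256, 255, 1, 0x0100000000000000, 42, 255]
println(radixsort_u64(sample_keys) == sort(sample_keys))
```

Once the keys are sorted, equal keys sit next to each other, so groups can be read off as contiguous runs in a single pass.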


#28

I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial to DataFrames 0.15.2. You will see a number of deprecation warnings there (not that many, but still a bit disturbing).
They will go away in DataFrames 0.16, but I released the update now to highlight the upcoming changes in the underlying mechanics (mainly how getindex, view and eachcol work).


#29

DataFrames.jl 0.16 has been released. You can find the list of changes in the release notes at https://github.com/JuliaData/DataFrames.jl/releases/tag/v0.16.0. The key change is the end of the deprecation period for the getindex and view methods.

The tutorial at https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated.


#30

DataFrames.jl 0.17 has been released (if you are interested in these updates, please subscribe to this thread for future announcements).

The release notes give the details of the changes. The most important changes are:

  • Improved performance of split-apply-combine functions (:+1: for @nalimilan).
  • view is now fully functional and fast in all cases (we always remember the parent DataFrame without creating copies).
  • Improved display of data-frame-related types.
  • Added push! for DataFrameRow.
  • DataFrame constructor accepts SubDataFrame and DataFrameRow objects.

My tutorial https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to include the new functionality.


#31

A small but relevant update: with this version of DataFrames.jl we also re-export a new version of Missings.jl, which introduces the passmissing function. We would welcome feedback on this functionality (as it might eventually be considered for inclusion in Julia Base).

The passmissing function wraps any Julia function f (typically missing “unaware”) accepting positional arguments so that if any of them is missing it returns missing and otherwise calls f with the passed arguments. Here is an example showing you the difference:

julia> string(missing, " ", missing)
"missing missing"

julia> passmissing(string)(missing, " ", missing)
missing
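The behavior described above can be approximated in a few lines of plain Julia. This is a hand-written stand-in for illustration only, not the Missings.jl implementation; the name my_passmissing is made up.

```julia
# Hypothetical stand-in for passmissing: wrap f so that any missing
# positional argument short-circuits to missing.
my_passmissing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

safe_upper = my_passmissing(uppercase)

# Typical use: apply a missing-unaware function over a column with missings.
cleaned = map(safe_upper, ["a", missing, "b"])
println(cleaned)
```

Without the wrapper, uppercase(missing) raises a MethodError, which is the situation passmissing is meant to smooth over.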

#32

We have released DataFrames.jl version 0.17.1. See the release notes for details.

Apart from polishing corner cases, the thing that a typical user might be interested in is that two convenience functions for working with GroupedDataFrame were introduced:

  • groupvars: gives you a vector of names of columns in the parent data frame used for grouping
  • groupindices: gives you a vector of row group indices in the parent data frame
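To show what a "vector of row group indices" means, here is a small sketch in plain Julia that assigns each row the number of the group it belongs to. The helper row_group_indices is hypothetical, and numbering groups in order of first appearance is an assumption made for this example; the numbering DataFrames.jl itself returns may differ.

```julia
# Hypothetical helper: for each row of a grouping column, return the index
# of the group that row belongs to (groups numbered by first appearance).
function row_group_indices(col::AbstractVector)
    ids = Dict{eltype(col), Int}()
    out = Vector{Int}(undef, length(col))
    for (i, x) in enumerate(col)
        # Reuse the existing group number, or assign the next free one.
        out[i] = get!(ids, x, length(ids) + 1)
    end
    return out
end

grouping_col = ["b", "a", "b", "c", "a"]
println(row_group_indices(grouping_col))
```

Such a vector has one entry per row of the parent data frame, which makes it easy to look up the group of any given row.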

https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to reflect the changes in the release.