DataFrames v0.15.0 released

announcement

#21

There is also this issue about a fast path that checks whether the data frame is already sorted. (What we actually need is contiguous blocks, not necessarily sorted order, but that is harder to check.) This is a common case, and https://github.com/h2oai/db-benchmark has specific tests for it.
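
To make the difference between "sorted" and "contiguous blocks" concrete, here is a minimal sketch in plain Julia with no DataFrames dependency. `issorted` is the Base function; `has_contiguous_groups` is a hypothetical helper name used only for illustration and is not part of DataFrames.jl.

```julia
# Hypothetical helper (not a DataFrames.jl function): returns true when every
# distinct key occupies a single contiguous run, regardless of sort order.
function has_contiguous_groups(groupkeys::AbstractVector)
    seen = Set{eltype(groupkeys)}()
    prev = nothing
    started = false
    for k in groupkeys
        if !started || !isequal(k, prev)
            # A new run starts here; if the key was seen before, its block is split.
            k in seen && return false
            push!(seen, k)
            prev = k
            started = true
        end
    end
    return true
end

sorted_keys  = [1, 1, 2, 3]
blocked_keys = [2, 2, 1, 1, 3]   # not sorted, but each key forms one block
mixed_keys   = [1, 2, 1]         # key 1 appears in two separate blocks

issorted(sorted_keys)               # cheap check for the sorted fast path
has_contiguous_groups(blocked_keys) # the weaker condition discussed above
has_contiguous_groups(mixed_keys)
```

The sorted check needs only one comparison per element, while this version of the contiguity check also has to remember every key seen so far, which is one way to see why it is the harder case.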


#22

I have updated the DataFrames package to version 0.15. When I run the following code in Juno/Jupyter, it throws warnings like the ones below.

Can someone explain this warning and tell me how to correct the code?

using MLDataUtils  #MLDataPattern,
@views begin
    X,y=undersample((X_train,y_train),shuffle=false);
    # X_resamp,y_resamp=oversample((X_train,y_train),fraction=0.2);
    X =  convert(Array{Float64}, X);
    y = convert(Array{Int64}, y);
    X_tst=  convert(Array{Float64}, X_test);
    y_tst =  convert(Array{Int64}, y_test);
    X_train0 =  convert(Array{Float64}, X_train);
    y_train0 = convert(Array{Int64}, y_train);
end

Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838
┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838

#23

You do not have to correct this. The warnings will go away in DataFrames 0.16.0 which should be released relatively soon.

The reason is that we want to introduce an optimization: when you access a column of a SubDataFrame, it is not copied; instead a view is returned, which should be more efficient (especially for large data sets).

However, this change is breaking: for example, when you mutate what you got, you will mutate the source. Actually, this is what should have happened in the past (the contract DataFrames.jl promised was that no copy is performed on column read), but the implementation was different.
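
The copy-versus-view difference can be illustrated with plain Base arrays. This is only a sketch of the semantics under the assumption that a column behaves like an ordinary `Vector`; it does not use the DataFrames API, and the variable names are made up for the example.

```julia
# A plain vector stands in for a data frame column (illustrative assumption).
col  = [10, 20, 30, 40]
rows = 2:3                     # the rows a SubDataFrame would select

copied = col[rows]             # indexing with a range allocates a new vector
viewed = view(col, rows)       # a view shares memory with `col`

copied[1] = -1                 # changes only the copy
viewed[1] = 99                 # writes through to the parent vector

col        # now [10, 99, 30, 40]
copied     # now [-1, 30]
```

If you need the old behavior after the change, calling `copy` on the view gives an independent vector.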

In 99% of use-cases this should not matter, so you can ignore the warnings.


#24

Thanks. One more question: using first(df,n) instead of head(df,6) gives this warning.

first(data,6)

Warning: In the future eachcol will have names argument set to false by default
│   caller = getmaxwidths(::DataFrame, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Symbol) at show.jl:105
└ @ DataFrames C:\Users\chatura\.julia\packages\DataFrames\5Rg4Y\src\abstractdataframe\show.jl:105

#25

This deprecation is already fixed in 0.15.1, as the fix was non-breaking, so please check out the latest release of the DataFrames.jl package.


#26

Thanks for your quick responses.


#27

A lot of these ideas are already implemented in FastGroupBy.jl; I just need to update it for Julia v1 and we can start benchmarking! I think InternedString.jl will yield HUGE improvements for string grouping, because once the strings are interned, we can group them by performing a radix sort on the pointers (which are just UInt64), which is super fast! In fact, I have created a radix sort faster than data.table's in SortingLab.jl; again, I need to update it to Julia v1!
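
As a rough sketch of the idea (not the actual FastGroupBy.jl, SortingLab.jl, or InternedString.jl code), the example below assigns each distinct string a UInt64 code and then runs a simple least-significant-digit radix sort on those codes. The codes stand in for the interned pointers; `intern_codes` and `radixsort_u64` are hypothetical names introduced only for this illustration.

```julia
# Hypothetical stand-in for interning: equal strings get the same UInt64 code.
function intern_codes(strs::AbstractVector{<:AbstractString})
    table = Dict{String,UInt64}()
    return UInt64[get!(() -> UInt64(length(table) + 1), table, String(s)) for s in strs]
end

# Simple LSD radix sort on UInt64 keys, one byte (256 buckets) per pass.
function radixsort_u64(v::Vector{UInt64})
    src = copy(v)
    dst = similar(src)
    for shift in 0:8:56
        counts = zeros(Int, 256)
        for x in src
            counts[((x >> shift) & 0xff) + 1] += 1
        end
        # Turn bucket counts into starting positions in the output buffer.
        offset = 1
        for b in 1:256
            c = counts[b]
            counts[b] = offset
            offset += c
        end
        # Stable scatter into the output buffer.
        for x in src
            b = ((x >> shift) & 0xff) + 1
            dst[counts[b]] = x
            counts[b] += 1
        end
        src, dst = dst, src
    end
    return src
end

codes  = intern_codes(["b", "a", "b", "c", "a"])
sorted = radixsort_u64(codes)   # equal codes are now adjacent, i.e. grouped
```

Sorting the codes places equal strings next to each other without ever comparing string contents, which is where the speedup for string grouping would come from.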

Can't wait to release my updated benchmarks on Julia v1, but before that there is Advent of Code, work, setting up a personal website, disk.frame ... I wish I could land a full-time job to focus on the data ecosystem in Julia. Willing to take a pay cut :)


#28

I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial to DataFrames 0.15.2. You will see a bunch of deprecation warnings there (not that many, but still a bit disturbing).
They will go away in DataFrames 0.16, but I released it now to highlight the upcoming changes in the underlying mechanics (mainly how getindex, view and eachcol work).