Release announcements for DataFrames.jl

There is also this issue about adding a fast path that checks whether the data frame is already sorted (strictly, what we need is contiguous blocks of equal keys, not necessarily sorted order, but that is harder to detect). This is a common case, and https://github.com/h2oai/db-benchmark has specific tests for it.

1 Like

I have updated the DataFrames package to version 0.15. When I run the following code in Juno/Jupyter, it throws warnings like the ones below.

Can someone explain this to me and tell me how to correct the code?

using MLDataUtils  # MLDataPattern
@views begin
    X, y = undersample((X_train, y_train), shuffle=false)
    # X_resamp, y_resamp = oversample((X_train, y_train), fraction=0.2)
    X = convert(Array{Float64}, X)
    y = convert(Array{Int64}, y)
    X_tst = convert(Array{Float64}, X_test)
    y_tst = convert(Array{Int64}, y_test)
    X_train0 = convert(Array{Float64}, X_train)
    y_train0 = convert(Array{Int64}, y_train)
end

┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838
┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838

You do not have to correct this. The warnings will go away in DataFrames 0.16.0 which should be released relatively soon.

The reason is that we are introducing an optimization: when you access a column of a SubDataFrame, it is no longer copied. A view is returned instead, which should be more efficient (especially for large data sets).

However, this change is breaking: for example, if you mutate what you got, you will also mutate the source. Actually, this is what should have happened in the past (the contract DataFrames.jl promised was not to perform a copy on column read), but the implementation was different.

In 99% of use-cases this should not matter, so you can ignore the warnings.
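
To make the aliasing concrete, here is a minimal sketch of what the new behavior means in practice. It assumes a release where the deprecation has finished (0.16 or later); the data frame and the variable names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3, 4])

# A SubDataFrame is a view of rows 2:3 of df.
sdf = view(df, 2:3, :)

# Accessing a column of the SubDataFrame gives a view, not a copy ...
col = sdf.x

# ... so writing through it also changes the parent data frame.
col[1] = 100

# An explicit copy is independent of the parent.
safe = copy(sdf.x)
safe[2] = -1
```

If you need to mutate a column taken from a SubDataFrame without touching the source, taking an explicit copy as in the last lines is the simplest way out.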

1 Like

Thanks. One more question: using first(df, n) instead of head(df, 6) gives this warning.

first(data,6)

┌ Warning: In the future eachcol will have names argument set to false by default
│   caller = getmaxwidths(::DataFrame, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Symbol) at show.jl:105
└ @ DataFrames C:\Users\chatura\.julia\packages\DataFrames\5Rg4Y\src\abstractdataframe\show.jl:105

This deprecation is already fixed in 0.15.1, as the fix was non-breaking, so please update to the latest release of the DataFrames.jl package.
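
For reference, a small sketch of the first/last calls that replace head/tail. The data frame here is hypothetical; the warning in the question came from the display code in 0.15.0, not from the call itself.

```julia
using DataFrames

# Hypothetical data, just to have something to slice.
data = DataFrame(id = collect(1:10))

top = first(data, 6)     # the first 6 rows, as a new data frame
bottom = last(data, 3)   # the last 3 rows
```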

Thanks for your quick responses.

A lot of these ideas are already implemented in FastGroupBy.jl; I just need to update it for Julia v1 and we can start benchmarking! I think InternedString.jl will yield HUGE improvements for string grouping, because once the strings are interned, we can group by performing a radix sort on the pointers (which are just UInt64), which is super fast! In fact, I have created a faster radix sort than data.table’s in SortingLab.jl; again, I need to update it to Julia v1!
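
A rough sketch of the idea, not the FastGroupBy.jl implementation: replace each distinct string with an integer and sort the integers instead of the strings. Here a Dict stands in for interning (where equal strings would share one pointer) and Base's sortperm stands in for a radix sort; the function name is made up.

```julia
# Map every distinct string to a small integer code, then sort the codes.
# Rows with equal strings end up next to each other in `perm`.
function group_by_codes(strs)
    pool = Dict{String,Int}()
    # get! returns the existing code, or stores and returns a new one.
    codes = [get!(pool, s, length(pool) + 1) for s in strs]
    # Sorting integers avoids string comparisons; sortperm is stable by default.
    perm = sortperm(codes)
    return codes, perm
end

codes, perm = group_by_codes(["b", "a", "b", "c", "a"])
```

Codes are assigned in order of first appearance, so the groups come out in that order rather than alphabetically, which is fine for grouping.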

Can’t wait to release my updated benchmarks on Julia v1, but before that there is Advent of Code, work, setting up a personal website, disk.frame … I wish I could land a full-time job to focus on the data ecosystem in Julia. Willing to take a pay cut :slight_smile:.

5 Likes

I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial to DataFrames 0.15.2. You will see a bunch of deprecation warnings there (not that many, but still a bit disturbing).
They will go away in DataFrames 0.16, but I released it now to highlight the upcoming changes in the underlying mechanics (mainly how getindex, view and eachcol work).

7 Likes

DataFrames.jl 0.16 has been released. You can find the list of changes in the release notes: Release Version 0.16.0 · JuliaData/DataFrames.jl · GitHub. The key change is the end of the deprecation period for the getindex and view methods.

The tutorial at https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated.

9 Likes

DataFrames.jl 0.17 has been released (and if you are interested in these updates, please subscribe to this thread for the future).

The release notes give the details of the changes. The most important changes are:

  • Improved performance of split-apply-combine functions (:+1: for @nalimilan).
  • view is now fully functional and fast in all cases (we always remember the parent DataFrame without creating copies).
  • Improved display of data-frame-related types.
  • Added push! for DataFrameRow.
  • DataFrame constructor accepts SubDataFrame and DataFrameRow objects.
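
A short sketch of the last two items, with made-up data: building a DataFrame from a SubDataFrame or a DataFrameRow, and pushing a DataFrameRow onto a DataFrame.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

sdf = view(df, 2:3, :)   # SubDataFrame: rows 2 and 3 of df
row = df[1, :]           # DataFrameRow: row 1 of df

from_view = DataFrame(sdf)   # materializes the view as a 2-row DataFrame
from_row = DataFrame(row)    # a 1-row DataFrame

# push! accepts a DataFrameRow as the new row.
push!(from_view, row)
```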

My tutorial https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to include the new functionality.

17 Likes

A small but relevant update. With this version of DataFrames.jl we also re-export a new version of Missings.jl, which introduces the passmissing function. We would welcome feedback on this functionality (as it might eventually be considered for inclusion in Julia Base).

The passmissing function wraps any Julia function f (typically missing-“unaware”) that accepts positional arguments, so that if any of them is missing it returns missing, and otherwise it calls f with the passed arguments. Here is an example showing the difference:

julia> string(missing, " ", missing)
"missing missing"

julia> passmissing(string)(missing, " ", missing)
missing
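
A typical use is wrapping a function that would throw on missing and broadcasting it over a column. A minimal sketch, with a made-up input vector and helper name:

```julia
using DataFrames  # re-exports passmissing from Missings.jl, as described above

raw = ["1", missing, "3"]

# parse(Int, missing) would throw a MethodError;
# the wrapped version returns missing instead.
to_int = passmissing(x -> parse(Int, x))

parsed = to_int.(raw)
```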
7 Likes

We have released DataFrames.jl version 0.17.1. Here you can read the release notes.

Apart from polishing corner cases, the thing that a typical user might be interested in is that two convenience functions for working with GroupedDataFrame were introduced:

  • groupvars: gives you a vector of names of columns in the parent data frame used for grouping
  • groupindices: gives you a vector of row group indices in the parent data frame
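
A small sketch of groupindices on made-up data. groupvars is only mentioned in a comment, because as far as I recall later releases renamed it, so a call under that name may not run on a current installation.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a"], v = [10, 20, 30])
gd = groupby(df, :g)

# For every row of the parent data frame, the number of the group it fell into.
idx = groupindices(gd)

# In the 0.17.1 release described here, groupvars(gd) would give the
# grouping column names (here just :g).
ngroups = length(gd)
```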

https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to reflect the changes in the release.

6 Likes

We have released DataFrames.jl version 0.18.0. Here you can read the release notes.

This is a big release. The main highlights are:

  • functions that create a new DataFrame now copy the passed columns by default; you can override this with the copycols keyword argument or by using the DataFrame! function, which does not copy the passed columns. In other words, we take a safety-first approach and you have to opt in for speed when creating a new DataFrame. We believe this is a good choice, as even experienced users of the package were getting caught by the old behavior (and it is especially aimed at making usage less error-prone for entry-level users);
  • to simplify working with data frames after the change in default DataFrame construction, the append!, push! and vcat methods were significantly improved to make workflows that use them smoother;
  • following requests, we have thinned the dependency list of DataFrames.jl, which should improve its loading time (the change is mildly breaking, so if anyone experiences serious problems because of it, please report an issue in DataFrames.jl);
  • there is a long list of finished deprecation periods and of new deprecation messages; the reason is that we want to clean up the package before the 1.0 release (the list of deprecated features has grown quite long over the years, and we will continue this process in the coming releases, with the goal of having no deprecated functionality in the 1.0 release);
  • there are significant improvements in various show method implementations;
  • optimized methods for PooledArrays in split-apply-combine were added.
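
A minimal sketch of the first point, using the copycols keyword (the vector and variable names are made up). DataFrame! is left out of the code because I am not sure it is still available in current releases.

```julia
using DataFrames

v = [1, 2, 3]

safe = DataFrame(x = v)                     # default: the column is a copy of v
fast = DataFrame(x = v, copycols = false)   # opt in: the column is v itself

# Mutating the source vector only shows up in the non-copying data frame.
v[1] = 100
```

After the last line safe.x still starts with 1, while fast.x starts with 100, which is exactly the kind of surprise the new default is meant to prevent.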

I will update Julia-DataFrames-Tutorial soon (and add a link here when it is updated).

In the coming release the next big thing (hopefully) will be making setindex! and broadcasting behavior of data frames consistent with Julia 1.0 standards.

Update

I have updated the tutorial here.
There were some minor rough edges reported (here and here) so we will soon make a patch release.

20 Likes

A patch release 0.18.1 of DataFrames.jl has just been registered. The release makes the DataFrame constructor behave correctly when passed some objects that meet the Tables.jl interface (most importantly database connections). Thanks to @cadoubs and @quinnj for helping with the issue.

3 Likes

Thank you for your fast analysis and patch. I have validated the update in my application with MySQL.jl.
You did a very nice job.

1 Like

I have just stumbled on an old question on StackOverflow that lacked a good answer for over two years. It is related to a conversion from JSON to a columnar data structure (looking at the number of stars on the question I guess it is a pattern that is needed quite often).

Thanks to the new vcat functionality proposed by pdeffebach and @oxinabox in DataFrames 0.18 we can do such things cleanly.
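
This is not the StackOverflow answer itself, just a sketch of the vcat feature it relies on: with cols = :union, data frames with different columns can be stacked and the gaps are filled with missing. The two one-row data frames below are invented stand-ins for parsed JSON records.

```julia
using DataFrames

# Two "records" with partly different fields.
df1 = DataFrame(id = [1], name = ["Ann"])
df2 = DataFrame(id = [2], city = ["Oslo"])

# Take the union of all columns; fields a record lacks become missing.
combined = vcat(df1, df2, cols = :union)
```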

7 Likes

DataFrames.jl v0.18.2 patch release was just merged. Here you can read about the changes it introduces.

1 Like

DataFrames.jl v0.19.0 has just been released. It is a major release on the way to DataFrames.jl 1.0 (we cannot get there yet, as we have to go through a deprecation cycle).

The number of changes is significant and includes:

API changes:

  • allow Regex indexing of columns
  • allow Not from InvertedIndices.jl indexing of rows and columns
  • add ! indexing of rows of AbstractDataFrame
  • deprecate indexing with column or columns only (like df[:a] or df[1:2] )
  • define target rules for getindex, getproperty, setindex!, and setproperty! for AbstractDataFrame and DataFrameRow (in this release the old behavior is deprecated; in the next release it will get replaced by the target functionality)
  • add indexing using CartesianIndex{2} for AbstractDataFrame
  • full support of broadcasting for AbstractDataFrame
  • support for broadcasting assignment for DataFrameRow
  • keys(::DataFrameRow) now returns a Tuple of column names
  • added get and map methods for DataFrameRow
  • categorical! now accepts columns that contain missing values
  • get and haskey for AbstractDataFrame are deprecated now
  • empty! for DataFrame is deprecated now
  • add hasproperty for AbstractDataFrame
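
A few of the smaller API additions from the list above, sketched on a made-up data frame (the Regex and Not indexing are shown in the session further down):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# hasproperty checks for a column without throwing.
has_a = hasproperty(df, :a)
has_c = hasproperty(df, :c)

# A CartesianIndex{2} picks a single cell, like df[2, 2].
cell = df[CartesianIndex(2, 2)]

# get on a DataFrameRow returns a default for a column that does not exist.
row = df[1, :]
present = get(row, :a, 0)
absent = get(row, :c, 0)
```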

Fixes:

  • improved showing of DataFrameRow with zero columns
  • fix combine with aggregation when skipmissing=true

Minor changes:

  • improvements in error messages and types of thrown exceptions on error
  • various documentation improvements
  • improved getindex speed for vector of Bool indexing
  • remove InteractiveUtils.jl dependency

The major change is the new indexing rules and full support for broadcasting. Here are the details. In general, the design had to balance a tension between ease of use, flexibility, safety and consistency.

Here are the major highlights:

  • you can use Not and Regex for column indexing
  • df[col] is now df[!, col] and gets/replaces a column in a data frame “as is”
  • df[:, col] will always get a copy of a column/set a column in place
  • df[cols] is now df[!, cols] and gets a new data frame without copying of columns
  • df[:, cols] gets a new data frame with copying of columns
  • df.col is the same as df[!, col], for consistency with Base, indicating that it gives you “as is” access to the property of the data frame (i.e. it gives you the column without copying and replaces the column)
  • data frames can take part in broadcasting
  • You can perform broadcasting assignment to AbstractDataFrame and DataFrameRow; as a special rule: using df[!, col] syntax you can create a new column/replace old one using broadcasting (something which is non standard in regular broadcasting which is always in-place).

In summary, ! indicates “an unsafe” operation. The reason is that people were often caught out by getting columns of a data frame, mutating them (e.g. resizing or sorting), and as a consequence corrupting the source data frame. We hope that ! will now serve as a warning that this is not a safe operation (as opposed to : indexing, which always makes a copy).

Here are the new rules at work:

julia> df = DataFrame(x1=1:3, x2=2:4, y='a':'c')
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 1     │ 2     │ 'a'  │
│ 2   │ 2     │ 3     │ 'b'  │
│ 3   │ 3     │ 4     │ 'c'  │

julia> select(df, r"x")
3×2 DataFrame
│ Row │ x1    │ x2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 2     │ 3     │
│ 3   │ 3     │ 4     │

julia> select(df, Not(r"x"))
3×1 DataFrame
│ Row │ y    │
│     │ Char │
├─────┼──────┤
│ 1   │ 'a'  │
│ 2   │ 'b'  │
│ 3   │ 'c'  │

julia> df[Not(1), Not(1)]
2×2 DataFrame
│ Row │ x2    │ y    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 3     │ 'b'  │
│ 2   │ 4     │ 'c'  │

julia> df .+ 1
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 2     │ 3     │ 'b'  │
│ 2   │ 3     │ 4     │ 'c'  │
│ 3   │ 4     │ 5     │ 'd'  │

julia> df .+= ones(Int, size(df))
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 2     │ 3     │ 'b'  │
│ 2   │ 3     │ 4     │ 'c'  │
│ 3   │ 4     │ 5     │ 'd'  │

julia> df[!, :z] .= 1
3-element Array{Int64,1}:
 1
 1
 1

julia> df
3×4 DataFrame
│ Row │ x1    │ x2    │ y    │ z     │
│     │ Int64 │ Int64 │ Char │ Int64 │
├─────┼───────┼───────┼──────┼───────┤
│ 1   │ 2     │ 3     │ 'b'  │ 1     │
│ 2   │ 3     │ 4     │ 'c'  │ 1     │
│ 3   │ 4     │ 5     │ 'd'  │ 1     │
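
The session does not show the copy-versus-alias difference between ! and : directly, so here is a small additional sketch on a made-up data frame (variable names are illustrative only):

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2])

aliased = df[!, :x]   # the vector stored in df, no copy
copied = df[:, :x]    # a fresh copy of the column

# Sorting the copy leaves the data frame alone ...
sort!(copied)
unchanged = df.x == [3, 1, 2]

# ... while sorting the aliased vector reorders the column inside df.
sort!(aliased)
```

With more than one column this second sort! would silently break the row alignment of the data frame, which is the kind of corruption the ! is meant to warn about.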
28 Likes

This is great, great stuff! Is there anywhere you have listed the goals for 1.0, or roughly what is going to be materially different from the structure of a DataFrame now?

https://github.com/JuliaData/DataFrames.jl/issues/1678

2 Likes