Release announcements for DataFrames.jl

DataFrames.jl v0.19.0 has just been released. It is a major release on the way to DataFrames.jl 1.0 (we cannot get there yet, as we first have to go through a deprecation cycle).

The number of changes is significant and includes:

API changes:

  • allow Regex indexing of columns
  • allow Not from InvertedIndices.jl indexing of rows and columns
  • add ! indexing of rows of AbstractDataFrame
  • deprecate indexing with column or columns only (like df[:a] or df[1:2])
  • define target rules for getindex, getproperty, setindex!, and setproperty! for AbstractDataFrame and DataFrameRow (in this release the old behavior is deprecated; in the next release it will be replaced by the target functionality)
  • add indexing using CartesianIndex{2} for AbstractDataFrame
  • full support of broadcasting for AbstractDataFrame
  • support for broadcasting assignment for DataFrameRow
  • keys(::DataFrameRow) now returns a Tuple of column names
  • added get and map methods for DataFrameRow
  • categorical! now accepts columns that contain missing values
  • get and haskey for AbstractDataFrame are deprecated now
  • empty! for DataFrame is deprecated now
  • add hasproperty for AbstractDataFrame
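
A few of the smaller additions listed above can be tried out as follows. This is only a sketch, assuming a DataFrames.jl version in which these methods are available; the data frame and its column names are made up for illustration.

```julia
using DataFrames

# Illustrative data; the column names are hypothetical.
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# hasproperty checks for a column by name (haskey is deprecated for this).
has_a = hasproperty(df, :a)
has_c = hasproperty(df, :c)

# A CartesianIndex{2} addresses a single cell as (row, column).
cell = df[CartesianIndex(2, 1)]

# get on a DataFrameRow falls back to a default when the column is absent.
row = df[1, :]
b_value = get(row, :b, "none")
c_value = get(row, :c, "none")
```

Here cell is the value in row 2 of the first column, and c_value is the default because there is no column c.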

Fixes:

  • improved showing of DataFrameRow with zero columns
  • fix combine with aggregation when skipmissing=true

Minor changes:

  • improvements in error messages and types of thrown exceptions on error
  • various documentation improvements
  • improved getindex speed for vector of Bool indexing
  • remove InteractiveUtils.jl dependency

The major changes are the new indexing rules and full support for broadcasting. Here are the details. In general, the design had to balance ease of use, flexibility, safety, and consistency.

Here are the major highlights:

  • you can use Not and Regex for column indexing
  • df[col] is now df[!, col] and gets/replaces a column in a data frame “as is”
  • df[:, col] will always get a copy of a column/set a column in place
  • df[cols] is now df[!, cols] and gets a new data frame without copying of columns
  • df[:, cols] gets a new data frame with copying of columns
  • df.col is the same as df[!, col] for consistency with Base indicating that it gives you “as is” access to the property of the data frame (i.e. it gives you the column without copying and replaces the column)
  • data frames can take part in broadcasting
  • you can perform broadcasting assignment to AbstractDataFrame and DataFrameRow; as a special rule, with the df[!, col] syntax you can create a new column or replace an old one using broadcasting (which is non-standard, since regular broadcasting assignment is always in-place).
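
The get/set rules in the list above can be checked with identity comparisons (===). The following is a minimal sketch, assuming a DataFrames.jl version in which the target rules are fully in effect (in v0.19 itself some of these forms may still print deprecation warnings); the column names and values are made up for illustration.

```julia
using DataFrames

# Illustrative data; the column names are hypothetical.
df = DataFrame(a = [3, 1, 2], b = ["x", "y", "z"])

# Getting one column: ! returns the stored vector, : returns a copy.
alias = df[!, :a]
copied = df[:, :a]
same_as_property = alias === df.a      # df.col behaves like df[!, col]
copy_is_distinct = copied !== df.a

# Getting several columns: both forms build a new data frame,
# but only : copies the underlying column vectors.
shared = df[!, [:a, :b]]
fresh = df[:, [:a, :b]]
shares_columns = shared.a === df.a
fresh_is_copy = fresh.a !== df.a

# Setting with : writes into the existing column vector in place.
original = df.a
df[:, :a] = [10, 20, 30]
updated_in_place = df.a === original

# Setting with ! stores the right-hand side vector itself, without copying.
replacement = [7, 8, 9]
df[!, :a] = replacement
stored_as_is = df.a === replacement
```

All of the boolean variables in this sketch should come out true under the rules described above.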

In summary, ! indicates “an unsafe” operation. The reason is that people were often tricked by getting columns of a data frame, mutating them (e.g. resizing or sorting), and as a consequence corrupting the source data frame. We hope that ! will now serve as a warning that this is not a safe operation (as opposed to : indexing, which always makes a copy).
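
To make the hazard concrete, here is a small sketch of the sorting case mentioned above, assuming the target indexing rules are in effect. The id and name columns are hypothetical.

```julia
using DataFrames

# Illustrative data: each id is paired with a name in the same row.
df = DataFrame(id = [3, 1, 2], name = ["c", "a", "b"])

# Safe: : indexing returns a copy, so sorting it leaves df untouched.
sorted_copy = sort!(df[:, :id])
ids_after_safe_sort = copy(df.id)

# Unsafe: ! indexing returns the stored vector, so sorting it reorders
# the id column only, and the rows no longer pair ids with their names.
sort!(df[!, :id])
ids_after_unsafe_sort = copy(df.id)
names_after_unsafe_sort = copy(df.name)
```

After the second sort! the id column reads 1, 2, 3 while the name column is still "c", "a", "b", so the data frame is silently corrupted. To sort whole rows, sort!(df, :id) is the usual choice.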

Here are the new rules at work:

julia> df = DataFrame(x1=1:3, x2=2:4, y='a':'c')
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 1     │ 2     │ 'a'  │
│ 2   │ 2     │ 3     │ 'b'  │
│ 3   │ 3     │ 4     │ 'c'  │

julia> select(df, r"x")
3×2 DataFrame
│ Row │ x1    │ x2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 2     │ 3     │
│ 3   │ 3     │ 4     │

julia> select(df, Not(r"x"))
3×1 DataFrame
│ Row │ y    │
│     │ Char │
├─────┼──────┤
│ 1   │ 'a'  │
│ 2   │ 'b'  │
│ 3   │ 'c'  │

julia> df[Not(1), Not(1)]
2×2 DataFrame
│ Row │ x2    │ y    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 3     │ 'b'  │
│ 2   │ 4     │ 'c'  │

julia> df .+ 1
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 2     │ 3     │ 'b'  │
│ 2   │ 3     │ 4     │ 'c'  │
│ 3   │ 4     │ 5     │ 'd'  │

julia> df .+= ones(Int, size(df))
3×3 DataFrame
│ Row │ x1    │ x2    │ y    │
│     │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1   │ 2     │ 3     │ 'b'  │
│ 2   │ 3     │ 4     │ 'c'  │
│ 3   │ 4     │ 5     │ 'd'  │

julia> df[!, :z] .= 1
3-element Array{Int64,1}:
 1
 1
 1

julia> df
3×4 DataFrame
│ Row │ x1    │ x2    │ y    │ z     │
│     │ Int64 │ Int64 │ Char │ Int64 │
├─────┼───────┼───────┼──────┼───────┤
│ 1   │ 2     │ 3     │ 'b'  │ 1     │
│ 2   │ 3     │ 4     │ 'c'  │ 1     │
│ 3   │ 4     │ 5     │ 'd'  │ 1     │