Release announcements for DataFrames.jl

Regarding view vs ! the difference is the following:

! allows you to mutate the bindings in the internal structure of df. Therefore when you write:

df[!, :col] = vector

you can create a column if it does not exist and replace it if it exists. The same holds for broadcasting assignment, such as:

df[!, :col] .= x

Using a view would not allow this.
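A minimal sketch of the write-side behaviour described above. The data frame and column names are made up for illustration, and it assumes a DataFrames.jl version with ! indexing (0.19 or later):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# Create a column that does not exist yet.
df[!, :b] = [10, 20, 30]

# Replace an existing column, here even with a different element type.
df[!, :a] = ["x", "y", "z"]

# Broadcasting assignment with ! allocates a new column as well.
df[!, :c] .= 0.5

# A view can only write into the vector that is already stored;
# it cannot rebind the column to a new vector.
v = @view df[:, :b]
v .= 1
```

After the last line the column b still has element type Int, because the view only overwrote the existing values in place.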

Now for the getindex side, there is an ongoing discussion about what we should do (see Make `getproperty(df, col)` return a full length view of the column · Issue #1844 · JuliaData/DataFrames.jl · GitHub). First, the current major difference is speed (this probably could be improved):

julia> @benchmark df[!, 1]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.746 ns (0.00% GC)
  median time:      15.948 ns (0.00% GC)
  mean time:        18.275 ns (0.00% GC)
  maximum time:     157.673 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark @view df[:, 1]
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     308.907 ns (0.00% GC)
  median time:      312.146 ns (0.00% GC)
  mean time:        358.267 ns (0.66% GC)
  maximum time:     5.262 μs (93.72% GC)
  --------------
  samples:          10000
  evals/sample:     247

Second, a view and raw access to a vector are not always the same thing type-wise (this may affect method dispatch downstream; it does, for example, for CategoricalArray).
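The type difference can be checked directly. This is only a sketch on a toy data frame (the column name x is an assumption for illustration):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

col = df[!, :x]        # the vector stored in the data frame
v = @view df[:, :x]    # a view wrapping that vector

is_vector = col isa Vector{Int}
is_subarray = v isa SubArray
same_parent = parent(v) === col   # the view points at the stored vector
```

A method defined only for Vector would accept col but not v, which is the kind of dispatch difference mentioned above.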

Finally - people already complained that df[!, :col] is longer to write than df[:col] and @view df[:, :col] would be even longer to write.

In summary - we needed ! for the "write" side of the getindex/setindex!/broadcasting assignment combo anyway. Then we had to define what it does for the "read" side to be consistent. As of today we have decided that it will do the same as what df[:col] did (removing an equivalent of df[:col] for reading data from a data frame would hurt performance and lead to code breakage).

Note that with what we have implemented you simply add !, to your code and you know it will work as it used to, without having to think about it. If we had switched to the @view approach, which was considered, a lot of code would have been broken. Actually, the replacement for the deprecated df[:col] would have been parent(@view df[:, col]), but we considered that this would not be acceptable. With a package like DataFrames.jl we had to consider the fact that people have 5+ years of accumulated code using it, and if we were going to be breaking (which we decided to do) the "fixing" should be easy and not noisy visually.


What about df[rows, :] vs df[rows, !] ?
Is it the same story for rows?

No - ! is only needed as a row selector.
The reason is that internally DataFrame is a collection of columns (not rows).

For column selector you can use:

  • a Symbol
  • an Integer other than Bool
  • a vector of Integers other than Bool (also with abstract eltype)
  • a vector of Bool
  • a vector of Symbol
  • a Colon
  • a Regex
  • a Not expression accepting any of the above as the argument

(which I hope is flexible enough to cover all use-cases in a user-friendly way)
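For reference, a short sketch that runs through the selectors listed above on a toy data frame (the column names a, b and ab are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6, ab = 7:9)

by_symbol  = df[:, :a]                   # a Symbol
by_integer = df[:, 2]                    # an Integer
by_ints    = df[:, [1, 3]]               # a vector of Integers
by_bools   = df[:, [true, false, true]]  # a vector of Bool
by_symbols = df[:, [:a, :b]]             # a vector of Symbol
by_colon   = df[:, :]                    # a Colon
by_regex   = df[:, r"^a"]                # a Regex: names starting with "a"
by_not     = df[:, Not(:a)]              # Not: every column except a
```

A single Symbol or Integer returns a vector, while the other selectors return a data frame.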


Overall, I think the changes are great. One thing I miss, however, is the ability to assign a scalar to a new column. The following no longer works:

using DataFrames
df = DataFrame()
v = .3
df[!,:a] = v

ERROR: MethodError: no method matching setindex!(::DataFrame, ::Float64, ::typeof(!), ::Symbol)
Closest candidates are:
  setindex!(::DataFrame, ::AbstractArray{T,1} where T, ::typeof(!), ::Union{Signed, Symbol, Unsigned}) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/dataframe/dataframe.jl:465
  setindex!(::DataFrame, ::Any, ::Colon, ::Any) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/deprecated.jl:1595
  setindex!(::DataFrame, ::Any, ::Integer, ::Union{Signed, Symbol, Unsigned}) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/dataframe/dataframe.jl:474
  ...
Stacktrace:
 [1] top-level scope at none:0

The solution is to wrap v in an array like so: [v]. The downside is that there is no elegant way to use the same code when v could be a scalar or vector. Is there a reason that this is no longer possible?


You can broadcast to construct a column of the scalar once the DataFrame has something in it.

julia> v = .3
julia> df[!, :b] = rand(100);
julia> df[!, :a] .= v
100-element Array{Float64,1}:
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 ⋮
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3

Or did you actually want the DataFrame to hold a single scalar, instead of an array?

Yeah, the latter. In previous versions, it would promote a scalar to an array and create a new column, assuming the dataframe is empty or has only one row. Maybe DataFrames functions more consistently without this behavior.

There is already a PR for an empty data frame case here https://github.com/JuliaData/DataFrames.jl/pull/1890.

In consequence writing:

df = DataFrame()
df[!, :col] .= 1

will be possible but will create a 0-element vector. The reasoning behind it is that df has 0-rows so adding a column to it should keep it this way.

Do you think a data frame with zero columns should have a different behavior than an empty data frame with at least one column here?

Also - is this case really a significant one? (I am aware that it might be popular when learning the package, but does it happen in real cases?)

Also note that you can do the following push!(df, (a=1,)), where df was created using DataFrame().


Just for a reference. You can assign to a new column in general like:

df = DataFrame(a=[1,2,3])
df[!, :b] .= 1

works OK. The only problematic cases were zero column data frame and an empty data frame with more than zero columns (they are currently handled in the same way in the PR I have mentioned).

Thanks for the explanation. In your example, my naive expectation would be that df[!,:col] .= 1 creates a column of Int64[] called col with a single element of 1:

 df
  1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

It is somewhat of an edge case. Although I would expect it to create a new column and assign 1, that may not be consistent with the current behavior, where it would create a column of size R where R is the number of rows. For example:

using DataFrames
df = DataFrame()
R = 10
df[!,:a] = rand( R)
df[!,:b] .= .4

df
  10×2 DataFrame
│ Row │ a        │ b       │
│     │ Float64  │ Float64 │
├─────┼──────────┼─────────┤
│ 1   │ 0.859254 │ 0.4     │
│ 2   │ 0.931467 │ 0.4     │
│ 3   │ 0.276189 │ 0.4     │
│ 4   │ 0.695778 │ 0.4     │
│ 5   │ 0.436747 │ 0.4     │
│ 6   │ 0.478671 │ 0.4     │
│ 7   │ 0.220407 │ 0.4     │
│ 8   │ 0.778795 │ 0.4     │
│ 9   │ 0.614729 │ 0.4     │
│ 10  │ 0.825508 │ 0.4     │

I'm not entirely certain what should be done when R = 0 or R is undefined because the dataframe is empty. On one hand, it seems odd to create an empty column when a specific value is being supplied, but assigning the value does not seem entirely consistent with the behavior above either. I guess, between the two options, I might choose to have it create the column and assign the value and use something like df[!,:a] .= Int64[ ] to create an empty column. Of course, that is my somewhat under-informed assessment, and I have not thought about all of the ramifications.

The current behavior is slightly inconvenient because I have to wrap scalars in an array, at least when creating the initial column, e.g. df = DataFrame(); df[!,:a] = [.3]. After that, as you pointed out in your second comment to me, it is possible to assign values subsequently. That's not a big deal if I always know where the first column will be created. What is more problematic is if v could be a scalar or vector, and the DataFrame is empty. As far as I know, this would require two sets of syntax to handle both cases, whereas before, it handled the assignment of both scalars and vectors.
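One possible workaround for the scalar-or-vector case is to let dispatch do the wrapping. This is only a sketch: setcolumn! is a hypothetical helper defined here, not a DataFrames.jl function, and it is meant for the first column of an empty data frame.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl.
# A vector is stored as the column itself.
setcolumn!(df, name, v::AbstractVector) = (df[!, name] = v; df)
# Anything else is treated as a scalar and wrapped in a 1-element vector.
setcolumn!(df, name, v) = (df[!, name] = [v]; df)

df_scalar = setcolumn!(DataFrame(), :a, 0.3)
df_vector = setcolumn!(DataFrame(), :a, [0.1, 0.2])
```

Once the data frame already has rows, the scalar method would only fit a one-row data frame; in that situation the broadcasting form df[!, :a] .= v shown earlier is the better fit.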

I agree with you. I'm not sure how often this problem arises in actual code. I have encountered it from time to time in my work. This occurs when I need to extract data from various sources and integrate it with some container iteratively and flexibly (i.e. I may not know what the columns will be or what their sizes will ultimately be).

P.S. I couldn't adapt your example with push to my problem:

Case 1: scalar

df = DataFrame()
a = 1
push!(df,(a=a,))
 df
  1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

Case 2: vector

df = DataFrame()
a= [1,2]
push!(df,(a=a,))
 df
  1×1 DataFrame
│ Row │ a      │
│     │ Array… │
├─────┼────────┤
│ 1   │ [1, 2] │

I tried various placements of the "..." operator, but it would not accommodate both cases.

We could treat DataFrame() as having undefined number of rows (as opposed to 0 rows).
Thinking about it we then should also allow:

df = DataFrame()
df[!, :c] .= [1,2,3]

(and create a 3 element vector - now it throws an error; I will update the PR I have mentioned above and we can discuss it further there)

PS. push! is only for scalars


Thanks for your time. I found the discussion helpful for understanding DataFrames.

I have implemented the requested changes in https://github.com/JuliaData/DataFrames.jl/pull/1890. The functionality is tricky so if someone can have a look at it it would be appreciated.


Thank you. I will take a look.

Based on the feedback after 0.19.0 release we have a patch 0.19.1 release that introduces the following changes:

  • we drop StatsBase.jl dependency, which improves DataFrames.jl load time
  • push! and append! now make sure they do not produce an output data frame with unequal number of rows in columns (this was a common problem causing hard-to-catch bugs)
  • join, groupby and show-related functions now check if the data frame they work on is not internally corrupted (most often the problem is unequal number of rows in columns); this allows users to track possible column aliasing related bugs in their code easier
  • broadcasting now allows broadcasting into 0-row (empty) data frame and correctly handles broadcasting into a single cell of a data frame.

Thanks to all who reported problems and contributed!

You can see the detailed release notes here.


We have just released a 0.19.2 patch release. The full log of changes can be found here.

The most relevant changes are:

  • disallowmissing, allowmissing and categorical functions were added
  • unstack now accepts the renamecols keyword argument, providing a flexible mechanism to generate column names of an unstacked data frame
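As a rough illustration of the first item, here is a sketch of allowmissing and disallowmissing on a toy data frame (the column name a is made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# Returns a new data frame whose columns accept missing values.
df_allow = allowmissing(df)
eltype_allow = eltype(df_allow[!, :a])

# Converts back; this works here because no value is actually missing.
df_strict = disallowmissing(df_allow)
eltype_strict = eltype(df_strict[!, :a])
```

Both functions return a new data frame and leave df unchanged.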

I found some weird syntax in DataFrames 0.19.2… Am I doing something wrong (i.e. "is there a better way to do it?") or is it supposed to be like this?

Let's assume the following df:

df = DataFrame(a=[1,2,3],b=[4,5,6],c=[7,8,9])

# Change column order
df = df[[:b, :a, :c]]    # before
df = df[:,[:b, :a, :c]]  # now

# Operate on a column whose name is memorised in a variable
C = :c; eltype(df[C])   # before 
C = :c; eltype(df[!,C]) # now

# Delete cols
deletecols!(df,:b)   # before
select!(df,Not(:b))  # now

What does the new ! symbol stand for? How is df[!,:a] different from df[:,:a]?

! roughly stands for "a direct access to a column stored in a data frame", as opposed to : which stands for "access to a copy of a column stored in a data frame".

The detailed rules are laid out here.

The reason for this change is described above in this thread (around post 51), but in short, direct access to a column led to many nasty bugs when using the DataFrames.jl package (people accessed the column, then mutated it, and were not aware that they had also mutated the contents of the source data frame). The ! symbol adds a bit of verbosity, but is meant to warn that the operation is potentially mutating (so this is similar to the meaning of ! in function names in Base).

Note that instead of df[!, :col] you can just write df.col.
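A small sketch of the difference, assuming a DataFrames.jl version in which df[:, col] returns a copy as described above (the data is made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

direct = df[!, :a]   # the vector stored in df, no copy
copied = df[:, :a]   # an independent copy

same_as_property = direct === df.a   # df.a is the same direct access

copied[1] = 100      # changes only the copy
direct[1] = -1       # changes the data frame itself
```

After these two assignments df.a starts with -1, while the 100 exists only in copied.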

Also note that:

  • to change the order of columns you can use the permutecols! function which works in place; if you want to avoid copying of columns and create a new data frame use df[!, [:b, :a, :c]] (this is potentially unsafe though - as explained above)
  • deletecols! was deprecated because select! just does the job with Not indexing, so there would be a duplication of functionality (and select-family of functions is something that many people already know from other data processing ecosystems).
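For the in-place case, a sketch using only select! (which also accepts a list of columns, so it can reorder as well as drop); the data frame is the one from the question:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9])

# Reorder the columns in place.
select!(df, [:b, :a, :c])
first_after_reorder = df[!, 1]

# Drop column b in place, keeping everything else.
select!(df, Not(:b))
```

This leaves df with the columns a and c, in that order.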

I am writing to say that in my opinion the new syntax is nice. The ! symbol inside the selection is a bit strange at first, but after you get used to it, it makes a lot of sense. The other changes are also good in my opinion.


I have a small announcement. The DataFrames Tutorial has been in sync with the DataFrames.jl 0.19 release for some time. I have now also added examples of integration with JSONTables.jl to the 04_loadsave.ipynb notebook.

@quinnj has done a fantastic job with JSONTables.jl and with the release 0.1.2 of this package you can safely read and write JSON data (both row- and column- oriented) to a DataFrame.
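A minimal round-trip sketch, assuming the JSONTables.jl functions jsontable, objecttable and arraytable behave as documented in that package's README (the data is made up, and this is not taken from the tutorial notebook):

```julia
using DataFrames, JSONTables

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# Column-oriented JSON text: one array per column.
col_json = JSONTables.objecttable(df)

# Row-oriented JSON text: one object per row.
row_json = JSONTables.arraytable(df)

# Read either form back into a data frame.
df_from_cols = DataFrame(JSONTables.jsontable(col_json))
df_from_rows = DataFrame(JSONTables.jsontable(row_json))
```

The calls are written with the module prefix so the sketch does not depend on which names the package exports.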
