Release announcements for DataFrames.jl

Overall, I think the changes are great. One thing I miss, however, is the ability to assign a scalar to a new column. The following no longer works:

using DataFrames
df = DataFrame()
v = .3
df[!,:a] = v

ERROR: MethodError: no method matching setindex!(::DataFrame, ::Float64, ::typeof(!), ::Symbol)
Closest candidates are:
  setindex!(::DataFrame, ::AbstractArray{T,1} where T, ::typeof(!), ::Union{Signed, Symbol, Unsigned}) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/dataframe/dataframe.jl:465
  setindex!(::DataFrame, ::Any, ::Colon, ::Any) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/deprecated.jl:1595
  setindex!(::DataFrame, ::Any, ::Integer, ::Union{Signed, Symbol, Unsigned}) at /home/dfish/.julia/packages/DataFrames/GoFnP/src/dataframe/dataframe.jl:474
  ...
Stacktrace:
 [1] top-level scope at none:0

The solution is to wrap v in an array like so: [v]. The downside is that there is no elegant way to use the same code when v could be a scalar or vector. Is there a reason that this is no longer possible?

You can broadcast to construct a column of the scalar once the DataFrame has something in it.

julia> v = .3
julia> df[!, :b] = rand(100);
julia> df[!, :a] .= v
100-element Array{Float64,1}:
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 ⋮
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3
 0.3

Or did you actually want the DataFrame to hold a single scalar, instead of an array?

Yeah, the latter. In previous versions, it would promote a scalar to an array and create a new column, assuming the dataframe is empty or has only one row. Maybe DataFrames functions more consistently without this behavior.

There is already a PR for the empty data frame case here: https://github.com/JuliaData/DataFrames.jl/pull/1890.

As a consequence, writing:

df = DataFrame()
df[!, :col] .= 1

will be possible, but it will create a 0-element vector. The reasoning behind this is that df has 0 rows, so adding a column to it should keep it that way.

Do you think a data frame with zero columns should have a different behavior than an empty data frame with at least one column here?

Also, is this case really a significant one? (I am aware that it might be common when learning the package, but does it happen in real cases?)

Also note that you can do the following push!(df, (a=1,)), where df was created using DataFrame().

Just for reference: in general, assigning to a new column like this:

df = DataFrame(a=[1,2,3])
df[!, :b] .= 1

works OK. The only problematic cases were a zero-column data frame and an empty data frame with more than zero columns (they are currently handled in the same way in the PR I have mentioned).

Thanks for the explanation. In your example, my naive expectation would be that df[!,:col] .= 1 creates an Int64 column called col with a single element of 1:

 df
  1×1 DataFrame
│ Row │ col   │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

It is somewhat of an edge case. Although I would expect it to create a new column and assign 1, that may not be consistent with the current behavior, where it would create a column of size R where R is the number of rows. For example:

using DataFrames
df = DataFrame()
R = 10
df[!,:a] = rand(R)
df[!,:b] .= .4

df
  10×2 DataFrame
│ Row │ a        │ b       │
│     │ Float64  │ Float64 │
├─────┼──────────┼─────────┤
│ 1   │ 0.859254 │ 0.4     │
│ 2   │ 0.931467 │ 0.4     │
│ 3   │ 0.276189 │ 0.4     │
│ 4   │ 0.695778 │ 0.4     │
│ 5   │ 0.436747 │ 0.4     │
│ 6   │ 0.478671 │ 0.4     │
│ 7   │ 0.220407 │ 0.4     │
│ 8   │ 0.778795 │ 0.4     │
│ 9   │ 0.614729 │ 0.4     │
│ 10  │ 0.825508 │ 0.4     │

I'm not entirely certain what should be done when R = 0 or R is undefined because the data frame is empty. On one hand, it seems odd to create an empty column when a specific value is being supplied, but assigning the value does not seem entirely consistent with the behavior above either. I guess, between the two options, I might choose to have it create the column and assign the value, and use something like df[!,:a] .= Int64[] to create an empty column. Of course, that is my somewhat under-informed assessment, and I have not thought about all of the ramifications.

The current behavior is slightly inconvenient because I have to wrap scalars in an array, at least when creating the initial column, e.g. df = DataFrame(); df[!,:a] = [.3]. After that, as you pointed out in your second comment to me, it is possible to assign values subsequently. That's not a big deal if I always know where the first column will be created. What is more problematic is if v could be a scalar or a vector and the DataFrame is empty. As far as I know, this would require two sets of syntax to handle both cases, whereas before, the same assignment handled both scalars and vectors.
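One possible way to keep a single code path is to normalize v before the assignment. The sketch below assumes a small helper, here called ascolumn, which is hypothetical and not part of DataFrames.jl; it passes vectors through unchanged and wraps anything else in a one-element vector.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function):
# vectors pass through unchanged, any other value is wrapped in a 1-element vector.
ascolumn(v::AbstractVector) = v
ascolumn(v) = [v]

# Scalar case: the new column has one row.
df1 = DataFrame()
df1[!, :a] = ascolumn(0.3)

# Vector case: the new column has as many rows as the vector.
df2 = DataFrame()
df2[!, :a] = ascolumn([0.1, 0.2])
```

The same line, df[!, :a] = ascolumn(v), then covers both cases, at the cost of deciding in user code what counts as a scalar.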

I agree with you. I'm not sure how often this problem arises in actual code. I have encountered it from time to time in my work. This occurs when I need to extract data from various sources and integrate it with some container iteratively and flexibly (i.e. I may not know what the columns will be or what their sizes will ultimately be).

P.S. I couldn't adapt your example with push! to my problem:

Case 1: scalar

df = DataFrame()
a = 1
push!(df,(a=a,))
 df
  1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │

Case 2: vector

df = DataFrame()
a = [1,2]
push!(df,(a=a,))
 df
  1×1 DataFrame
│ Row │ a      │
│     │ Array… │
├─────┼────────┤
│ 1   │ [1, 2] │

I tried various placements of the "..." operator, but it would not accommodate both cases.

We could treat DataFrame() as having an undefined number of rows (as opposed to 0 rows).
Thinking about it, we should then also allow:

df = DataFrame()
df[!, :c] .= [1,2,3]

(and create a 3-element vector; currently this throws an error. I will update the PR I have mentioned above and we can discuss it further there.)

P.S. push! adds a single row, so it is only suitable for scalars.

Thanks for your time. I found the discussion helpful for understanding DataFrames.

I have implemented the requested changes in https://github.com/JuliaData/DataFrames.jl/pull/1890. The functionality is tricky, so I would appreciate it if someone could have a look at it.

Thank you. I will take a look.

Based on the feedback after the 0.19.0 release, we have made a 0.19.1 patch release that introduces the following changes:

  • we drop the StatsBase.jl dependency, which improves DataFrames.jl load time
  • push! and append! now make sure they do not produce an output data frame whose columns have unequal numbers of rows (this was a common problem causing hard-to-catch bugs)
  • join, groupby and show-related functions now check that the data frame they work on is not internally corrupted (most often the problem is an unequal number of rows in columns); this allows users to track down possible bugs related to column aliasing in their code more easily
  • broadcasting now allows broadcasting into a 0-row (empty) data frame and correctly handles broadcasting into a single cell of a data frame.

Thanks to all who reported problems and contributed!

You can see the detailed release notes here.

We have just published the 0.19.2 patch release. The full log of changes can be found here.

The most relevant changes are:

  • disallowmissing, allowmissing and categorical functions were added
  • unstack now accepts a renamecols keyword argument, providing a flexible mechanism to generate the column names of an unstacked data frame
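As a rough illustration of the first item, the sketch below shows how the non-mutating allowmissing and disallowmissing functions can be used; the data frame and column name are made up for the example.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

# allowmissing returns a new data frame whose columns can also hold `missing`;
# df itself is left unchanged.
dfm = allowmissing(df)
dfm[2, :a] = missing

# disallowmissing converts back, but it throws if a column still contains
# `missing`, so the gap is filled first.
dfm[2, :a] = 0
strict = disallowmissing(dfm)
```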

I found some weird syntax in DataFrames.jl 0.19.2... Am I doing something wrong (i.e. is there a better way to do it), or is it supposed to be like this?

Let's assume the following df:

df = DataFrame(a=[1,2,3],b=[4,5,6],c=[7,8,9])

# Change column order
df = df[[:b, :a, :c]]    # before
df = df[:,[:b, :a, :c]]  # now

# Operate on a column whose name is memorised in a variable
C = :c; eltype(df[C])   # before 
C = :c; eltype(df[!,C]) # now

# Delete cols
deletecols!(df,:b)   # before
select!(df,Not(:b))  # now

What does the new ! symbol stand for? How is df[!,:a] different from df[:,:a]?

! roughly stands for "direct access to a column stored in a data frame", as opposed to :, which stands for "access to a copy of a column stored in a data frame".

The detailed rules are laid out here.

The reason for this change is described above in this thread (around post 51), but in short, direct access to a column led to many nasty bugs when using the DataFrames.jl package (people accessed a column, then mutated it, unaware that they had also mutated the contents of the source data frame). The ! symbol adds a bit of verbosity, but it is meant to warn that the operation is potentially mutating (so it is similar in meaning to ! in function names in Base).

Note that instead of df[!, :col] you can just write df.col.
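To make the difference concrete, here is a minimal sketch; the variable names direct and copied are only for illustration.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

direct = df[!, :a]   # the vector stored inside df, no copy is made
copied = df[:, :a]   # a fresh copy of that vector

direct[1] = 100      # mutating `direct` also changes df
copied[2] = 200      # mutating `copied` leaves df untouched

df.a                 # same object as df[!, :a]
```

After running this, df.a holds 100, 2, 3: the write through direct is visible in the data frame, while the write to copied is not.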

Also note that:

  • to change the order of columns you can use the permutecols! function which works in place; if you want to avoid copying of columns and create a new data frame use df[!, [:b, :a, :c]] (this is potentially unsafe though - as explained above)
  • deletecols! was deprecated because select! just does the job with Not indexing, so there would be a duplication of functionality (and select-family of functions is something that many people already know from other data processing ecosystems).
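For reference, a short sketch of the two operations from the question, using only the forms already shown in this thread (the copying reorder with : and select! with Not):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[4, 5, 6], c=[7, 8, 9])

# Reorder columns into a new data frame; `:` copies the selected columns,
# so `reordered` shares no data with df.
reordered = df[:, [:b, :a, :c]]

# Drop column b in place; this replaces deletecols!(df, :b).
select!(df, Not(:b))
```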

I'm writing to say that, in my opinion, the new syntax is nice. The ! symbol inside the selection is a bit strange at first, but once you get used to it, it makes a lot of sense. The other changes are also good in my opinion.

I have a small announcement. The DataFrames Tutorial has been in sync with the DataFrames.jl 0.19 release for some time. I have now also added examples of integration with JSONTables.jl to the 04_loadsave.ipynb notebook.

@quinnj has done a fantastic job with JSONTables.jl, and with release 0.1.2 of this package you can safely read and write JSON data (both row- and column-oriented) to and from a DataFrame.

FWIW, I think it would really be worth adding a Project + Manifest to that repo. Then there is no problem getting the exact versions that were used when running the notebooks.

Good point - I will add this (I started doing these tutorials before Project.toml was an option).

If you have a few minutes to spare, @bkamins, could you cast your expert eye over the Wikibooks chapter to see if I've made any howlers? It dates back a long way (2015), but I'm sentimental and don't want to just delete the whole section and point to your (much better) material. (Besides, I'd have to redo all the links. :scream:)

(Just mention any problems, don't waste time learning the Wikibooks markup language. :joy:)
