Release announcements for DataFrames.jl

No, CategoricalArrays and DataFrames no longer know anything about one another. You should do

using DataFrames
using CategoricalArrays
df = DataFrame(a = [1, 1, 1, 2, 2, 2])
transform!(df, :a => categorical => :a)

to get the equivalent of categorical!(df, :a)

  1. You can define your own categorical and categorical! methods for data frames in your own code if you like them; they are essentially one-liners.
  2. We could ask @kristoffer.carlsson about the status of conditional dependencies, as having them would resolve this problem entirely (https://github.com/JuliaLang/Pkg.jl/issues/1285 probably captures it).
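
The one-liners mentioned in point 1 could look roughly like the sketch below. The names to_categorical and to_categorical! are hypothetical (they are not part of DataFrames.jl or CategoricalArrays.jl); separate names are used here so that no methods are added to functions owned by other packages.

```julia
using DataFrames
using CategoricalArrays

# Hypothetical helper names, not part of either package.
# In-place variant: replaces column `col` of `df` with a categorical version.
to_categorical!(df::DataFrame, col) = transform!(df, col => categorical => col)

# Copying variant: returns a new data frame and leaves `df` untouched.
to_categorical(df::AbstractDataFrame, col) = transform(df, col => categorical => col)

df = DataFrame(a = [1, 1, 1, 2, 2, 2])
df2 = to_categorical(df, :a)   # df.a is still a plain Vector{Int}
to_categorical!(df, :a)        # now df.a is a CategoricalArray
```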

I’m looking forward to seeing the new improvements to join and groupby reflected in the Database-like ops benchmark.

And I’ve found there is a new kid on the block, called Polars, which seems to perform really well.
https://pypi.org/project/py-polars/

We have just made a 0.22.6 patch release. Hopefully it is the last release before the 1.0 release.

We decided to make the 0.22.6 release to deprecate some outstanding things that should be removed in the 1.0 release but were missed in the 0.22 release process. These functionalities are on the border of being an error, but since they worked in the past we decided to have a release that lets users go through a deprecation cycle (although we assume that the deprecated methods are most likely not used). You can read the release announcement here.

Here is a brief summary of the decisions that motivated the 0.22.6 release:

  • most of the convert methods for types in the DataFrames.jl package are deprecated because they were inconsistent with the convert contract from Julia Base. They were left over from an older style in which constructors fell back to convert. The only conversions that remain are from DataFrameRow and GroupKey to NamedTuple and from SubDataFrame to DataFrame.
  • we have decided not to make AbstractDataFrame an iterator of rows. Still, some methods, like filter, will work as if it were an iterator of rows, for user convenience. If you need an iterator of rows, use the eachrow or Tables.namedtupleiterator functions. This has additionally led to the deprecation of map on GroupedDataFrame, as it currently produces a vector, while in the future we might want it to return some other object (for the same reason broadcasting of a GroupedDataFrame object is also disallowed; we have now made this consistent).
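
As a rough illustration of the two points above, the sketch below uses explicit row iteration via eachrow, filter acting as if the data frame were a row iterator, and the conversions that remain. The data frame and variable names are made up for the example.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# Ask for a row iterator explicitly instead of iterating the data frame itself.
rows = [NamedTuple(r) for r in eachrow(df)]

# filter still behaves as if `df` were an iterator of rows.
big = filter(r -> r.a > 1, df)

# Remaining conversions: DataFrameRow -> NamedTuple and SubDataFrame -> DataFrame.
nt = NamedTuple(df[1, :])
sdf = view(df, 1:2, :)
copied = DataFrame(sdf)   # a fresh DataFrame holding copies of the viewed rows
```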

Therefore we are heading towards 1.0 release with three inconsistencies in the design I am aware of (these things unfortunately have to be learned as exceptions):

  • filter accepts :col => predicate where the predicate is applied per row, as opposed to all other places in the DataFrames.jl ecosystem, where it works on whole columns (we decided to leave this inconsistency for user convenience);
  • df[row, cols] produces a DataFrameRow, which is a view; normally a copy should be produced, but making a copy is inefficient and would be breaking; we believe that leaving this inconsistency will not lead to bugs in code, and in practice it is more convenient than making a copy;
  • df.col .= value assignment currently operates in-place, which is inconsistent with the whole design of DataFrames.jl (it should replace the column); this will be fixed in the future, see pull request #2655 in JuliaData/DataFrames.jl (add support for getproperty broadcasting); however, we are limited here by Julia Base functionality, and the underlying infrastructure we need to make things work properly will only be available in the Julia 1.7 release, see https://github.com/JuliaLang/julia/pull/39473.
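
A small sketch of the first two exceptions, with made-up data: the predicate passed to filter receives one value per row, whereas a function passed to combine receives the whole column, and a DataFrameRow writes through to its parent because it is a view.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

# filter: the predicate is called once per row with a scalar value.
kept = filter(:a => (x -> x > 1), df)

# combine (like most of the API): the function receives the whole column.
total = combine(df, :a => sum => :total)

# df[row, cols] is a view, so assigning through it changes the parent.
row = df[1, :]
row.a = 100
```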

Is it possible to have a filter that operates on the whole column? Or at least a function with the same functionality but a different name? If such a function could also accept a DataFrame as its first argument, it would be even better.

Such a function is already implemented; it is called subset. It will be included in the 1.0 release.
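
A minimal sketch of subset, assuming DataFrames.jl 1.0 or later. The data frame comes first, and the predicate sees the whole column and must return a Boolean vector. The helper mean_a is defined only for this example.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4])

# Helper for the example only, to avoid depending on Statistics.
mean_a(x) = sum(x) / length(x)

# Whole-column predicate: keep the rows whose value exceeds the column mean.
above_mean = subset(df, :a => (x -> x .> mean_a(x)))

# Row-wise behaviour is still available by wrapping the predicate in ByRow.
odd_rows = subset(df, :a => ByRow(isodd))
```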

Does 0.22.6 include faster joins?

No. We discussed this, but decided to leave all changes that involve multi-threading out of the 0.22 branch. I encourage you to use ]add DataFrames#main, as it should be safe to test.

We have finalized DataFrames.jl 1.0 development. Now the only things left are two clean-up operations (Make PrettyTables.jl 1.0 as a dependency · Issue #2714 · JuliaData/DataFrames.jl · GitHub and https://github.com/JuliaData/DataFrames.jl/issues/2642). They should happen this week (we need PrettyTables.jl 1.0 released before finalizing - this is in sync with @Ronis_BR).

Therefore you can assume that checking out ] add DataFrames#main is a way to beta test the package before 1.0 release, which we would highly appreciate - any issues you would report are welcome.
When doing the testing please kindly keep two things in mind:

  • test with --depwarn=yes enabled to make sure you do not miss any deprecated functionality
  • also test with -t auto (or -t N) enabled, as we have added multithreading support to DataFrames.jl
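
One way to confirm that a test session was started with both settings is sketched below. The function session_report is a made-up helper, and Base.JLOptions is an internal, undocumented part of Julia, so treat the depwarn check as an assumption that may not hold across Julia versions.

```julia
# Start Julia with, for example:
#   julia --depwarn=yes -t auto
using Base.Threads: nthreads

# Hypothetical helper. Base.JLOptions() is internal; its `depwarn` field is
# assumed to be 0 for "no", 1 for "yes", and 2 for "error".
function session_report()
    return (threads = nthreads(), depwarn = Base.JLOptions().depwarn != 0)
end

report = session_report()
report.threads > 1 || @warn "Running single-threaded; restart with -t auto or -t N"
report.depwarn || @warn "Deprecation warnings are off; restart with --depwarn=yes"
```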

As this release brings a lot of internal redesign we would highly appreciate both correctness and performance testing to ensure we do not have any regressions.

Thank you for all your support!

Will multithreading be supported for groupby and join operations in version 1.0?

In general, yes. But some operations are still single-threaded. The major blocker to making all operations multithreaded is that Dict from Julia Base is not thread-safe for writing (and we would prefer to use the standard implementation for this functionality).

How about petitioning Julia Base to create a new data structure called ThreadSafeDict to replace Dict? Then you could easily fix the problem by finding and replacing every Dict with ThreadSafeDict.

We have already discussed this. The issue is not that simple. In general there are many standard structures and functions in Julia Base that could potentially be multi-threaded but are currently single-threaded.

Regarding Dict: a plain ThreadSafeDict is not what we want, because such a dictionary would have to use locking to make parallel reading and writing safe. What we need is lock-free parallel writing to a dictionary data structure (i.e. for cases where writing happens in parallel and reading happens in parallel, but we know that writing and reading cannot happen at the same time). Most likely the Julia Base devs consider such a use case too specific to include in Julia Base. We will see.
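
To make the locking cost concrete, here is a rough sketch of what a lock-based dictionary could look like. LockedDict, increment!, and count_locked are hypothetical names invented for this illustration; this is neither a Julia Base type nor anything in DataFrames.jl. Every write takes the same lock, so parallel writers end up serialized.

```julia
using Base.Threads

# Hypothetical wrapper: a Dict guarded by a single lock.
struct LockedDict{K, V}
    lk::ReentrantLock
    data::Dict{K, V}
end
LockedDict{K, V}() where {K, V} = LockedDict{K, V}(ReentrantLock(), Dict{K, V}())

# Each update acquires the lock, so concurrent writers wait for one another.
function increment!(d::LockedDict{K, V}, key::K) where {K, V}
    lock(d.lk) do
        d.data[key] = get(d.data, key, zero(V)) + one(V)
    end
    return d
end

# Count occurrences of each item using all available threads.
function count_locked(items::Vector{K}) where {K}
    d = LockedDict{K, Int}()
    @threads for x in items
        increment!(d, x)
    end
    return d.data
end

counts = count_locked(["a", "b", "a", "c", "a"])
```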

In general we are going to focus on the performance of DataFrames.jl after 1.0, so in the worst case we will write our own implementation if it is needed. It will just take time (note how long it took to stabilize the DataFrames.jl API enough to warrant its 1.0 release).

Out of curiosity, can’t you just use a Vector{Dict} of length nthreads? Each thread can write to its own dictionary and you can read from the whole Vector if needed. Of course, you still need some sort of conflict resolution, but that really depends on your case; I do not think it is possible to invent rules generic enough to satisfy everyone.

This is exactly what we plan to do, but probably with an NTuple rather than a Vector, and with sharding based on the key hash modulo nthreads.
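
The sharding idea can be sketched as follows. This is only an illustration under simplifying assumptions, not the planned DataFrames.jl implementation: shard_of, sharded_count, and lookup are made-up names, and each task scans all items and keeps only the keys that hash to its own shard, which trades redundant scanning for lock-free writes.

```julia
using Base.Threads

# Map a key to a shard index in 1:n using its hash modulo n.
shard_of(key, n::Int) = Int(hash(key) % UInt(n)) + 1

# Count occurrences of each item; shard s is written by exactly one task.
function sharded_count(items::AbstractVector{K}, nshards::Int = nthreads()) where {K}
    shards = ntuple(_ -> Dict{K, Int}(), nshards)   # an NTuple of dictionaries
    @sync for s in 1:nshards
        Threads.@spawn begin
            d = shards[s]
            for k in items
                # Only the owner of shard s touches d, so no lock is needed.
                if shard_of(k, nshards) == s
                    d[k] = get(d, k, 0) + 1
                end
            end
        end
    end
    return shards
end

# Reading needs no conflict resolution: each key lives in exactly one shard.
lookup(shards, key) = get(shards[shard_of(key, length(shards))], key, 0)

shards = sharded_count(["a", "b", "a", "c", "a"], 4)
```

Because the hash decides which shard owns a key, the conflict resolution raised in the question above does not come up in this variant; the price is that writing and reading must happen in separate phases.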

This is awesome. Hope we will see it as a separate package one day, it will be interesting to toy with something like that.

So what is the suggested way for fast mutation of a column?

Use df[:, :col] .= value. The point is that df[:, :col] .= value has to stick to the eltype of :col, while df.col .= value will allow assigning any value to :col (at the cost of allocating).

Note that this distinction will only matter starting with the 1.7 release of Julia, as currently the language does not allow for it.
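
A short sketch of the difference between : and ! in broadcasted assignment, assuming DataFrames.jl 1.0 or later (the data is made up): with : the existing vector is updated in place and its eltype is enforced, while with ! the column is replaced by a freshly allocated one.

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
original = df.col             # reference to the stored column vector

# In-place update: the same vector is reused, so the eltype must match.
df[:, :col] .= 10
same_vector = original === df.col

# A value that cannot be converted to the column eltype is rejected.
rejected = try
    df[:, :col] .= "x"
    false
catch
    true
end

# Replacement: `!` allocates a new column, so the eltype may change.
df[!, :col] .= "x"
```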

To clarify, sdf[!, :col] .= value will work for a SubDataFrame, right?

Ah - sorry. Clearly it should be : not !, as ! is for replacement (and in target API df.col .= value and df[!, :col] .= value will be the same). I have edited my comment. (too many things happen in parallel)
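
For a SubDataFrame, the in-place form with : writes through to the parent data frame, as in this small sketch with made-up data:

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3, 4])
sdf = view(df, 2:3, :)    # SubDataFrame covering rows 2 and 3

# In-place broadcast through the view updates the parent's rows.
sdf[:, :col] .= 0
```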
