Release announcements for DataFrames.jl

pdeffebach · December 1, 2020, 9:18pm

No, CategoricalArrays and DataFrames no longer know anything about one another. You should do

using DataFrames
using CategoricalArrays
df = DataFrame(a = [1, 1, 1, 2, 2, 2])
transform!(df, :a => categorical => :a)

to get the equivalent of categorical!(df, :a)

bkamins · December 1, 2020, 9:54pm

You can define your own categorical and categorical! methods, as they are pretty simple to add in your own code if you like them (essentially one-liners).
We could ask @kristoffer.carlsson what the status of conditional dependencies is, as having them would resolve this problem entirely (probably https://github.com/JuliaLang/Pkg.jl/issues/1285 captures it).
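
A rough sketch of what such a one-liner could look like. The name to_categorical! is hypothetical (it is not part of either package) and is chosen only to avoid clashing with anything the packages export:

```julia
using DataFrames
using CategoricalArrays

# Hypothetical helper (not part of DataFrames.jl or CategoricalArrays.jl):
# replace column `col` of `df` with its categorical version, mutating `df`.
to_categorical!(df::DataFrame, col) = transform!(df, col => categorical => col)

df = DataFrame(a = [1, 1, 1, 2, 2, 2])
to_categorical!(df, :a)
```

After the call, df.a is a CategoricalArray with levels 1 and 2.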

Juan · March 20, 2021, 12:14am

I’m looking forward to seeing the new improvements to join and groupby reflected in the Database-like ops benchmark.

And I’ve found there is a new kid on the block, called Polars, which seems to perform really well.
https://pypi.org/project/py-polars/

bkamins · March 25, 2021, 12:49pm

We have just made a 0.22.6 patch release. Hopefully it is the last release before the 1.0 release.

We decided to make the 0.22.6 release to deprecate some outstanding things that should be removed in the 1.0 release but were missed in the 0.22 release process. These functionalities are on the border of being an error, but since they worked in the past we decided to have a release that allows users to go through a deprecation cycle (although we assume that the deprecated methods are most likely not used). You can read the release announcement here.

Here is a brief summary of the decisions that motivated the 0.22.6 release:

most of the convert methods for types in the DataFrames.jl package are deprecated; the reason is that they were inconsistent with the convert contract from Julia Base. They were left over from the old style, where constructors fell back to convert. The only conversions that are left are from DataFrameRow and GroupKey to NamedTuple and from SubDataFrame to DataFrame.
we have decided not to make AbstractDataFrame an iterator of rows. Still, some methods, like filter, will work as if it were an iterator of rows for user convenience. If you need an iterator of rows, use the eachrow or Tables.namedtupleiterator functions. This has additionally led to the deprecation of map on GroupedDataFrame, as it currently produces a vector, while in the future we might want it to return some other object (for the same reason broadcasting of a GroupedDataFrame object is also disallowed; now we have made this consistent).
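
A minimal sketch of row iteration with the two functions named above. It assumes Tables.jl is installed in the environment alongside DataFrames.jl; the data frame and its column names are made up for illustration:

```julia
using DataFrames
using Tables  # assumption: Tables.jl is available as a direct dependency

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# eachrow yields DataFrameRow objects (views into df)
doubled = [row.a * 2 for row in eachrow(df)]

# Tables.namedtupleiterator yields plain NamedTuples
rows = collect(Tables.namedtupleiterator(df))

# filter still treats the data frame as if it were a collection of rows
kept = filter(row -> row.a > 1, df)
```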

Therefore we are heading towards 1.0 release with three inconsistencies in the design I am aware of (these things unfortunately have to be learned as exceptions):

filter accepts :col => predicate where predicate is applied per row, as opposed to all other places in the DataFrames.jl ecosystem, where it works on whole columns (we decided to leave this inconsistency for user convenience);
df[row, cols] produces a DataFrameRow, which is a view; normally a copy should be produced, but making a copy is inefficient, and changing this would be breaking; we believe that leaving this inconsistency will not lead to bugs in code, and in practice it is more convenient than making a copy;
df.col .= value assignment currently operates in-place, which is inconsistent with the whole design of DataFrames.jl (it should replace the column); this will be fixed in the future, see the pull request "add support for getproperty broadcasting" (#2655 in JuliaData/DataFrames.jl); however, we are limited here by Julia Base functionality, and the underlying infrastructure we need to make things work properly will only be available in the Julia 1.7 release, see https://github.com/JuliaLang/julia/pull/39473.
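
To illustrate the second point, a small sketch of the view semantics of df[row, cols], contrasted with column indexing by :, which copies. The data is made up for illustration:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

row = df[1, :]       # DataFrameRow: a view into df, not a copy
row.a = 100          # writes through to the parent data frame

col = df[:, :a]      # `:` row indexing copies the column
col[2] = -1          # so this does not touch df
```

After running this, df.a[1] is 100 (changed through the view), while df.a[2] is still 2.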

Skoffer · March 25, 2021, 1:19pm

Is it possible to have a filter which operates on the whole column? Or at least a function with the same functionality but a different name? If such a function could also accept a DataFrame as its first argument, it would be even better.

bkamins · March 25, 2021, 1:32pm

There is such a function implemented already; it is called subset. It will be included in the 1.0 release.
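
A short sketch of the difference, assuming DataFrames.jl 1.0 or later (where subset is available); the data is made up for illustration:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4])

# filter: the predicate sees one value of :a at a time; df is the second argument
by_row = filter(:a => x -> x > 2, df)

# subset: the function sees the whole column and must return a Bool vector;
# df is the first argument
by_col = subset(df, :a => x -> x .> 2)

# ByRow wraps a per-row predicate so it can be used where columns are expected
by_row_wrapped = subset(df, :a => ByRow(>(2)))
```

All three calls select the same rows here; they differ only in whether the function receives a single value or the whole column.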

nilshg · March 25, 2021, 2:09pm

Does 0.22.6 include faster joins?

bkamins · March 25, 2021, 2:14pm

No. This is something we discussed, but we decided to leave all changes that involve multithreading out of the 0.22 branch. I encourage you to use ]add DataFrames#main as it should be safe to test.

bkamins · April 13, 2021, 10:13am

We have finalized DataFrames.jl 1.0 development. Now the only things left are two clean-up operations (Make PrettyTables.jl 1.0 as a dependency · Issue #2714 · JuliaData/DataFrames.jl · GitHub and https://github.com/JuliaData/DataFrames.jl/issues/2642). They should happen this week (we need PrettyTables.jl 1.0 released before finalizing - this is in sync with @Ronis_BR).

Therefore you can assume that checking out ] add DataFrames#main is a way to beta test the package before the 1.0 release, which we would highly appreciate. Any issues you report are welcome.
When doing the testing, please keep two things in mind:

test with --depwarn=yes enabled to make sure you do not miss any deprecated functionality
also test with -t auto (or -t N) enabled, as we have added multithreading support to DataFrames.jl

As this release brings a lot of internal redesign we would highly appreciate both correctness and performance testing to ensure we do not have any regressions.

Thank you for all your support!

Yifan_Liu · April 17, 2021, 10:59pm

Will multithreading be supported for groupby and join operations in version 1.0?

bkamins · April 18, 2021, 6:33am

In general yes. But some operations are still single threaded. The major blocker to making all operations multithreaded is that Dict from Julia Base is not thread safe for writing (we would prefer to use standard implementation for this functionality).

StevenSiew · April 19, 2021, 3:00am

> In general yes. But some operations are still single threaded. The major blocker to making all operations multithreaded is that Dict from Julia Base is not thread safe for writing (we would prefer to use standard implementation for this functionality).

How about petitioning Julia Base to create a new data structure called ThreadSafeDict to replace Dict? Then you could easily fix the problem by finding all uses of Dict and replacing them with ThreadSafeDict.

bkamins · April 19, 2021, 8:50am

We have already discussed this. The issue is not that simple. In general there are many standard structures and functions in Julia Base that could potentially be multithreaded but are currently single threaded.

Regarding Dict: the issue is not that we just want a ThreadSafeDict, as such a dictionary would have to use locking to make sure parallel reading and writing is safe. What we need is lock-free parallel writing to a dictionary data structure (i.e. for cases where writing happens in parallel and reading happens in parallel, but we know that writing and reading cannot happen at the same time). Most likely the Julia Base devs consider such a use case too specific to include in Julia Base. We will see.

In general we are going to focus on performance of DataFrames.jl after 1.0 so in the worst case we will have our own implementation if it is needed. It will just take time (note how long it took to stabilize DataFrames.jl API enough to warrant its 1.0 release).

Skoffer · April 19, 2021, 9:26am

Out of curiosity, can’t you just use a Vector{Dict} of length nthreads? Each thread can write to its own dictionary and you can read from the whole Vector if needed. Of course, you still need some sort of conflict resolution, but that really depends on your case; I do not think it is possible to invent rules generic enough that everyone is satisfied.

bkamins · April 19, 2021, 11:04am

This is exactly what we plan to do, but probably with an NTuple rather than a Vector, and with sharding based on key hash modulo nthreads.
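
A rough sketch of the sharding idea, using only Julia Base. This is not the DataFrames.jl implementation; the function names shard_index, sharded_count and lookup are hypothetical, and counting occurrences is just a stand-in workload. Each task owns exactly one dictionary and only writes keys whose hash maps to it, so no locking is needed during the write phase:

```julia
# Hypothetical helper: map a key to a shard number in 1:nshards via its hash.
shard_index(key, nshards::Integer) = Int(hash(key) % UInt(nshards)) + 1

# Count occurrences of each item, writing into one Dict per shard.
# Every task scans the whole input but only writes to its own shard,
# so no two tasks ever write to the same Dict.
function sharded_count(items::AbstractVector{K},
                       nshards::Integer = Threads.nthreads()) where {K}
    shards = ntuple(_ -> Dict{K,Int}(), nshards)
    Threads.@threads for s in 1:nshards
        d = shards[s]
        for item in items
            if shard_index(item, nshards) == s
                d[item] = get(d, item, 0) + 1
            end
        end
    end
    return shards
end

# Reading (after all writing has finished) only needs to consult one shard.
lookup(shards, key) = get(shards[shard_index(key, length(shards))], key, 0)

counts = sharded_count(["a", "b", "a", "c", "a"], 4)
```

The sketch trades redundant hashing (every task hashes every item) for lock-free writes; partitioning the input by shard up front would avoid that cost.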

Skoffer · April 19, 2021, 11:07am

This is awesome. Hope we will see it as a separate package one day, it will be interesting to toy with something like that.

Henrique_Becker · April 19, 2021, 1:43pm

So what is the suggested way for fast mutation of a column?

bkamins · April 19, 2021, 2:27pm

df[:, :col] .= value. The point is that df[:, :col] .= value has to stick to the eltype of :col, while df.col .= value will allow assigning any value to :col (at the cost of allocating).

Note that this will only be an issue starting with the 1.7 release of Julia, as currently the language does not allow for this distinction.

pdeffebach · April 19, 2021, 2:40pm

To clarify, sdf[!, :col] .= value will work for a SubDataFrame, right?

bkamins · April 19, 2021, 2:52pm

Ah - sorry. Clearly it should be : not !, as ! is for replacement (and in target API df.col .= value and df[!, :col] .= value will be the same). I have edited my comment. (too many things happen in parallel)
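
A small sketch of the : versus ! distinction described above, on a plain DataFrame with made-up data. It deliberately avoids df.col .= value, whose behavior depends on the DataFrames.jl and Julia versions:

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
old = df.col                  # keep a reference to the stored vector

df[:, :col] .= 10             # in place: same vector, eltype stays Int
in_place = df.col === old

# In-place assignment has to respect the eltype, so a non-integer Float64
# cannot be written into an Int column.
failed = try
    df[:, :col] .= 1.5
    false
catch err
    err isa InexactError
end

df[!, :col] .= 1.5            # replaces the column with a new Vector{Float64}
replaced = df.col !== old
```

Here in_place, failed and replaced all end up true, and the original vector old still holds the values written in place.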
