DataFrames v0.15.0 released

announcement

#1

@nalimilan has just tagged release v0.15.0 of the DataFrames.jl package.

It contains many major changes, in particular significant improvements to the split-apply-combine family of functions, both in terms of speed and usability.

You can read the full list of enhancements with examples here: https://juliasnippets.blogspot.com/2018/12/release-notes-for-dataframesjl-package.html.


#2

In the group-by section it mentions the new API and shows how to use it for functions with one argument, e.g. sum and maximum. How do I perform a groupby with functions that accept two or more arguments, e.g. the correlation of two columns within each group?


#3
julia> using DataFrames, Statistics

julia> df = DataFrame(x = repeat(1:2, 5), a=rand(10), b=rand(10));

julia> by(df, :x, cor = (:a, :b) => x->cor(x...))
2×2 DataFrame
│ Row │ x     │ cor      │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.204087 │
│ 2   │ 2     │ 0.463879 │

EDIT: the columns are passed as a NamedTuple of vectors, so in the example I splat it.


#4

I wonder if this can be “improved”? The splatting seems “unnatural” somehow; or maybe it’s just me not getting used to it. But clearly this is superior to what we had before!! Thanks @nalimilan!


#5

Kudos on the new release, the improvement in the API and performance of split-apply-combine are massive!

Concerning the splatting, a trick could be deconstructing the tuple (when it is more natural to use many positional arguments rather than a single NamedTuple). Referring to the columns with . is also a possibility. For example:

by(df, :x, cor = (:a, :b) => ((a, b),)->cor(a,b))
by(df, :x, cor = (:a, :b) => x->cor(x.a,x.b))
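
As a minimal sketch of why these forms agree, the snippet below uses a plain NamedTuple of vectors as a stand-in for what by passes to the function. The variable names and data are made up for illustration, and DataFrames is not needed to run it:

```julia
using Statistics

# Hypothetical stand-in for the per-group NamedTuple of column vectors
nt = (a = [1.0, 2.0, 3.0], b = [2.0, 4.0, 6.0])

splat    = x -> cor(x...)            # splatting passes the values positionally
destruct = ((a, b),) -> cor(a, b)    # destructuring the single argument
fields   = x -> cor(x.a, x.b)        # accessing the fields by name

# All three end up calling cor with the same two vectors
@assert splat(nt) == destruct(nt) == fields(nt)
```

Note that the first two forms depend on the order of the columns in the tuple, while the field-access form depends only on their names.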

#6

Also keep in mind that this is supposed to be the “low-level” API, which DataFramesMeta can use to allow for an even more convenient syntax.


#7

Just did a quick test. The aggregation took 30 seconds in Julia for 100 million rows with 1 million groups, and R’s data.table took 15 seconds. But R has string interning, so it’s cheating a bit. InternedStrings.jl is now really fast, but after converting the strings to InternedString there is no way to specify a different algorithm for faster grouping; I probably need to bring out the interned-string group-by algorithm from FastGroupBy.jl.

I need to update https://github.com/xiaodaigh/DataBench.jl to v1.0 to do the tests.


#8

Did you use Vector{String} or a CategoricalVector? Categorical should be faster if I remember correctly (and if not we know what to do to make it faster).


#9

Yes, Julia’s categorical type will be faster because data.table hasn’t implemented the optimisation for categorical grouping.


#10

Actually - I have just run the test on 100m rows and 1m groups using:

by(df, :x, y = :y=>sum)

and when x is a Vector{String} it is 2x faster than when it is a CategoricalVector, which is surprising. We should investigate the reasons.

CC @nalimilan


#11

For categorical one can use counting sort to speed up the grouping
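
A rough sketch of the idea, assuming the categorical values are available as integer codes in 1:nlevels. The function name and return values here are hypothetical and this is not how DataFrames.jl implements grouping:

```julia
# Counting-sort grouping on integer codes (hypothetical helper).
# Returns a permutation that puts rows of the same group next to each other,
# plus the start position and size of each group within that permutation.
function group_by_codes(codes::AbstractVector{<:Integer}, nlevels::Integer)
    counts = zeros(Int, nlevels)
    for c in codes
        counts[c] += 1                      # pass 1: count rows per level
    end
    starts = cumsum(counts) .- counts .+ 1  # first slot of each group
    next = copy(starts)                     # next free slot per group
    perm = Vector{Int}(undef, length(codes))
    for (i, c) in enumerate(codes)
        perm[next[c]] = i                   # pass 2: place row index in its group
        next[c] += 1
    end
    return perm, starts, counts
end

codes = [2, 1, 2, 3, 1]
perm, starts, counts = group_by_codes(codes, 3)
@assert perm == [2, 5, 1, 3, 4]
@assert codes[perm] == [1, 1, 2, 2, 3]
```

Both passes are linear in the number of rows, with extra memory proportional to the number of levels, which is why this suits the case of few levels and many rows.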


#12

I think categorical arrays are not optimized for only 100 observations per group. The typical ratio would be much higher than that. Do you get the same result with 1000 or 10000?


#13

Here is a benchmark you can run:

using Random
using DataFrames

function testspeed(m, n)
    Random.seed!(1234)
    println("\n$m categories, $(m*n) total rows")
    x = repeat([randstring() for i in 1:m], n)
    println("Categorical generation time")
    @time y = categorical(x)
    df = DataFrame(x=x, y=y, z=1)
    println("String time")
    @time by(df, :x, :z=>sum)
    println("Categorical time")
    @time by(df, :y, :z=>sum)
    nothing
end

testspeed(10, 10) # precompile

for i in 1:6
    testspeed(10^i, 10^(8-i))
end

which produces

10 categories, 100000000 total rows
Categorical generation time
  7.474020 seconds (100.00 M allocations: 3.353 GiB, 19.12% gc time)
String time
  5.538156 seconds (253 allocations: 3.980 GiB, 19.41% gc time)
Categorical time
  2.484464 seconds (303 allocations: 2.235 GiB, 26.33% gc time)

100 categories, 100000000 total rows
Categorical generation time
  5.180744 seconds (100.00 M allocations: 3.353 GiB, 19.99% gc time)
String time
  7.026005 seconds (1.06 k allocations: 3.980 GiB, 16.44% gc time)
Categorical time
  3.776462 seconds (1.30 k allocations: 2.235 GiB, 18.57% gc time)

1000 categories, 100000000 total rows
Categorical generation time
  5.077316 seconds (100.00 M allocations: 3.353 GiB, 21.20% gc time)
String time
  6.861087 seconds (10.15 k allocations: 3.981 GiB, 16.28% gc time)
Categorical time
  3.750762 seconds (12.21 k allocations: 2.236 GiB, 17.40% gc time)

10000 categories, 100000000 total rows
Categorical generation time
  6.873999 seconds (100.01 M allocations: 3.355 GiB, 15.22% gc time)
String time
  8.694547 seconds (109.16 k allocations: 3.983 GiB, 12.69% gc time)
Categorical time
  5.585156 seconds (138.20 k allocations: 2.240 GiB, 12.53% gc time)

100000 categories, 100000000 total rows
Categorical generation time
  7.396810 seconds (100.10 M allocations: 3.377 GiB, 14.65% gc time)
String time
 11.422482 seconds (999.15 k allocations: 4.012 GiB, 9.91% gc time)
Categorical time
  7.270968 seconds (1.30 M allocations: 2.287 GiB, 9.57% gc time)

1000000 categories, 100000000 total rows
Categorical generation time
 27.256516 seconds (101.00 M allocations: 3.560 GiB, 10.31% gc time)
String time
 12.614024 seconds (8.00 M allocations: 4.241 GiB, 15.22% gc time)
Categorical time
 28.275873 seconds (11.00 M allocations: 2.678 GiB, 8.29% gc time)

So the problem kicks in when there are many small categories (but maybe we could handle that case). Also note that the categorical generation time is large, and that for smaller numbers of categories we might try to get bigger gains.


#14

The code below, which is one of the examples at https://github.com/JuliaPlots/StatPlots.jl, gives an error now: the violin plot appears, but the boxplot does not. Is this just due to StatPlots needing to catch up to something new in DataFrames?

using StatPlots
import RDatasets
singers = RDatasets.dataset("lattice","singer")
@df singers violin(:VoicePart,:Height,marker=(0.2,:blue,stroke(0)))
@df singers boxplot!(:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))

#15

I have checked and the following code:

@df singers boxplot(:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))

works as expected.

Also:

@df singers boxplot!(:VoicePart,:Height,marker=(0.3,:orange,stroke(2)))

failed under DataFrames 0.14.1.

Finally, inspecting the @macroexpand result shows that the columns are correctly extracted from the data frame and that something goes wrong later.

In summary: unless I have completely mixed things up, the problem is with calling boxplot! after violin and is unrelated to the DataFrames.jl package (but maybe it should fail - I do not know the StatPlots.jl package well enough).

However, thank you for reporting; the DataFrames.jl package is one of the oldest in the ecosystem and we need to make many breaking changes to catch up with Julia 1.0 so we might unintentionally break something on the way. We will try to be responsive in fixing issues.


#16

We could easily fall back to the general method if the number of levels is large compared with the number of rows, but categorical arrays are not designed for that, so… As you note, generation will be inefficient anyway in the presence of many levels. Basically, generating a categorical array should take about the same time as grouping, which appears to be the case as long as there aren’t too many levels.

BTW, note that for this kind of benchmark groupby should be tested rather than by, since the combine step is common to the two different array types.
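
A possible way to time only the grouping step, as a sketch that assumes DataFrames and CategoricalArrays are installed. The helper name time_groupby is made up here, and the timings will depend on the package versions and the machine:

```julia
using DataFrames, CategoricalArrays, Random

# Hypothetical helper: time groupby alone on a String column and on a
# categorical column holding the same values (m levels, m*n rows).
function time_groupby(m, n)
    Random.seed!(1234)
    x = repeat([randstring() for _ in 1:m], n)
    df = DataFrame(x = x, y = categorical(x))
    t_string = @elapsed groupby(df, :x)
    t_categorical = @elapsed groupby(df, :y)
    ngroups = length(groupby(df, :x))
    return (string = t_string, categorical = t_categorical, ngroups = ngroups)
end

time_groupby(10, 10)             # warm-up call so compilation is not timed
res = time_groupby(1_000, 100)   # 1000 levels, 100000 rows
println(res)
```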


#17

Agreed - but you never know what usage pattern you will see in the wild.
Actually the “problematic case” is the one that @xiaodai originally asked about.


#18

Yes but AFAICT it’s not really slower when there are many levels, it’s about the same speed as Vector{String}.


#19

See the last benchmark in my post above - it is over 2x slower (unless it happens only on my computer).


#20

OK, I had missed the last one. It should be possible to have a fast path for when there are too many categories which doesn’t sort groups, and do the sorting only in a second step (as for strings). The grouping will then be a simple copy of integer codes.