Who does "better" than DataFrames?

Depending on the context, a lighter-weight table type seems to outperform DataFrames when you include the cost of constructing the table:

Setup
const s=repeat(1:10^6, inner=4)
const t=repeat(1:4, 10^6)
const r=rand(4*10^6)

using DataFrames

const df = DataFrame(;s,t,r)

function df_trial(s,t,r)
    df = DataFrame(;s,t,r)
    combine(groupby(df, :s), :r=>maximum)
end


using TypedTables
const tbl = Table(s=s,t=t,r=r)

function inner_loop!(group_maxima, s, r)
    for i in eachindex(s)
        @inbounds k = s[i]
        @inbounds group_maxima[k] = max(get(group_maxima, k, -Inf), r[i])
    end
end

# The same approach as @johnmyleswhite's, but using TypedTables instead
function custom_maximum(df)
    s, r = df.s, df.r
    group_maxima = Dict{eltype(s), Float64}()
    inner_loop!(group_maxima, s, r)
    Table(
        s = collect(keys(group_maxima)),
        r_maximum = collect(values(group_maxima)),
        )
end
    
function tbl_trial(s,t,r)
    tbl = Table(s=s,t=t,r=r)
    custom_maximum(tbl)
end
    
# warm up (compile) each method once before benchmarking
tbl_trial(s,t,r)
df_trial(s,t,r)
combine(groupby(df, :s), :r=>maximum)
custom_maximum(tbl)
# create and group a dataframe
julia> @benchmark df_trial($s,$t,$r)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  201.259 ms … 234.272 ms  ┊ GC (min … max): 4.96% … 9.57%
 Time  (median):     217.930 ms               ┊ GC (median):    8.59%
 Time  (mean ± σ):   215.847 ms ±  10.645 ms  ┊ GC (mean ± σ):  7.98% ± 2.94%

  ▃         ▃                       ▃     █         ▃
  █▇▇▁▁▁▇▇▁▇█▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁▇▁▇▁▁█▁▁▁▁▁█▁▇▁▇▁▁▁▁▁█▁▁▁▁▁▁▁▇▁▇ ▁
  201 ms           Histogram: frequency by time          234 ms <

 Memory estimate: 311.97 MiB, allocs estimate: 308.

# create and group a typedtable
julia> @benchmark tbl_trial($s,$t,$r)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  140.362 ms … 165.967 ms  ┊ GC (min … max): 0.00% … 5.27%
 Time  (median):     151.275 ms               ┊ GC (median):    3.38%
 Time  (mean ± σ):   151.464 ms ±   5.304 ms  ┊ GC (mean ± σ):  2.77% ± 1.79%

                     █  ▃ ▃ █▃   ▃▃                           ▃
  ▇▁▁▁▁▇▁▁▁▁▇▇▁▁▁▁▇▇▇█▇▁█▇█▇██▇▇▁██▇▁▇▁▁▁▇▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  140 ms           Histogram: frequency by time          166 ms <

 Memory estimate: 80.43 MiB, allocs estimate: 58.

# just group an existing DataFrame
julia> @benchmark combine(groupby($df, :s), :r=>maximum)      
BenchmarkTools.Trial: 157 samples with 1 evaluation.
 Range (min … max):  27.072 ms … 41.158 ms  ┊ GC (min … max): 0.00% … 22.72%
 Time  (median):     30.126 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.019 ms ±  3.704 ms  ┊ GC (mean ± σ):  8.71% ±  9.68%

      ▃▂▄█▇▂ ▄▂                     ▂  ▂
  ▆▁▅███████▇██▇▁▃▆▃▃▆▃▅▃▃▃▆▃▃▁▃▁▅▃▅█▇▇█▆▅▁▅█▇▃▆▃█▁▁▁▁▁▁▃▃▁▁▃ ▃
  27.1 ms         Histogram: frequency by time        40.6 ms <

 Memory estimate: 55.33 MiB, allocs estimate: 327.

# Just group an existing Table
julia> @benchmark custom_maximum($tbl)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  142.760 ms … 172.040 ms  ┊ GC (min … max): 0.00% … 2.32%
 Time  (median):     150.780 ms               ┊ GC (median):    3.56%
 Time  (mean ± σ):   151.268 ms ±   5.719 ms  ┊ GC (mean ± σ):  2.94% ± 1.97%

                ▂     █
  ▅█▁▅▁▅▅▅▅▅▁▅▅▅█▅▅█▅▅█▁▁▁▁▅▅▁█▁▅▁▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  143 ms           Histogram: frequency by time          172 ms <

 Memory estimate: 80.43 MiB, allocs estimate: 58.

I'm curious whether anybody has tried InMemoryDatasets.jl. According to [ANN] A new lightning fast package for data manipulation in pure Julia, it is faster than DataFrames.

Not in this specific case:
julia> @btime InMemoryDatasets.combine(InMemoryDatasets.groupby(ds, :s), :r=>maximum)
  75.507 ms (575 allocations: 131.65 MiB)
1000000×2 Dataset
     Row │ s         maximum_r
         │ identity  identity
         │ Int64?    Float64?
─────────┼─────────────────────
       1 │        1   0.847286
       2 │        2   0.851498
       3 │        3   0.958781
    ⋮    │    ⋮          ⋮
 1000000 │  1000000   0.789219
            999996 rows omitted

julia> @btime DataFrames.combine(DataFrames.groupby(df, :s), :r=>maximum)
  32.127 ms (350 allocations: 55.33 MiB)
1000000×2 DataFrame
     Row │ s        r_maximum
         │ Int64    Float64
─────────┼────────────────────
       1 │       1   0.847286
       2 │       2   0.851498
       3 │       3   0.958781
    ⋮    │    ⋮         ⋮
 1000000 │ 1000000   0.789219
           999996 rows omitted

But apart from the implementation aspects, what algorithm(s) are used?

The grouping algorithm uses an open-addressing hash table with linear probing to find groups. Look at row_group_slots! in the src/groupeddataframe/utils.jl file of DataFrames.
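For intuition, here is a minimal sketch of that idea (my own illustration, not DataFrames' actual code): each key either finds its existing slot or claims the next free one, so group ids get assigned in a single pass with no per-key allocation.

function group_ids(keys::AbstractVector)
    sz = 4 * length(keys)                   # oversized table keeps probe chains short
    slots = zeros(Int, sz)                  # 0 = empty, otherwise a group id
    groups = eltype(keys)[]                 # representative key for each group id
    gids = Vector{Int}(undef, length(keys))
    for i in eachindex(keys)
        h = Int(hash(keys[i]) % UInt(sz)) + 1
        while true
            j = slots[h]
            if j == 0                       # empty slot: register a new group
                push!(groups, keys[i])
                slots[h] = length(groups)
                gids[i] = length(groups)
                break
            elseif isequal(groups[j], keys[i])  # occupied by the same key: reuse its id
                gids[i] = j
                break
            end
            h = h == sz ? 1 : h + 1         # linear probing: try the next slot
        end
    end
    return gids, groups
end

Once every row has a group id, a per-group maximum is a single indexed pass over r.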

On my computer, even unique(s) is slower than the whole combine statement (no threading), which is surprising. This is worth digging into, in order to better optimize the rest of the ecosystem.

UPDATE:

Actually, after a bit more digging, it turns out DataFrames uses a few more assumptions that allow it to make things faster. Essentially, it assumes the groups defined by the :s column are the integers between its extrema, (1, 1_000_000), which makes any sorting and hashing unnecessary and allows it to construct the result immediately. It's a useful optimization and isn't cheating, but it doesn't have the generality of any of the other methods attempted here.

The relevant code in DataFrames is DataFrames.refpool_and_array(df.s).

Here is a benchmark showing that such an optimization, custom-tailored to this case, achieves double the speed of DataFrames:

function testy(s,r)
    # custom-tailored: assumes the groups are exactly 1:1_000_000 and r ≥ 0
    res = fill(0.0, 1_000_000)
    @inbounds for i in eachindex(s)
        res[s[i]] = max(res[s[i]], r[i])
    end
    pairs(res)
end

and

julia> @btime combine(groupby(df, :s),:r=>maximum; threads=false);
  162.666 ms (334 allocations: 55.33 MiB)

julia> @btime testy($s,$r);
  80.950 ms (2 allocations: 7.63 MiB)

But does this "trick" work for non-numeric columns as well?

Nope. Easy enough to check.

using Random
using OrderedCollections
using DataFrames

r = rand(4*10^6)
st = shuffle(repeat([randstring(5) for _ in 1:10^6],inner=4))
df = DataFrame(; st, r)

function teststr(s,r)
    res = sizehint!(Dict{String, Float64}(),length(s)÷4)
    @inbounds for i in 1:length(s)
        mergewith!(max, res, LittleDict((s[i],),(r[i],)))
    end
    res
end

and (@time after a compilation run):

julia> combine(groupby(df, :st),:r=>maximum; threads=false);

julia> teststr(st,r);

julia> @time combine(groupby(df, :st),:r=>maximum; threads=false);
  1.555349 seconds (286 allocations: 148.879 MiB)

julia> @time teststr(st,r);
  1.633208 seconds (7 allocations: 34.001 MiB)

Finally, DataFrames manages not to be so darn fast :wink:


Is it really effective to use sizehint!(), instead of just declaring an empty Dict?
But above all I'm interested in knowing the strategy that this function implements.

Good question. I don't think it really matters here, but it's a good habit to provide this information to the code. Perhaps a future implementation of Dict will be more sensitive to it.

The strategy is pretty simple. There is a little trick with LittleDict and mergewith! which avoids computing the hash twice (!) and does the max bit.

Thank you!
You actually anticipated a further question (with the previous one I was referring to sizehint!).
At a very high level, your solution looks like this, but almost 2x faster.
You attribute this advantage to avoiding the hash computation (of which I have only the vaguest idea).
Can you tell me where in mgr41 this calculation takes place?

function mgr41(s,r)
    d = Dict{Int, Float64}()
    for i in eachindex(s)
        ri = r[i]
        e = s[i]
        d[e] = max(get(d, e, ri), ri)
    end
    d
end

DataFrames devs spent so much time and effort optimizing performance in various "special" cases that these cases stop being "special" and encountering them becomes more typical (:

Would be really great if the wider Julia ecosystem (including Base!) could benefit from this effort. It's a shame these optimizations are only available for a single type of a single package.
The more general interface, still simple but applicable to a wide range of collection types, is there, but the implementation is not always as performant as in DataFrames.

As far as I'm aware, the most performant groupby-style functions not tied to specific container types are in FlexiGroups (I am the author):

using StructArrays, FlexiGroups
using DataPipes  # assuming @p here is DataPipes' pipeline macro

tbl = StructArray(
    s=repeat(1:10^6, inner=4),
    t=repeat(1:4, 10^6),
    r=rand(4*10^6),
)
@p let
    tbl
    groupview(_.s)
    map(maximum(_.r))
end

This reads intuitively, and works reasonably fast, but it always uses the straightforward dict-based algorithm, making it a few times slower than DataFrames for numbers.
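For comparison, the straightforward dict-based grouping amounts to something like this sketch (dict_groupby is a made-up name; the real implementation is more careful about key and index types):

function dict_groupby(keyf, xs)
    groups = Dict{Any, Vector{Int}}()   # simplified: a real version infers the key type
    for (i, x) in pairs(xs)
        push!(get!(Vector{Int}, groups, keyf(x)), i)
    end
    return groups
end

One hash lookup plus a push! per element; it is generic over collections and key functions, but has no equivalent of DataFrames' integer-range fast path.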


That line does hashing twice: once to run get and then later to run d[e], which is setindex!.
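You can watch the double hashing happen with a key type that counts hash calls (CountedKey and HASH_CALLS are made up for this illustration):

struct CountedKey
    x::Int
end
const HASH_CALLS = Ref(0)
Base.hash(k::CountedKey, h::UInt) = (HASH_CALLS[] += 1; hash(k.x, h))

d = Dict{CountedKey, Float64}()
k = CountedKey(1)
HASH_CALLS[] = 0
d[k] = max(get(d, k, 0.5), 0.5)
HASH_CALLS[]    # 2: one hash for get, one for setindex!

The mergewith! trick described above avoids the second of those lookups, which is where the savings come from.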


A problem with the mergewith! method is the need to generate a ton of (thankfully, costless) LittleDicts. To avoid this, we need an AbstractDict which isn't actually a Dict, because it breaks the contract of unique keys. Perhaps there is a better way to do this, but defining another AbstractDict is one:

struct FakeDict{K,V} <: AbstractDict{K,V}
    k::Vector{K}
    v::Vector{V}
end
Base.values(fd::FakeDict) = fd.v
Base.keys(fd::FakeDict) = fd.k
Base.IteratorSize(::Type{<:FakeDict}) = Base.HasLength()   # <: so concrete FakeDict{K,V} matches too
Base.IteratorEltype(::Type{<:FakeDict}) = Base.HasEltype()
Base.eltype(::Type{FakeDict{K,V}}) where {K,V} = Pair{K,V}
Base.length(fd::FakeDict) = length(fd.k)
# iterate pairs positionally; duplicate keys are allowed, unlike in a real Dict
Base.iterate(fd::FakeDict, s::Int=1) = @inline @inbounds s > length(fd.k) ? nothing : (fd.k[s] => fd.v[s], s + 1)

Armed with this FakeDict, we can:

using Random
using BenchmarkTools
using DataFrames
s = shuffle(repeat([randstring(5) for _ in 1:10^6],inner=4));
r = rand(4*10^6);

fd = FakeDict(s,r);
df = DataFrame(;s,r);

# warm up the methods
combine(groupby(df, :s),:r=>maximum; threads=false);
mergewith!(max, sizehint!(Dict{String, Float64}(),10^6), fd);

and then measure:

julia> @time mergewith!(max, sizehint!(Dict{String, Float64}(),10^6), fd);
  1.696373 seconds (7 allocations: 34.001 MiB)

julia> @time combine(groupby(df, :s),:r=>maximum; threads=false);
  1.622105 seconds (286 allocations: 148.880 MiB)

If you wanted to make an expression like the one below work, what would you need to implement? setindex!, getindex, others…?

mergewith!(max, FakeDict{String, Float64}(), fd)

PS: perhaps the biggest problem would be the setindex!() function.

For any columnar table type though, it should be fast (non-copying) to wrap it in a DataFrame (DataFrame(table; copycols=false)), perform whatever manipulations, and then convert it to whatever other table type you want. DataFrames are pretty light, so it doesn't seem like a big problem to use them as an interface for one piece of code, even if you want another table type for a downstream computation.
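As a rough sketch of that round trip with TypedTables (the === check is what I would expect from copycols=false; treat it as an assumption rather than a guarantee):

using DataFrames, TypedTables

tbl = Table(s = collect(1:4), r = rand(4))
df = DataFrame(tbl; copycols=false)   # wraps the existing column vectors
df.s === tbl.s                        # expected true: no data copied
out = combine(groupby(df, :s), :r => maximum)
Table(out)                            # back to a TypedTables.Table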

A big question would be what this FakeDict should achieve in this setting, as without unique keys it is mostly just a couple of vectors in a struct. The idea in the code was to use it to supply mergewith! with s => r pairs, in order to populate a real Dict, which does maintain the unique-keys guarantee.
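A quick sketch of that pair-supplying role (fd2 is just a fresh example instance; note the duplicate key, which a real Dict could not hold):

fd2 = FakeDict(["a", "b", "a"], [1.0, 2.0, 3.0])
collect(fd2)                                  # ["a" => 1.0, "b" => 2.0, "a" => 3.0]
mergewith!(max, Dict{String, Float64}(), fd2) # Dict("a" => 3.0, "b" => 2.0)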

That only works for flat tables, right? The power of Julia is that arbitrary types work just as well as builtins:

some_func(x::MyStruct) = ...
data = [MyStruct(x=1, y=2), ...]
data |> group(some_func) |> map(first)

All structs seamlessly propagate through the whole data-manipulation pipeline. They can also be nested.
Converting to a flat table type for intermediate computations probably means limiting oneself to just NamedTuples.

In this particular case you can use the structure of the s index in df to beat DataFrames:

using DataFrames
using BenchmarkTools

s=repeat(1:10^6, inner=4)
t=repeat(1:4, 10^6)
r=rand(4*10^6)

df=DataFrame(;s,t,r)
@btime combine(groupby(df, :s),:r=>maximum)
@btime maximum(reshape(r, 4, :), dims=1)

@assert combine(groupby(df, :s),:r=>maximum).r_maximum == vec(maximum(reshape(r, 4, :), dims=1))

I get: 45.615 ms and 19.339 ms

Some query engines in relational databases can make these kinds of optimizations if the s column is the result of joining smaller tables that the database understands.

One thing that I discovered when running benchmarks on this is that DataFrames' groupby/combine can beat Base.unique, so any optimizations under the hood could possibly benefit Base Julia:

@btime combine(groupby(df, :s), :s => first)
@btime unique(s)

@assert combine(groupby(df, :s), :s => first).s_first == unique(s)

Which gives: 23.328 ms and 84.853 ms


Not quite sure what you mean by "flat table" here, but for the DataFrame wrap to be non-copying you do need a column-oriented table, and this is a row-oriented table, meaning the data is stored as a collection of rows.

But I don't really see why preserving types at all costs is more important (or more related to "the power of Julia") than being able to compose functionality. In a tabular world, if everyone writes functions that accept any table and emit some table, then everything will compose fine (and can be perfectly type stable). It doesn't seem very advantageous to ask that every function also emit the exact same type that was input.

In some cases, an algorithm might be much more efficient to implement with a columnar structure. So if you have a row table like that, and require every function to emit the exact same type of row table, then internally the function might need to convert to a columnar structure, perform the manipulations, and convert back. But what if the next function also works better on columns? Then it needs to do the same thing again. Why not just emit the intermediate columnar table, and let the next function decide whether it wants a row-oriented or column-oriented table, instead of always converting back to the input type?

Edit: to make it more concrete, I'm proposing workflows like this:

using Tables, DataFrames, Test
# Some generic processing functions, maybe written in another package

# Accepts any table, returns a DataFrame
function first_from_each_group(table)
    df = DataFrame(table; copycols=false)
    return combine(first, groupby(df, :x))
end

# Accepts any table, processes each row. `f` should accept NamedTuples.
function iterate_over_rows(f, table)
    foreach(f, Tables.namedtupleiterator(table))
end

# My code to do a specific task
struct MyStruct
    x::Float64
    y::Matrix{Float32}
    z::Vector{Int}
end
randstruct() = MyStruct(rand([0.25, 0.5]), rand(Float32, 10, 10), rand(1:10, 10))

function my_code()
    # Create my table; maybe naturally row oriented bc I'm reading from a CSV or log file or
    # running a simulation producing one row at a time
    table = [randstruct(), randstruct(), randstruct()]

    # Do some processing using generic code that accepts any table & emits some table
    firsts = first_from_each_group(table)
    iterate_over_rows(firsts) do row
        # Do something that requires my specific type wrapper, maybe serialization, or a strictly-typed function
        println(MyStruct(row...))
    end
end


This drifts kinda sideways wrt the original topic, but…

"Flat table": a simple 2d container with columns (each defined by a name) and rows; basically what you can find in a CSV file or a dataframe (:

Row- vs column-oriented wasn't the focus of my message. Sure, column-oriented is often faster, and Julia transparently supports this - see StructArrays. Packages (e.g. FlexiGroups) also work with them, often with no special effort.

Sure, me neither! Don't see where you could get that from.
My point is that using "regular" collections (arrays, dictionaries, …) is more composable, both with the vast Julia array ecosystem and with custom types.
It's more general than "flat" tables, and remains performant and convenient, including for traditional tabular operations.

In the Julia world, if everyone writes functions that accept and return basic collections, then everything composes fine (and can be perfectly type stable).
It so happens that many functions do return the same "kind" of container as their input, both in Base data-processing functions and in packages.
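Base's filter is a handy example of that convention:

filter(isodd, [1, 2, 3])                         # Vector in, Vector out
filter(isodd, Set([1, 2, 3]))                    # Set in, Set out
filter(kv -> isodd(kv[2]), Dict(:a=>1, :b=>2))   # Dict in, Dict out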

Yeah, and this is directly supported with e.g. StructArrays. One can either freely mix SoA (struct-of-arrays, "columnar" in table-speak) and AoS (array-of-structs) while keeping the same array interface, or stick to StructArrays everywhere for the performance/convenience of columnar storage.

But, really? All those back-and-forth conversions instead of table |> group(_.x) |> map(first)… And the latter directly supports arbitrary objects, including nested ones.

Iโ€™ve had too many Python with its pandas dataframes with columns like id, flux_A, flux_A_err, flux_B, flux_B_err instead of the natural (; id, fluxes=(A=val ยฑ err, B=val ยฑ err)). The latter isnโ€™t really possible in Python, but works perfectly fine in Julia.