How is the data ecosystem right now for large datasets?

My Stata benchmark wasn’t quite right. Stata doesn’t support multiple in-memory datasets, so the by operation I benchmarked was actually a by plus a join.

Regarding joins, for DataTables:

dt2 = by(dt, :B, d -> mean(d[:A]))
join(dt, dt2, on = :B)

For pandas:

df2 = mean(groupby(df, "B"))
df3 = merge(df, df2, left_on = "B", right_index = true)

I suspect there is some way to do this all in one go, like with broadcasting and the dot notation in Julia.
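On the pandas side, one candidate for doing it in one go is `groupby(...).transform`, which returns the group statistic aligned to the original rows, so no separate merge is needed. The sketch below is plain Python pandas rather than the Julia-style calls above. The toy data and the `A_mean` column name are made up for illustration, and it has not been timed against the figures in this post.

```python
import pandas as pd

# Hypothetical toy data; the dataset used for the benchmark is not shown in the post.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": ["x", "x", "y", "y"]})

# Two-step version, mirroring the post: aggregate by B, then merge back onto df.
df2 = df.groupby("B")[["A"]].mean().rename(columns={"A": "A_mean"})
df3 = df.merge(df2, left_on="B", right_index=True)

# One-step version: transform yields one group mean per original row,
# so the result can be assigned directly as a new column.
df["A_mean"] = df.groupby("B")["A"].transform("mean")
```

Both versions should give every row the mean of `A` within its `B` group. Whether the one-step form is faster on large data would need to be measured.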

DataTables: 459s
Pandas: 14s