How is the data ecosystem right now for large datasets?

aaowens · June 16, 2017, 3:09pm

My Stata benchmark wasn’t quite right. Stata doesn’t support multiple in-memory datasets, so the by operation I benchmarked was actually a by plus a join.

Regarding joins, for DataTables:

dt2 = by(dt, :B, d -> mean(d[:A]))
join(dt, dt2, on = :B)

For pandas:

df2 = mean(groupby(df, "B"))
df3 = merge(df, df2, left_on = "B", right_index = true)

I suspect there is some way to do this all in one go, like with broadcasting and the dot notation in Julia.
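On the pandas side, one candidate for doing this in a single step is `groupby` followed by `transform`, which computes the per-group mean and broadcasts it back onto the original rows without a separate merge. Below is a minimal sketch in plain Python pandas, not the benchmark code itself: the toy data and the `A_mean` column name are made up for illustration, and only the column names `A` and `B` come from the snippets above.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names A and B
# match the snippets above.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": ["x", "x", "y", "y"]})

# Two-step version: aggregate per group, then join the result back on B.
df2 = df.groupby("B")[["A"]].mean()
df3 = df.merge(df2, left_on="B", right_index=True, suffixes=("", "_mean"))

# One-step version: transform returns a Series aligned with df's rows,
# holding each row's group mean, so no explicit merge is needed.
df["A_mean"] = df.groupby("B")["A"].transform("mean")
```

Both versions should give every row the mean of `A` within its `B` group. Whether the one-step form is actually faster on the benchmark data would need to be measured.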

DataTables: 459s
Pandas: 14s
