With regard to multi-threading support in DataFrames, I understand that it remains a WIP.
I wanted to check whether some situations that appear, at least on the surface, to be good candidates for multi-threading are on the radar, or whether they are subject to some other constraints.
Transformations involving a vector of operations

When multiple transformations are passed as arguments to transform!, wouldn't it be safe to @threads over that vector of operations?
For example:
using DataFrames
using BenchmarkTools
nrows = 1_000_000
df = DataFrame(id=rand(["A"], nrows), v1=rand(nrows), v2=rand(nrows), v3=rand(nrows), v4=rand(nrows))
f1 = "v1" => ByRow(exp) => "new1"
f2 = "v2" => ByRow(exp) => "new2"
f3 = "v2" => ByRow(exp) => "new3"
f4 = "v2" => ByRow(exp) => "new4"
funs = [f1, f2, f3, f4]
function df_trans_A(df, funs)
    transform!(df, funs[1])
    transform!(df, funs[2])
    transform!(df, funs[3])
    transform!(df, funs[4])
end

function df_trans_B(df, funs)
    transform!(df, funs)
end
# 19.500 ms (644 allocations: 30.55 MiB)
@btime df_trans_A($df, $funs);
# 19.325 ms (410 allocations: 30.54 MiB)
@btime df_trans_B($df, $funs);
We can see that, currently, calling transform! four times, once per transformation, takes the same time as calling transform! once with the vector of operations, even though the latter could have safely processed the operations in parallel.
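As a point of comparison, the per-column work can be parallelized by hand. The sketch below is an assumption on my part, not DataFrames API: it relies on each pair having the src => fun => dest shape used above, on the destination columns being distinct, and on ByRow objects being callable on whole vectors (they broadcast the wrapped function). The function name df_trans_threaded! is hypothetical.

```julia
using DataFrames
using Base.Threads

function df_trans_threaded!(df, funs)
    newcols = Vector{Pair{String, Vector{Float64}}}(undef, length(funs))
    @threads for i in eachindex(funs)
        src, (f, dest) = funs[i]            # destructure "v1" => ByRow(exp) => "new1"
        newcols[i] = dest => f(df[!, src])  # independent reads, so safe to parallelize
    end
    for (dest, v) in newcols                # attach results serially: mutating the
        df[!, dest] = v                     # DataFrame itself is not thread-safe
    end
    return df
end
```

The key point is that the computations are embarrassingly parallel; only the final attachment of columns needs to be serialized.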
Performance regression on grouped dataframes
Following the above example, if the same transformations are performed on a dataframe grouped on id, the same performance could reasonably be expected, since there is only a single group (id = "A").
However, it appears that the sequence of transformations is almost 3x slower, while the transform on the vector of operations takes a similar amount of time.
gdf = groupby(df, "id")
function gdf_trans_A(gdf, funs)
    transform!(gdf, funs[1])
    transform!(gdf, funs[2])
    transform!(gdf, funs[3])
    transform!(gdf, funs[4])
end

function gdf_trans_B(gdf, funs)
    transform!(gdf, funs)
end
# 56.028 ms (2963 allocations: 152.76 MiB)
@btime gdf_trans_A($gdf, $funs);
# 19.652 ms (1124 allocations: 129.76 MiB)
@btime gdf_trans_B($gdf, $funs);
Also, if the same experiment is repeated with 10 groups, the results deteriorate further for both methods, although the underlying operations remain the same (a by-row exp of each element):
# 10 groups
# 114.379 ms (3284 allocations: 195.40 MiB)
@btime gdf_trans_A($gdf, $funs);
# 49.737 ms (1444 allocations: 172.41 MiB)
@btime gdf_trans_B($gdf, $funs);
Multiple operations in combine

The situation is similar with combine, where threading does not seem to be taken advantage of:
using DataFrames
using BenchmarkTools
nrows = 1_000_000
# 10 groups:
df = DataFrame(id=rand(["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"], nrows), v1=rand(nrows), v2=rand(nrows), v3=rand(nrows), v4=rand(nrows))
# 1 group (overwrites the above; keep only one of the two definitions to reproduce each benchmark):
df = DataFrame(id=rand(["A"], nrows), v1=rand(nrows), v2=rand(nrows), v3=rand(nrows), v4=rand(nrows))
f1 = "v1" => sum => "new1"
f2 = "v2" => sum => "new2"
f3 = "v2" => sum => "new3"
f4 = "v2" => sum => "new4"
funs = [f1, f2, f3, f4]
function df_trans_A(df, funs)
    dfg = groupby(df, :id)
    agg = combine(dfg, funs[1])
    agg = combine(dfg, funs[2])
    agg = combine(dfg, funs[3])
    agg = combine(dfg, funs[4])
end

function df_trans_B(df, funs)
    dfg = groupby(df, :id)
    agg = combine(dfg, funs)
end
# 1 Group
# 26.565 ms (1067 allocations: 31.33 MiB)
@btime df_trans_A($df, $funs);
# 17.194 ms (546 allocations: 31.29 MiB)
@btime df_trans_B($df, $funs);
# 10 Groups
# 18.536 ms (1068 allocations: 31.33 MiB)
@btime df_trans_A($df, $funs);
# 15.977 ms (546 allocations: 31.29 MiB)
@btime df_trans_B($df, $funs);
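Here too the aggregations are independent of each other, so they can be spawned as tasks by hand. This is a sketch under assumptions I have not verified: that concurrently reading a GroupedDataFrame from several tasks is safe (nothing is mutated here), and that :id is the grouping key. The name combine_threaded is hypothetical.

```julia
using DataFrames
using Base.Threads: @spawn

function combine_threaded(gdf, funs)
    tasks = [@spawn combine(gdf, f) for f in funs]  # one task per aggregation
    parts = fetch.(tasks)                           # each part: id column + one result column
    # join the single-column results back together on the grouping key
    reduce((a, b) -> innerjoin(a, b, on = :id), parts)
end
```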
function df_trans_A(df, funs)
    agg = combine(df, funs[1])
    agg = combine(df, funs[2])
    agg = combine(df, funs[3])
    agg = combine(df, funs[4])
end

function df_trans_B(df, funs)
    agg = combine(df, funs)
end
# 1.664 ms (444 allocations: 28.44 KiB)
@btime df_trans_A($df, $funs);
# 1.640 ms (382 allocations: 21.69 KiB)
@btime df_trans_B($df, $funs);
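For the ungrouped case the manual version is even simpler, since each combine returns a one-row DataFrame: the independent aggregations can be spawned and the results concatenated. Again a sketch, assuming the output column names are distinct (as with new1 to new4 above):

```julia
using DataFrames
using Base.Threads: @spawn

function combine_threaded(df, funs)
    tasks = [@spawn combine(df, f) for f in funs]  # one task per aggregation
    hcat(fetch.(tasks)...)                         # each result is a 1-row DataFrame
end
```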
To sum up, are the above situations effectively good candidates for parallelization in DataFrames?