Hi!
I’m new to distributed computing with Julia and I’m trying to figure out how best to structure this problem: I’d like spawning a task on each group of a GDTable to be as efficient as possible.
I have a GDTable of roughly 30,000 groups split into 30,000 partitions. I run a task on some columns of each group. The tasks’ compute times are very heterogeneous and fairly long.
I’m doing something like this:
using Distributed
addprocs(40; lazy = false) # lazy = false because I had errors otherwise
using Dagger
using JLD2

@everywhere begin
    using DataFrames
    using DTables
    # etc.
end

@everywhere function f(x1, x2)
    do_stuff(collect(x1), collect(x2)) # returns some data
end
data = JLD2.load(file, "data") # a dataframe
data = DTable(data, tabletype = DataFrame)
gdata = DTables.groupby(data, :group) # N groups, N partitions
t = [Dagger.@spawn f(df.x1, df.x2) for (k, df) in gdata] # use the wrapper f so the lazy columns are collected on the worker
fetch.(t)
What I’m trying to figure out: does Dagger take into account which partition (and thus which process’s memory) each df.xi lives on, so as to minimize data transfers between processes, or does it spawn each task on any available worker? If not, is there something I can do about it?
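For instance, would manually pinning each task to the worker that holds its partition be the right approach? Here is a minimal sketch of what I have in mind; I’m assuming Dagger.scope(worker = w) is the way to restrict where a task runs, and owner_of is a hypothetical helper, since I don’t know the real API for querying which worker owns a partition:

# Hypothetical sketch: pin each group's task to the worker owning its partition.
# owner_of is a placeholder, not a real DTables/Dagger function.
t = [Dagger.@spawn scope=Dagger.scope(worker = owner_of(df)) f(df.x1, df.x2)
     for (k, df) in gdata]
fetch.(t)

Although, given how heterogeneous the task durations are, hard-pinning might hurt load balancing, so I’d rather let the scheduler handle locality if it already can.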
Note: Multi-threading instead of Distributed programming is not an option for this specific problem.
Thank you very much in advance for your help.
Sam