Dagger: one task per group in a GDTable

Hi!

I'm new to distributed computing with Julia and I'm trying to figure out the best way to approach this problem: I'd like to make spawning one task per group of a GDTable as efficient as possible.

I have a GDTable with roughly 30,000 groups split into 30,000 partitions, and I run a task on a few columns of each group. The tasks' compute times are fairly long and very heterogeneous.

I’m doing something like this:

using Distributed
addprocs(40, lazy = false)  # lazy = false because I got errors otherwise
using Dagger
using JLD2

@everywhere begin
    using DataFrames
    using DTables
    # etc
end

@everywhere function f(x1, x2)
    do_stuff(collect(x1), collect(x2)) # materialize the columns, then compute; returns some data
end

data = JLD2.load(file, "data")  # a DataFrame
data = DTable(data, tabletype = DataFrame)
gdata = DTables.groupby(data, :group) # N groups, N partitions

t = [Dagger.@spawn f(df.x1, df.x2) for (k, df) in gdata]
fetch.(t)

What I'm trying to figure out: does Dagger take into account which partition (and thus which process's memory) each df.xi lives on, so as to minimize data transfers between processes, or does it spawn each task on any available worker? If not, is there something I can do about it?
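
If the scheduler doesn't account for this automatically, the only workaround I can think of is pinning each task to a specific worker with the scope option (if I'm reading the Dagger docs right). Here's an untested sketch that just round-robins groups over workers; it balances the task count but still ignores where each partition's data actually lives. Ideally I'd scope each task to the worker that already holds that group's chunk, but I don't know how to query that:

ws = workers()
t = map(enumerate(gdata)) do (i, (k, df))
    # round-robin: pin group i to worker ws[mod1(i, length(ws))]
    w = ws[mod1(i, length(ws))]
    Dagger.@spawn scope=Dagger.scope(worker=w) f(df.x1, df.x2)
end
fetch.(t)

If Dagger's scheduler already uses chunk locality here, this workaround shouldn't be needed at all.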

Note: multi-threading instead of distributed programming is not an option for this specific problem.

Thank you very much in advance for your help.

Sam