Since @juliohm pinged me in [ANN] FileTrees.jl -- easy everyday parallelism on trees of files, I thought I'd give an example of how this could be done with FileTrees.jl.
```julia
# SETUP FOR ALL PROCESSES
# -----------------------
@everywhere begin
    # instantiate environment
    using Pkg; Pkg.instantiate()

    # load dependencies
    using CSV, DataFrames, FileTrees

    function process(infile)
        # read file from disk
        csv = CSV.read(infile, DataFrame)
        # perform calculations
        sleep(60) # pretend it takes time
        csv.new = rand(size(csv, 1))
        return csv # return the DataFrame, not just the new column
    end
end

# MAIN SCRIPT
# -----------
# relevant directories
indir  = joinpath(@__DIR__, "data")
outdir = joinpath(@__DIR__, "results")

# files to process
infiles = FileTree(indir)

new_dfs = FileTrees.load(infiles, lazy=true) do f
    try
        process(path(f))
    catch e
        @warn "failed to process $(path(f))"
        DataFrame() # store an empty file?
    end
end

# same tree structure, but rooted at the results directory
outfiles = rename(new_dfs, outdir)

FileTrees.save(outfiles) do file
    # save new file to disk
    CSV.write(path(file), get(file))
end # computation actually happens here.
```
This will launch the tasks using the Dagger scheduler. It will use both Threads and Distributed procs: if Julia is started from an env with JULIA_NUM_THREADS=N, each worker process will also run N threads.
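For completeness, here is one way such a setup might be launched; a minimal sketch using the Distributed stdlib, where the worker count and thread count are illustrative:

```julia
# illustrative launch: 4 worker processes, each with 2 threads
# (roughly equivalent to starting Julia as: JULIA_NUM_THREADS=2 julia -p 4)
using Distributed
addprocs(4; exeflags="--threads=2") # the --threads flag requires Julia >= 1.5
```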
You can apply `mapvalues` on `new_dfs`; it will be a lazy operation and will not bring the data to the master process. You can also use `reducevalues` to reduce over an iterator of all the DataFrames.
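As a sketch of what that looks like (the row count here is just a stand-in for a real aggregation):

```julia
# lazy per-file transformation; the DataFrames stay on the workers
nrows = mapvalues(df -> size(df, 1), new_dfs)

# reduce over all values in the tree; exec forces the lazy computation
total_rows = exec(reducevalues(+, nrows))
```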
This same code will work if the files are in a nested directory structure; besides, you get to filter files with `glob""` syntax.
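For example (the pattern is illustrative; the `glob""` string macro comes from Glob.jl, which FileTrees indexing accepts):

```julia
using Glob
# keep only CSV files one directory level below the root
csvs = infiles[glob"*/*.csv"]
```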
So it’s useful if you have a cluster but also want to use multithreading on each node in the cluster to speed up communication.
@juliohm I don't think I have anything more to add about Distributed beyond what's already mentioned in this thread though!