Reading and processing Data files concurrently

amit.murthy · September 20, 2017, 4:37pm

You can try both these approaches and see which fits your use case better.

Distributed, multi-process - use addprocs(N) and pmap

pseudocode:

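using Distributed   # required on Julia 1.0 and later; in 0.6 these functions lived in Base
addprocs()          # starts one worker per core by default; pass N to choose the count
@everywhere begin
   # Load the package and define the function on every worker process.
   using DataFrames
   function process_file(fname)
   .......
   end
end

# Each file goes to the next free worker; results come back in input order.
results = pmap(process_file, list_of_files)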
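To make the multi-process approach concrete, here is a small self-contained sketch. The `count_rows` function and the generated sample files are hypothetical stand-ins for the real `process_file` and `list_of_files`, and the worker count of 2 is arbitrary. It assumes Julia 1.0 or later and workers on the same machine, so they can see the same file paths.

```julia
using Distributed

# Start two local workers; the count is an arbitrary choice for this sketch.
addprocs(2)

# Hypothetical stand-in for the real per-file work: count the data rows
# (every line after the header) in a delimited text file.
@everywhere function count_rows(fname)
    return max(countlines(fname) - 1, 0)
end

# Create a few small sample files so the example is self-contained.
dir = mktempdir()
list_of_files = map(1:3) do i
    path = joinpath(dir, "sample_$i.csv")
    open(path, "w") do io
        println(io, "id,value")
        for r in 1:i
            println(io, r, ",", 10r)
        end
    end
    path
end

# pmap hands one file at a time to whichever worker is free and
# returns the results in the same order as list_of_files.
results = pmap(count_rows, list_of_files)

# Shut the workers down again once the work is done.
rmprocs(workers())
```

Because each worker is a separate process, whatever `pmap` returns is serialized back to the master. It tends to pay off when the per-file result is much smaller than the file itself.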

Multi-threaded, single process. Read the files asynchronously and process them in sets of N. Pseudocode would be something like this:

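const N = 4
const data_chnl = Channel{Any}(N)   # holds at most N loaded tables at a time
# Producer task: read every file, then close the channel so the consumer loop ends.
# (@schedule was removed in Julia 1.0; @async is its replacement.)
@async begin
  @sync for f in list_of_files
    @async put!(data_chnl, readTable(f, sep, h))
  end
  close(data_chnl)
end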

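data = []
for d in data_chnl
  push!(data, d)
  if length(data) == N
    # A full batch of N tables is loaded: process it on the available threads.
    Threads.@threads for d2 in data
       process_read_file(d2)
    end
    empty!(data)
  end
end

# Process the final, partial batch left over when the channel closes.
if length(data) > 0
  Threads.@threads for d2 in data
    process_read_file(d2)
  end
  empty!(data)
end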
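On Julia 1.3 or later, the same idea can be sketched with `Threads.@spawn` and a bounded channel, which avoids the explicit batching. This is a variant under stated assumptions, not the code above: `load_file`, `process_read_file` and `process_all` are hypothetical stand-ins, and the sample files are generated only to make the example runnable. Julia has to be started with more than one thread (for example `julia -t 4`) for the processing to run in parallel.

```julia
using Base.Threads: @spawn

# Hypothetical stand-ins for reading and processing one file.
load_file(fname) = readlines(fname)
process_read_file(lines) = length(lines)

function process_all(list_of_files; N = 4)
    # Producer: reads files in order; at most N loaded files wait in the
    # channel. The channel closes automatically when this task finishes.
    data_chnl = Channel{Tuple{Int,Vector{String}}}(N) do ch
        for (i, f) in enumerate(list_of_files)
            put!(ch, (i, load_file(f)))
        end
    end

    results = Vector{Int}(undef, length(list_of_files))
    # Consumers: N tasks pull loaded files until the channel is drained.
    # Each task writes to a distinct index, so no lock is needed.
    @sync for _ in 1:N
        @spawn for (i, d) in data_chnl
            results[i] = process_read_file(d)
        end
    end
    return results
end

# Small sample files so the sketch is self-contained: file i has i lines.
dir = mktempdir()
list_of_files = map(1:5) do i
    path = joinpath(dir, "sample_$i.txt")
    write(path, "row\n"^i)
    path
end

results = process_all(list_of_files; N = 2)
```

The channel capacity bounds how many loaded files sit in memory waiting to be processed, which is the role the sets of N play in the pseudocode.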
