Parallelizing Multiple Workers for File Operations

Hi all,

This is my first post here. Below is my problem, the approach I am planning, and where I am stuck:

Problem: I have many big files (200+, 50-100 GB each) that I need to read in, run some analysis on, and then save the results to a file.

Approach: I am using a Python library to handle reading the files, and that part works fine. My plan is to parallelize reading and analyzing the files, have each worker write its results to a separate Arrow file, and then finally concatenate those files into one master Arrow file.

Where I am stuck: How do I create worker processes in Julia that each read and analyze a file and then write out the results? How do I tell each worker which file to work on? Overall, is this a good approach, or am I missing something?
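For concreteness, here is a rough sketch of what I imagine the Distributed version looking like (untested; `analyze` is a stand-in for my read + analysis step, and `files` is my list of paths). Is this roughly the right shape?

using Distributed
addprocs(8)                          # or start Julia with `julia -p 8`
@everywhere using Arrow, Tables

@everywhere function process(file)
    result = analyze(file)           # stand-in: my read + analysis step, returning a table
                                     # (would need to be defined on all workers too)
    out = file * ".arrow"
    Arrow.write(out, result)         # each worker writes its own Arrow file
    return out
end

outputs = pmap(process, files)       # pmap hands one file to each free worker

# stitch the per-worker files into one master file, one record batch per part
Arrow.write("master.arrow", Tables.partitioner(Arrow.Table, outputs))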

Thank you!

P.S. I read about this on Discourse; is it as simple as writing something like:

Threads.@threads for f in files
    # do the per-file work in this thread
end

?
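i.e., if I fill that in, I picture the threaded version looking something like this (needs Julia started with multiple threads, e.g. `julia -t 8`; `analyze` is the same stand-in as above):

using Arrow

Threads.@threads for f in files
    result = analyze(f)                # stand-in: per-file read + analysis step
    Arrow.write(f * ".arrow", result)  # one output file per input, so threads never share a file
end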

~ tcp 🌳

FileTrees.jl (GitHub: shashi/FileTrees.jl, "Parallel file processing made easy") does this for you if you just want to get going quickly. Just add threads or workers and use the lazy flag.
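Roughly, going by its README (a sketch, not tested against your data; `analyze` stands in for your per-file work):

using Distributed
addprocs(4)                        # FileTrees distributes work across these workers
@everywhere using FileTrees, DataFrames

tree = FileTree("data")            # directory containing your input files
results = FileTrees.load(tree; lazy=true) do file
    analyze(string(path(file)))    # your read + analysis step, returning a table
end
df = exec(reducevalues(vcat, results))   # forces the lazy graph, runs in parallel

A single Arrow.write("master.arrow", df) at the end would then give you the master file directly, without the intermediate per-worker files.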
