I am currently having issues with parallel processing using Julia’s Distributed library on an EC2 instance. My script attempts to speed up a series of sequential actions by running some parts in parallel with the remotecall function. The code that I run in parallel copies a large number of files into a temp directory and performs some operations on them. Running this on my local machine via Docker gives me the expected result and the expected speedup in runtime. However, when I upload the exact same Docker image to AWS ECR and start a Batch job that runs the same code on EC2, the parallel steps appear to run correctly, but later down the line the information that was generated during those steps isn’t available. Does anyone have experience using Julia’s Distributed library with EC2 instances, or just with virtual CPUs in general, and did you have to do anything differently than with normal CPUs? Thanks in advance for any help.
Very stupid reply - the temp directory is going to be local to the individual VM, so it will be different for each VM.
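One quick way to check this is to ask every process where it is running and which temp directory it sees. This is only a minimal sketch: the two local workers started with addprocs(2) are an assumption for illustration, and in a real multi-VM setup the workers would be started on the other hosts instead.

```julia
using Distributed

# Assumption: two local workers, just for illustration.
addprocs(2)

# Ask every process (including the master, id 1) for its hostname
# and the temp directory it would use.
for p in procs()
    host, tmp = remotecall_fetch(() -> (gethostname(), tempdir()), p)
    println("process $p on $host uses temp dir $tmp")
end
```

If all processes report the same hostname and temp dir, they share one filesystem and the temp directory is probably not the culprit.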
Are you using AWS pcluster to set up the cluster? You can request a filesystem which is common to all the VMs. Please forgive me if I do not understand the problem!
Duh - read the post, John. It's Docker on ECR.
If you use pcluster for a more traditional HPC-type setup, you can have FSx for Lustre storage.
Thanks for the reply!
Maybe my post was a bit misleading: I’m just using a single EC2 instance. I guess the name of the library itself, ‘Distributed’, is a bit misleading, as I’m only trying to use one computer, but my intention was to take advantage of the multiple vCPUs provided by EC2.
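For reference, here is roughly what that single-machine pattern looks like. This is a hedged sketch, not my actual script: write_part, workdir, and the cap of 4 workers are hypothetical names and values chosen for illustration. It creates the work directory once on the master, passes the path to each worker explicitly, and fetches every Future before the results are used.

```julia
using Distributed

# Assumption: at most 4 local workers; adjust to the vCPUs actually available.
addprocs(min(Sys.CPU_THREADS, 4))

# Hypothetical helper, defined on all processes: writes one file into `dir`.
@everywhere function write_part(dir, id)
    path = joinpath(dir, "part_$id.txt")
    write(path, "written by worker $id")
    return path
end

# Create the work directory once on the master and hand the path to the workers.
workdir = mktempdir()

# remotecall returns immediately with a Future for each worker.
futures = [remotecall(write_part, w, workdir, w) for w in workers()]

# Wait for all workers to finish before reading their output.
paths = fetch.(futures)

for p in paths
    println(p, " exists on master: ", isfile(p))
end
```

On a single machine all workers see the same filesystem, so the files should be visible to the master once the fetch calls have returned.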