Parallel deployment in DataFrames and CSV operations

Hi. Using Julia happily for 3 years.

Just checking before I do some potentially destructive file handling. I have huge amounts of monthly data that I want to clean and reformat. I have working code and an 8-core laptop.

In the code, I check whether a substring is in each file name, delete the files that are irrelevant, and copy the keepers with a flag. Later I delete everything except the keepers.
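For readers following along, the keep-and-flag step might look roughly like the sketch below. It is not the code from this thread: the function name `copy_keepers`, the `needle` and `flag` keywords, and the idea that the "flag" is a file-name prefix are all assumptions made for illustration. It only copies and never deletes, so it is safe to try on a scratch folder.

```julia
# Copy every file in `src` whose name contains `needle` into `dest`,
# prefixing the copy with `flag`. Nothing is deleted; originals stay in place.
# (Hypothetical helper; names and the prefix-as-flag idea are assumptions.)
function copy_keepers(src::AbstractString, dest::AbstractString;
                      needle::AbstractString, flag::AbstractString = "keep_")
    mkpath(dest)                      # create the keepers folder if needed
    kept = String[]
    for name in readdir(src)          # readdir returns names in sorted order
        path = joinpath(src, name)
        isfile(path) || continue      # skip sub-folders
        if occursin(needle, name)
            cp(path, joinpath(dest, flag * name); force = false)
            push!(kept, name)
        end
    end
    return kept
end

# Demo on throwaway folders with made-up file names.
src  = mktempdir()
dest = mktempdir()
foreach(n -> touch(joinpath(src, n)), ["sales_2023.csv", "notes.txt", "sales_old.csv"])
kept = copy_keepers(src, dest; needle = "sales")
```

Keeping the copy step separate from any deletion makes it possible to inspect `dest` before removing anything.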

I tested and did proof of concept for Jan. In Feb I optimised for speed with type specs and considering loop invariants blah blah.

All the action happens in one discrete parent/working folder per month, though things get copied and removed a little between folders within that.

I propose to open five terminals, for five months at a time, for the remaining 10 months, and run a single-threaded Julia instance in each.

The question is more abstract than the code. The code is unremarkable. It’s a question about Julia’s behaviour in parallel terminals in Linux.

Here is the question.

Ubuntu box on standard x86-64. If I CTRL+ALT+T and bring up five parallel 1 thread instances can I safely run five months in parallel? Is it thread safe in that regard?

Thanks

Instances of Julia do not interfere with each other.

Although I have never run it for 5 months at a time.

Good
Thanks
Solved

You might want to look into Distributed

which will spawn independent processes for you

then you won’t need 5 terminals

https://docs.julialang.org/en/v1/stdlib/Distributed/
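As a rough sketch of what that could look like for the ten remaining months: `addprocs`, `@everywhere` and `pmap` are real functions from the `Distributed` standard library, while `process_month` and the folder names are hypothetical stand-ins for the actual clean-up code and data layout.

```julia
using Distributed

# Start five worker processes. Each worker is a separate Julia process,
# much like a Julia instance in its own terminal.
addprocs(5)

# Hypothetical per-month routine, defined on every process. Here it only
# counts the entries in the folder; the real clean-up code would go here
# and should touch files inside `dir` only.
@everywhere function process_month(dir::AbstractString)
    return dir => length(readdir(dir))
end

# Made-up layout: one working folder per month, created in a temp directory
# so the sketch runs anywhere.
root = mktempdir()
month_dirs = [joinpath(root, "month_" * lpad(m, 2, '0')) for m in 3:12]
foreach(mkpath, month_dirs)

# pmap hands one folder at a time to whichever worker is free, so at most
# five months are processed concurrently.
results = pmap(process_month, month_dirs)
```

For a one-off job, separate terminals achieve the same thing; the main convenience here is that the ten months are queued automatically.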

The only concern I would have here is making sure that the 5 processes are working on disjoint sets of files. It is also worth pointing out that if your code is very IO heavy, using 5 workers may not help much: the processes all share the same disk, and IO often does not parallelize well.
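One way to sanity-check the disjointness up front is to confirm that no working folder is the same as, or nested inside, another. The helper below is a minimal sketch with a hypothetical name (`are_disjoint`) and made-up folder names. It compares paths component by component and does not resolve symlinks, so it is a first check rather than a guarantee.

```julia
# Return true when no folder in `dirs` equals, or is nested inside, another.
# (Hypothetical helper; it does not resolve symlinks.)
function are_disjoint(dirs::AbstractVector{<:AbstractString})
    parts = [splitpath(abspath(d)) for d in dirs]
    for i in eachindex(parts), j in eachindex(parts)
        i == j && continue
        a, b = parts[i], parts[j]
        # `a` contains (or equals) `b` when `a` is a prefix of `b`.
        if length(a) <= length(b) && a == b[1:length(a)]
            return false
        end
    end
    return true
end

ok_layout  = are_disjoint(["data/2023-03", "data/2023-04"])
bad_layout = are_disjoint(["data/2023-03", "data/2023-03/keepers"])
```

Comparing whole path components rather than raw strings avoids false alarms for names that merely share a prefix, such as `data/2023-0` and `data/2023-03`.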

Thank you @lawless-m
I am just doing it once so I’ll do it across terminals
I will look into this Pkg anyway for my edu

Thank you @tbeason
I do not need super-efficiency; even a degree of parallelism is qualitatively different from none
I take your point re IO being especially constraining though