Parallel deployment in DataFrames and CSV operations

AlanAylmer · March 18, 2022, 8:29am

Hi. Using Julia happily for 3 years.

Just checking before I do some potentially destructive file handling. I have huge amounts of monthly data that I want to clean and reformat. I have working code and an 8-core laptop.

In the code, I check whether a substring is in a file name, delete the files that are irrelevant, and copy the keepers with a flag. Then later I delete everything except the keepers.
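For readers who want something concrete, here is a minimal sketch of that kind of filename filtering. It is not the actual code from this project: the function name `plan_cleanup` and the `needle` argument are hypothetical, and it only builds the keep/drop lists without deleting or copying anything, so the plan can be inspected first.

```julia
# Hypothetical helper: split the file names in `dir` into keepers and the rest.
# It only plans the cleanup; nothing is deleted or copied here.
function plan_cleanup(dir::AbstractString, needle::AbstractString)
    # Only regular files directly inside `dir`, not subfolders.
    files = filter(f -> isfile(joinpath(dir, f)), readdir(dir))
    keep = filter(f -> occursin(needle, f), files)
    drop = filter(f -> !occursin(needle, f), files)
    return keep, drop
end
```

Once the two lists look right, the destructive step could call `rm` or `cp` on the joined paths.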

I tested and did proof of concept for Jan. In Feb I optimised for speed with type specs and considering loop invariants blah blah.

All the action happens in one discrete parent/working folder per month, though things get copied and removed a little between folders within that.

I propose to open five terminals for five months at a time for the remaining 10 months, and run a single-threaded Julia instance in each.

The question is more abstract than the code. The code is unremarkable. It’s a question about Julia’s behaviour in parallel terminals in Linux.

Here is the question.

Ubuntu box on standard x86-64. If I CTRL+ALT+T and bring up five parallel single-threaded instances, can I safely run five months in parallel? Is it thread safe in that regard?

Thanks

lawless-m · March 18, 2022, 8:53am

Instances of Julia do not interfere with each other.

Although I have never run it for 5 months at a time.

AlanAylmer · March 18, 2022, 11:35am

Good
Thanks
Solved

lawless-m · March 18, 2022, 11:53am

You might want to look into Distributed

which will spawn independent processes for you

then you won’t need 5 terminals

https://docs.julialang.org/en/v1/stdlib/Distributed/
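A minimal sketch of what that could look like, assuming one working folder per month. The `process_month` function and the `data/2021-MM` folder layout are hypothetical placeholders, not taken from the code in question; `addprocs`, `@everywhere` and `pmap` are from the Distributed standard library.

```julia
using Distributed

# Start 5 worker processes (assumption: 5, to mirror the five terminals).
addprocs(5)

# Hypothetical per-month job, defined on every worker.
# Replace the body with the real clean/reformat code.
@everywhere function process_month(dir::AbstractString)
    # ... clean and reformat the files under `dir` ...
    return dir
end

# Hypothetical layout: one working folder per month for the remaining 10 months.
month_dirs = [joinpath("data", "2021-" * lpad(m, 2, '0')) for m in 3:12]

# Each folder is handed to exactly one worker; at most 5 run at a time.
results = pmap(process_month, month_dirs)
```

Because `pmap` hands out one folder per call, no two workers touch the same month, and the next folder starts as soon as a worker is free.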

tbeason · March 18, 2022, 12:07pm

The only concern here that I would have is to make sure that the 5 processes are working on disjoint sets of files. I think it is also important to point out that if your code is very IO heavy then using 5 workers may not help much because IO does not parallelize well.
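One way to check the disjointness up front is sketched below, assuming each process is given its own working folder. The function name `disjoint_dirs` is hypothetical, and the check compares path components only, so it does not resolve symlinks.

```julia
# Return true if no folder in `dirs` is equal to, or nested inside, another.
# Note: paths are compared textually after `abspath`; symlinks are not resolved.
function disjoint_dirs(dirs)
    paths = [splitpath(abspath(d)) for d in dirs]
    for i in eachindex(paths), j in eachindex(paths)
        i == j && continue
        a, b = paths[i], paths[j]
        # `a` contains (or equals) `b` when `a` is a prefix of `b`.
        if length(a) <= length(b) && a == b[1:length(a)]
            return false
        end
    end
    return true
end
```

Calling this on the list of month folders before launching anything would catch the case where one job's folder is the parent of another's.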

AlanAylmer · March 18, 2022, 4:39pm

Thank you @lawless-m
I am just doing it once, so I’ll do it across terminals.
I will look into Distributed anyway for my own education.

AlanAylmer · March 18, 2022, 4:41pm

Thank you @tbeason
I do not need super-efficiency; even a degree of parallelism is qualitatively different from none.
I take your point about IO being especially constraining, though.
