Some of my processes may take a minute, and others milliseconds, so load balancing seems like a good idea. Do both do this?
Any other things I should know about when using pmap vs. @threads?
I’ve tagged jump because I will be copying a JuMP model to each thread using copy_model.
It depends on your application. If you allocate and deallocate a lot of memory in each thread or worker, you might prefer
pmap, since GC is distributed across the worker processes there. On the other hand, if you need to read and write shared structures, you might want
Threads.@threads. Also, if each task can take a different length of time, something like
FLoops might be better than a static scheduler.
More information would be needed for a recommendation, but the best is always to test both approaches.
@threads does not load balance. However, you could use
Threads.@spawn (see Jupyter Notebook Viewer for an example/comparison).
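To make the Threads.@spawn suggestion concrete, here is a minimal sketch that starts one task per work item, so that a free thread can pick up the next task when it finishes its current one. The `work` function and the `sizes` vector are hypothetical stand-ins for tasks with very uneven run times; they are not from the original question. It assumes Julia was started with more than one thread (for example `julia -t 4`), although it also runs with a single thread.

```julia
using Base.Threads: @spawn

# Hypothetical workload: the cost grows with n, so items take very different times.
work(n) = sum(sqrt(i) for i in 1:n)

# Hypothetical mix of cheap and expensive items.
sizes = [10, 200_000, 100, 500_000, 1_000]

# Start one task per item; the scheduler decides which thread runs each task.
tasks = map(sizes) do n
    @spawn work(n)
end

# Wait for every task and collect the results in the original order.
results = fetch.(tasks)
```

Spawning one task per item is reasonable when each item does a meaningful amount of work; for very many tiny items the per-task overhead may outweigh the benefit, so grouping items into chunks first is worth testing.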
Note that you are comparing multithreading to multi-processing here.
Thanks for the info! To be honest I’m not super sure what the difference between multi-threading and multi-processing is…and which would be better for my application.
It might be nice to have access to shared data structures (for reading) but I could also just duplicate these data structures. I don’t need access to any shared containers for writing.
One way to think about it, especially with Distributed in Julia, is that pmap could be running on entirely different computers - maybe separated by an ocean.
Your Threads all run on the local machine.
Ok, good to know. Can I actually run pmap across several servers?
Also, is there anything wrong with doing this? And does this construct have a name?
# call functions that use info from problem and from cell
I don’t know if that pattern has a specific name in Julia.
The general term for things like that is “Function Factory”.
And I hope they are OK. I use them all the time.
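The snippet from the question is not reproduced in this thread, so the following is only a generic sketch of what a function factory can look like: an outer function takes the problem-level information and returns a closure that works on one cell at a time. All names here (`make_cell_solver`, `problem`, `cells`, and their fields) are hypothetical.

```julia
# Hypothetical factory: captures `problem` and returns a function of `cell`.
function make_cell_solver(problem)
    # The returned closure uses info from both the problem and the cell.
    return cell -> problem.scale * cell.value + problem.offset
end

# Hypothetical inputs, using named tuples as lightweight records.
problem = (scale = 2.0, offset = 1.0)
cells = [(value = 1.0,), (value = 2.5,)]

solve_cell = make_cell_solver(problem)

# Apply the closure to every cell.
results = map(solve_cell, cells)
```

Because the closure carries `problem` with it, the calling code only has to pass the cells around.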
Does this “across an ocean” thing mean that I can’t access global variables? In addition to some copy-able variables, I have some really big data tables that I shouldn’t be copying with every cell that I pass to pmap.
There have been many discussions on this here on Discourse (and, of course, elsewhere). See for example Multi-threading or multi-processing, how to know which to use and when?
Yes, that was my point of describing it like that.
If you want separate Processes to have access to data, they need their own copy somehow.
So two processes needing the same 1 GB of in-memory data means 2 GB of RAM is required.
Whereas Threads can all read the same 1 GB.
Of course, the downside to that is the Threads can all (possibly) write to it too (if it is mutable), and orchestrating that becomes your problem to solve.
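As a rough illustration of the shared-memory case, the sketch below has all threads read one table without copying it, while each iteration writes only to its own output slot so that no coordination is needed. The `table` data and the `column_sums` function are hypothetical, chosen only to keep the example small.

```julia
using Base.Threads: @threads

# Hypothetical example: sum each column of a table that all threads share.
function column_sums(table)
    out = zeros(size(table, 2))      # one slot per column
    @threads for j in axes(table, 2)
        # Every thread reads the same `table`; iteration j writes only out[j].
        out[j] = sum(@view table[:, j])
    end
    return out
end

# Stand-in for a large read-only data table.
table = rand(1_000, 50)
col_sums = column_sums(table)
```

If several iterations had to update the same location instead, that write would need a lock or an atomic, which is the orchestration problem mentioned above.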
Oooo what if I use BOTH Threads.@threads and pmap in different parts of my code? Is it bad to start Julia with both multiple threads: -t 20 and multiple processes: -p 20 ? (for example)
And then in the main process, I first do something with shared memory using @threads and then later in the program I send work to my 20 workers?
Is there like a lot of overhead involved if I set both -t and -p >1 ?
No, do whatever solves your problem, there is no “one way to do things”.
In fact, mixing and matching will help you learn which to use next time.
And be sure to test your assumptions by running things on a single Thread / Process. What often seems like a good candidate to parallelize sees no benefit and may actually run slower because of the communication overhead and context switching.
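One simple way to run that check is to time a serial and a threaded version of the same loop and compare. This is only a sketch: `work`, `run_serial`, `run_threaded`, and the input `xs` are hypothetical, and `@elapsed` gives a single rough measurement rather than a careful benchmark.

```julia
using Base.Threads: @threads

# Hypothetical per-item workload.
work(x) = sum(sin, 1:x)

function run_serial(xs)
    out = zeros(length(xs))
    for i in eachindex(xs)
        out[i] = work(xs[i])
    end
    return out
end

function run_threaded(xs)
    out = zeros(length(xs))
    @threads for i in eachindex(xs)
        out[i] = work(xs[i])   # each iteration writes only its own slot
    end
    return out
end

xs = fill(10_000, 200)

# Call each version once first so compilation time is not measured.
run_serial(xs); run_threaded(xs)

t_serial = @elapsed run_serial(xs)
t_threaded = @elapsed run_threaded(xs)
println("serial: ", t_serial, " s, threaded: ", t_threaded, " s")
```

If the threaded time is not clearly lower on the real workload, the extra complexity is probably not paying for itself.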