I have access to a new machine that has 2 x 20-core Xeon Gold CPUs which support multithreading. I was playing around with some simple pmap problems, and I was not seeing a performance jump when I went from 40 to 80 workers (hoping to benefit from the multithreading). Does anyone have any suggestions on how best to leverage my computing environment with Julia for embarrassingly parallel (i.e., pmap) type problems?
Have you set the environment variable JULIA_NUM_THREADS, or started Julia with julia -p 40 to get 40 worker processes?
JULIA_NUM_THREADS doesn’t apply to pmap; that comes into play with Threads.@threads for.
@gideonsimpson, why would you expect performance to improve if you go from 40 to 80 workers if you only have 40 logical cores?
I’m aware - but it may be that the code from @gideonsimpson uses that macro in some place and they expected an increase in performance. I don’t know, as they didn’t specify in the first post 
I think he meant that he has 40 physical cores which support hyperthreading, i.e., 80 hardware threads.
Even if that’s the case, I actually wouldn’t expect performance to improve if you increase the number of workers past the number of physical cores.
I use addprocs at the beginning of my script, which I call with just julia script1.jl
To be clear, pmap distributes computation across processes which do not share memory; objects get serialized to get sent between processes, and the processes can run on remote machines. Maybe things aren’t speeding up when you add processes because of this serialization overhead?
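One way to check whether communication overhead is the issue is to batch the work, so that several items travel in each message. Below is a minimal sketch of that idea, not the original poster's code: the function slow_square and the worker count of 2 are hypothetical placeholders, and on the machine described above one would compare 40 workers against 80.

```julia
using Distributed

# Assumption: 2 local workers keep the sketch small; substitute 40 or 80
# to reproduce the comparison discussed in this thread.
addprocs(2)

# Hypothetical workload; @everywhere defines it on every worker process.
@everywhere slow_square(x) = (sleep(0.01); x^2)

# batch_size > 1 sends several items per message, which cuts the
# per-item communication cost when each call is cheap.
results = pmap(slow_square, 1:20; batch_size = 5)

println(results[1:3])

# Shut the workers down again.
rmprocs(workers())
```

If timings barely change with batch_size, serialization is probably not the bottleneck, and the physical core count is the more likely limit.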
Multi-threading, on the other hand, is shared-memory, so it does not incur that overhead. It is most easily used with the @threads macro, not pmap, and all threads are on the same machine. Multi-threading is experimental, but mostly works unless you’re doing IO on the threads. So you could try the @threads macro instead, but then you’d need to launch Julia in an environment where JULIA_NUM_THREADS is set for Julia’s threads to run on multiple CPU cores; otherwise all your @threads will do is run a bunch of tasks on the same core.
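For comparison, here is a rough sketch of the threaded version. The function work and the input range are made up for illustration. It assumes Julia was started with the thread count set in the environment, for example JULIA_NUM_THREADS=40 julia script1.jl; without that, nthreads() is 1 and the loop runs serially.

```julia
using Base.Threads

# Hypothetical CPU-bound workload.
work(x) = sum(sin, 1:x)

inputs = collect(1:1000)
out = Vector{Float64}(undef, length(inputs))

# Each iteration writes only to its own slot of `out`,
# so no locking is needed.
@threads for i in eachindex(inputs)
    out[i] = work(inputs[i])
end

println("threads available: ", nthreads())
```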
If he goes the threading route, I’d recommend taking a look at KissThreading.jl. Among other things, it offers a tmap! function and initializes a vector of Mersenne Twisters named TRNG that you can use if any of the code generates random numbers.
pmap and tmap! are better than @distributed for and @threads for when the functions being called take a while, and there’s some variance in that run time.
The former use dynamic scheduling, and the latter static scheduling.
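A small sketch of that difference using the Distributed versions. The function uneven and the worker count are hypothetical; the sleep calls just stand in for calls whose run time varies.

```julia
using Distributed

# Assumption: 2 local workers, only to keep the sketch small.
addprocs(2)

# Hypothetical workload: every fourth item is about 10x slower.
@everywhere function uneven(i)
    sleep(i % 4 == 0 ? 0.05 : 0.005)
    return i
end

# Static scheduling: @distributed splits the range into one chunk per
# worker up front, so a worker stuck with slow items cannot hand any off.
static_sum = @distributed (+) for i in 1:16
    uneven(i)
end

# Dynamic scheduling: pmap gives the next item to whichever worker is free.
dynamic_sum = sum(pmap(uneven, 1:16))

println((static_sum, dynamic_sum))
rmprocs(workers())
```

Both forms compute the same result; wrapping each in @time with a realistic workload would show whether the load imbalance matters in practice.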
I’ve normally tried using threads before distributed. However, I normally get poor scaling. Much worse than OpenMP. It’s probably my fault.