OK, I spent a few days benchmarking various versions of local multiprocessing under Julia 0.6.2, on 4-core and 8-core Intel processors. The results are at the end of http://julia.cookbook.tips/doku.php?id=parallel . Roughly speaking, Threads work wonderfully as long as the function needs very little memory. Threads can deteriorate badly when each function call needs a good chunk of memory. Threads then turn worse than single-processing, which is understandable. What is less understandable is that Threads then also turn worse than pmap. I surmise that the OS has a better scheduler for such situations than Julia has internally. So, if you need to remember one thing:
use Threads if your function and its memory easily fit into the L1 cache.
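For illustration, here is a minimal sketch of the kind of workload where Threads should do well: each call touches a single scalar and allocates nothing. The names `tinywork` and `threaded_map` are made up for this example and are not taken from the benchmark page. The thread count is fixed when Julia starts (via the `JULIA_NUM_THREADS` environment variable), so with the default of one thread this simply runs sequentially.

```julia
using Base.Threads   # provides @threads and nthreads()

# Hypothetical small-footprint workload: pure arithmetic on one scalar, no allocation.
tinywork(x) = sqrt(x) + sin(x)^2 + cos(x)^2

function threaded_map(f, xs::Vector{Float64})
    out = similar(xs)                  # preallocate once; each iteration writes only its own slot
    @threads for i in 1:length(xs)
        out[i] = f(xs[i])
    end
    return out
end

xs = collect(1.0:10000.0)
results = threaded_map(tinywork, xs)
println("ran on ", nthreads(), " thread(s)")
```

Because every iteration writes to a distinct index of `out`, no locking is needed in this sketch.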
The next important aspect to remember is that pmap deteriorates badly (3-4 orders of magnitude) with short function calls.
Never use pmap if you have many short function calls.
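A rough way to check this on your own machine is to time `map` against `pmap` on a trivially short function. This is only a sketch: `shortcall` and the worker count of 2 are assumptions for illustration, and the code uses the Julia 0.7+ layout where these functions live in the `Distributed` standard library (on 0.6 they are in Base). No timings are quoted here because they depend heavily on the machine.

```julia
using Distributed   # Julia 0.7 and later; on 0.6 these names are in Base, so omit this line

addprocs(2)                          # assumed worker count, for illustration only
@everywhere shortcall(x) = x + 1     # hypothetical very short function

xs = 1:1000
serial   = map(shortcall, xs)        # first calls also serve as a warm-up (compilation)
parallel = pmap(shortcall, xs)       # with default settings, each element is one remote call

t_map  = @elapsed map(shortcall, xs)
t_pmap = @elapsed pmap(shortcall, xs)
println("map:  ", t_map, " s")
println("pmap: ", t_pmap, " s")

rmprocs(workers())                   # shut the workers down again
```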
For many short function calls, @parallel is useless too, but not murder: more like a factor of 2 than a factor of 1,000 slower than sequential processing. Finally,
@parallel seems good for middle-of-the-road tasks, with modest memory needs and a medium amount of overhead (function calls that are neither too many nor too brief).
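As a sketch of what such a middle-of-the-road use might look like: a parallel reduction over a moderate number of calls, each doing a moderate amount of work. Note that `@parallel` was renamed `@distributed` in Julia 0.7 and later, which is the form used below; on 0.6.2 you would write `@parallel (+) for ...` and drop the `using` line. The function `mediumwork` and the worker count are assumptions for illustration.

```julia
using Distributed   # Julia 0.7 and later; on 0.6 omit this line and use @parallel instead

addprocs(2)         # assumed worker count, for illustration only

# Hypothetical medium-sized task: a short loop with no large allocations.
@everywhere mediumwork(i) = sum(sqrt(k) for k in 1:i)

# Each worker handles a chunk of the range and the partial sums are combined with (+).
total = @distributed (+) for i in 1:2000
    mediumwork(i)
end
println("total = ", total)

rmprocs(workers())  # shut the workers down again
```

The range is split into one chunk per worker, so the per-call overhead is much lower than dispatching every element separately.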
The choice of the number of worker processes or threads can be tweaked, but its importance pales in comparison to the advice above. Good rules of thumb (not perfect rules) are:
threads: use the maximum number (the lower of the number of CPU threads and the number of function invocations to be run).
workers: use the maximum number of cores plus 2 as your number of workers.
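Written out as code, the two rules of thumb are just the following. The helper names `suggested_threads` and `suggested_workers` are made up here. `Sys.CPU_THREADS` is the Julia 0.7+ name (it was `Sys.CPU_CORES` on 0.6) and counts logical CPUs, which on hyperthreaded machines can exceed the physical core count; which of the two the "cores plus 2" rule refers to is an assumption on my part.

```julia
# Rule of thumb for threads: the lower of CPU threads and number of function calls.
suggested_threads(cpu_threads::Integer, ncalls::Integer) = min(cpu_threads, ncalls)

# Rule of thumb for workers: number of cores plus 2.
suggested_workers(cores::Integer) = cores + 2

ncpu = Sys.CPU_THREADS   # logical CPUs as seen by Julia (Sys.CPU_CORES on 0.6)
println("threads for 1000 calls: ", suggested_threads(ncpu, 1000))
println("workers:                ", suggested_workers(ncpu))
```

The thread count has to be set before Julia starts (`JULIA_NUM_THREADS`), whereas workers can be added at startup with `julia -p N` or at run time with `addprocs(N)`.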