I have a function that computes the mean of each column of a matrix 100 times. However, the serial version is as fast as the parallel version on my machine (3.5 GHz Intel Core i5 with 4 cores)

addprocs(3)
@everywhere function repeated_mean(x)
for _ in 1:100
mean(x)
end
return x
end
function f(X)
@time map(repeated_mean, X)
@time pmap(repeated_mean, X)
return nothing
end
f([rand(10_000_000) for j in 1:20])
#> 8.432926 seconds (2 allocations: 256 bytes)
#> 9.528860 seconds (377.27 k allocations: 1.524 GiB, 8.65% gc time)

What is happening? I naively expected the parallel version to be 4 times faster. Is there a better way to exploit multicores in this kind of computation?

The mean calculation is probably already highly parallel due to SIMD and IO-bound instead of compute bound since the computation is really cheap. Try a more expensive calculation.

SIMD happens at each core. If your problem is memory or latency-bound, your best option is closely examine the problem to see if there are opportunities to cut down on memory access.

Just to be clear (sorry I don’t know anything about this), IO-bound means that, even though each core reads a different vector, parallelization is slowed down because, in some vague sense, all the cores use a common stuff to read a vector? Is it related to having multiple cpus with one socket vs multiple cpus with multiple sockets?

It means that the computation is bound by the speed of moving things into and out of use/cache. Something like matrix multiplication re-uses the values multiple times. Other computations like exponents are just expensive. A bunch of additions in a row where each value is only used once can only be parallelized so much until it just can’t feed values in fast enough. As computations get more expensive, they are less bound by IO and that’s where parallelization is more useful (well, you can parallelize in IO bound cases, but not well locally like this).

using BenchmarkTools
addprocs(Sys.CPU_CORES)
@everywhere function repeated_mean(x)
s = zero(eltype(x))
for i = 1:100
s += mean(x)
end
return s
end
@everywhere function repeated_hard_mean(x)
s = zero(eltype(x))
for i = 1:100
s += mean(log(sin(exp(xi))) for xi in x)
end
return s
end

X = [rand(100_000) for j in 1:20]
@btime map(repeated_mean, $X) # 30.8 ms
@btime pmap(repeated_mean, $X) # 28.3 ms
@btime map(repeated_hard_mean, $X) # 11.3 s
@btime pmap(repeated_hard_mean, $X) # 3.61 s