Why isn't pmap faster than map in this example?

I have a function that computes the mean of a vector 100 times, and I map it over a collection of 20 large vectors. However, the serial version (map) is as fast as the parallel version (pmap) on my machine (3.5 GHz Intel Core i5 with 4 cores):

addprocs(3)
@everywhere function repeated_mean(x)
   for _ in 1:100
      mean(x)
   end
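   # NB: returning x means pmap must ship each 10_000_000-element vector back to the master process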
   return x
end
function f(X)
   @time map(repeated_mean, X)
   @time pmap(repeated_mean, X)
   return nothing
end
f([rand(10_000_000) for j in 1:20])
#>  8.432926 seconds (2 allocations: 256 bytes)
#>  9.528860 seconds (377.27 k allocations: 1.524 GiB, 8.65% gc time)

What is happening? I naively expected the parallel version to be about 4 times faster. Is there a better way to exploit multiple cores for this kind of computation?

I would try using a tmap

https://github.com/bkamins/KissThreading.jl

Make sure you enable threads, of course.
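
For reference, threads have to be enabled before Julia starts; a quick check, assuming you set the JULIA_NUM_THREADS environment variable before launching:

# Start Julia with e.g. JULIA_NUM_THREADS=4 set in the environment,
# then verify from inside the session:
Threads.nthreads()  # should report 4, not the default of 1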


@ChrisRackauckas Using multithreading is a bit faster, but only by about 8%:

using KissThreading
function f(X)
   @time map!(repeated_mean, similar(X), X)
   @time tmap!(repeated_mean, similar(X), X)
   return nothing
end
Threads.nthreads()
#> 4
f([rand(10_000_000) for j in 1:20])
#>  8.242080 seconds (1 allocation: 240 bytes)
#>  7.594405 seconds (154 allocations: 10.547 KiB)

The mean calculation is probably already highly parallel due to SIMD, and it is IO-bound rather than compute-bound, since the computation per element is really cheap. Try a more expensive calculation.
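
For instance, the hot loop inside mean is essentially a plain reduction, which the compiler already turns into wide vector instructions on a single core. A minimal sketch of that kind of loop (simd_sum is just an illustrative name, not what Base actually calls):

# A hand-written reduction; @simd lets the compiler emit vector
# instructions, so one core processes several Float64s per instruction.
function simd_sum(x)
   s = zero(eltype(x))
   @inbounds @simd for i in eachindex(x)
      s += x[i]
   end
   return s
end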


I see. That’s the kind of simple computation I am interested in, though. Does SIMD really use multiple cores?

SIMD happens within each core, not across cores. If your problem is memory- or latency-bound, your best option is to examine the problem closely for opportunities to cut down on memory access.
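
For example (a sketch, not specific to your code): if you need several statistics of the same large vector, fusing them into one sweep halves the memory traffic compared to two separate passes:

# One pass over x instead of two: accumulate the sums for the mean and
# a (naive) variance together, so x is read from memory only once.
function mean_and_var_onepass(x)
   s = zero(eltype(x))
   ss = zero(eltype(x))
   @inbounds @simd for i in eachindex(x)
      xi = x[i]
      s += xi
      ss += xi * xi
   end
   n = length(x)
   m = s / n
   return m, ss / n - m * m   # naive one-pass variance; fine for a sketch
end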


Just to be clear (sorry, I don’t know anything about this): IO-bound means that, even though each core reads a different vector, parallelization is slowed down because, in some vague sense, all the cores share some common resource when reading a vector? Is it related to having multiple CPUs on one socket vs. multiple CPUs on multiple sockets?

It means that the computation is bound by the speed of moving data into and out of the caches rather than by arithmetic. Something like matrix multiplication re-uses each value multiple times, and other computations, like exponentials, are simply expensive per element. But a stream of additions where each value is used only once can only be sped up so much before the cores can’t be fed values fast enough: all the cores on a socket share memory bandwidth. As computations get more expensive, they become less IO-bound, and that’s where this kind of parallelization is more useful (you can parallelize IO-bound cases too, but not well locally like this).


An illustrative example (on a dual-core i5):

using BenchmarkTools
addprocs(Sys.CPU_CORES)
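# (Sys.CPU_CORES was renamed to Sys.CPU_THREADS in Julia 0.7)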

@everywhere function repeated_mean(x)
   s = zero(eltype(x))
   for i = 1:100
      s += mean(x)
   end
   return s
end

@everywhere function repeated_hard_mean(x)
   s = zero(eltype(x))
   for i = 1:100
      s += mean(log(sin(exp(xi))) for xi in x)
   end
   return s
end
X = [rand(100_000) for j in 1:20]

@btime map(repeated_mean, $X)       # 30.8 ms
@btime pmap(repeated_mean, $X)      # 28.3 ms

@btime map(repeated_hard_mean, $X)  # 11.3 s
@btime pmap(repeated_hard_mean, $X) # 3.61 s