Why isn't pmap faster than map in this example?

I have a function that computes the mean of a vector 100 times, and I map it over a collection of 20 large vectors. However, the serial version (map) is as fast as the parallel version (pmap) on my machine (3.5 GHz Intel Core i5 with 4 cores):

addprocs(3)
@everywhere function repeated_mean(x)
   for _ in 1:100
      mean(x)
   end
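   # NB: returning x means pmap must ship each 10_000_000-element vector back to the master process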
   return x
end
function f(X)
   @time map(repeated_mean, X)
   @time pmap(repeated_mean, X)
   return nothing
end
f([rand(10_000_000) for j in 1:20])
#>  8.432926 seconds (2 allocations: 256 bytes)
#>  9.528860 seconds (377.27 k allocations: 1.524 GiB, 8.65% gc time)

What is happening? I naively expected the parallel version to be about 4 times faster. Is there a better way to exploit multiple cores for this kind of computation?

I would try using a tmap

https://github.com/bkamins/KissThreading.jl

Make sure you enable threads, of course.
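
For reference, threads have to be enabled before Julia starts; a quick check, assuming you set the JULIA_NUM_THREADS environment variable before launching:

# Start Julia with e.g. JULIA_NUM_THREADS=4 set in the environment,
# then verify from inside the session:
Threads.nthreads()  # should report 4, not the default of 1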


@ChrisRackauckas Using multithreading is a bit faster, but only by about 8%:

using KissThreading
function f(X)
   @time map!(repeated_mean, similar(X), X)
   @time tmap!(repeated_mean, similar(X), X)
   return nothing
end
Threads.nthreads()
#> 4
f([rand(10_000_000) for j in 1:20])
#>  8.242080 seconds (1 allocation: 240 bytes)
#>  7.594405 seconds (154 allocations: 10.547 KiB)

The mean calculation is probably already highly parallel due to SIMD, and it is IO-bound rather than compute-bound, since the computation per element is really cheap. Try a more expensive calculation.
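
For instance, the hot loop inside mean is essentially a plain reduction, which the compiler already turns into wide vector instructions on a single core. A minimal sketch of that kind of loop (simd_sum is just an illustrative name, not what Base actually calls):

# A hand-written reduction; @simd lets the compiler emit vector
# instructions, so one core processes several Float64s per instruction.
function simd_sum(x)
   s = zero(eltype(x))
   @inbounds @simd for i in eachindex(x)
      s += x[i]
   end
   return s
end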


I see. That’s the kind of simple computation I am interested in, though. Does SIMD really use multiple cores?

SIMD happens within each core, not across cores. If your problem is memory- or latency-bound, your best option is to examine the problem closely for opportunities to cut down on memory access.
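
For example (a sketch, not specific to your code): if you need several statistics of the same large vector, fusing them into one sweep halves the memory traffic compared to two separate passes:

# One pass over x instead of two: accumulate the sums for the mean and
# a (naive) variance together, so x is read from memory only once.
function mean_and_var_onepass(x)
   s = zero(eltype(x))
   ss = zero(eltype(x))
   @inbounds @simd for i in eachindex(x)
      xi = x[i]
      s += xi
      ss += xi * xi
   end
   n = length(x)
   m = s / n
   return m, ss / n - m * m   # naive one-pass variance; fine for a sketch
end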


Just to be clear (sorry, I don’t know anything about this): IO-bound means that, even though each core reads a different vector, parallelization is slowed down because, in some vague sense, all the cores share some common resource when reading a vector? Is it related to having multiple CPUs on one socket vs. multiple CPUs on multiple sockets?

It means that the computation is bound by the speed of moving data into and out of the caches rather than by arithmetic. Something like matrix multiplication re-uses each value multiple times, and other computations, like exponentials, are simply expensive per element. But a stream of additions where each value is used only once can only be sped up so much before the cores can’t be fed values fast enough: all the cores on a socket share memory bandwidth. As computations get more expensive, they become less IO-bound, and that’s where this kind of parallelization is more useful (you can parallelize IO-bound cases too, but not well locally like this).


An illustrative example (on a dual-core i5):

using BenchmarkTools
addprocs(Sys.CPU_CORES)
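# (Sys.CPU_CORES was renamed to Sys.CPU_THREADS in Julia 0.7)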

@everywhere function repeated_mean(x)
   s = zero(eltype(x))
   for i = 1:100
      s += mean(x)
   end
   return s
end

@everywhere function repeated_hard_mean(x)
   s = zero(eltype(x))
   for i = 1:100
      s += mean(log(sin(exp(xi))) for xi in x)
   end
   return s
end
X = [rand(100_000) for j in 1:20]

@btime map(repeated_mean, $X)       # 30.8 ms
@btime pmap(repeated_mean, $X)      # 28.3 ms

@btime map(repeated_hard_mean, $X)  # 11.3 s
@btime pmap(repeated_hard_mean, $X) # 3.61 s