Pmap, Folds.map and ThreadsX.map perform worse than map for a seemingly parallel task

I am trying to parallelize a simple function (doTheTransit in the MWE below). I have opened Julia in the terminal with -p 6, since I have a 6-core processor.
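
For reference, the same worker setup can also be created from inside a running session:

using Distributed
addprocs(6)    # same effect as starting Julia with -p 6
nworkers()     # should report 6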
Explanation of my MWE:
With the function createNecessary I construct a vector of tuples, where each tuple consists of a dictionary and a vector of indices. The function doTheTransit takes a state and one such tuple and computes one element of the successor state. Every element has its own transition function, so its own successor element: the i-th element's successor is determined by certain elements of the current state, namely those at the positions idx constructed in createNecessary. The dictionary built in createNecessary maps every possible combination of values at idx (the keys) to a successor value for the i-th element (the values).
I now want to use map to iterate over the vector of tuples and build the successor state by computing each successor element separately with doTheTransit.
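
To make the structure concrete, a single entry of that vector might look like this (positions and values chosen arbitrarily for illustration):

# hypothetical entry: element i depends on the elements at positions 3 and 7;
# the dictionary maps every 0/1 pattern at those positions to a successor value
dicti = Dict([0, 0] => 0, [1, 0] => 1, [0, 1] => 1, [1, 1] => 0)
idx = [3, 7]
entry = (dicti, idx)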

This task seems perfect for parallelization, but with pmap I get far more allocations and need far more time than with the non-parallel map. I have also included Folds.map and ThreadsX.map, since these are also supposed to parallelize the task; both need a lot less time than pmap, but neither is faster than map.

I am very surprised by this outcome, since I was hoping to reduce the runtime of my functions by parallelizing this computation.
Am I doing something wrong?
Or why does the runtime increase?

using Distributed, Folds, ThreadsX

@everywhere function createNecessary(n)
    vectorOfTuple = Vector{Tuple{Dict{Vector{Int64}, Int64}, Vector{Int}}}(undef, n)
    randi = rand(1:(n-1), n)                  # number of inputs of each element
    for i in 1:n
        dicti = Dict{Vector{Int64}, Int64}()
        idx = rand(1:n, randi[i])             # positions the i-th element depends on
        for j in 0:((1 << randi[i]) - 1)      # all 0/1 combinations of the inputs
            dicti[digits(j, base=2, pad=randi[i])] = rand(0:1)
        end
        vectorOfTuple[i] = (dicti, idx)
    end
    return vectorOfTuple
end

@everywhere function doAllTransitions(funcis, state)
    println("Map:")
    @time map(fun -> doTheTransit(fun, state), funcis)
    println("Pmap:")
    @time pmap(fun -> doTheTransit(fun, state), funcis)
    println("Folds.map:")
    @time Folds.map(fun -> doTheTransit(fun, state), funcis)
    println("ThreadsX.map:")
    @time ThreadsX.map(fun -> doTheTransit(fun, state), funcis)
end

@everywhere function doTheTransit(func, state)
    tempi = state[func[2]]    # values of the state at the relevant positions
    return func[1][tempi]     # dictionary lookup yields the successor element
end

doAllTransitions(createNecessary(25), rand(0:1, 25))

The @time results were the following:

Map:
  0.000758 seconds (26 allocations: 4.031 KiB)
Pmap:
810.314207 seconds (46.93 M allocations: 11.876 GiB, 27.62% gc time, 0.01% compilation time)
Folds.map:
  0.076025 seconds (5.96 k allocations: 338.998 KiB, 81.34% compilation time)
ThreadsX.map:
  0.090041 seconds (7.45 k allocations: 413.728 KiB, 97.14% compilation time)

Did you try running things twice? You could be measuring compilation.
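
For example (assuming funcis and state are bound at top level as in your script), BenchmarkTools gives a timing that excludes one-off compilation, since @btime runs the call many times and reports the minimum:

using BenchmarkTools
# repeated evaluation means compilation cost does not distort the result
@btime map(fun -> doTheTransit(fun, $state), $funcis)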

Yes, I ran it twice and the outcome did not change.

Isolating test objects and initialization, we get an MWE like

using Distributed, Folds, ThreadsX, BenchmarkTools

function createNecessary(n)
    vectorOfTuple = Vector{Tuple{Dict{Vector{Int64}, Int64}, Vector{Int}}}(undef, n)
    randi = rand(1:(n-1), n)
    for i in 1:n
        dicti = Dict{Vector{Int64}, Int64}()
        idx = rand(1:n, randi[i])
        for j in 0:((1<<randi[i])-1)
            dicti[digits(j, base=2, pad=randi[i])] = rand(0:1)
        end
        vectorOfTuple[i] = (dicti, idx)
    end
    return vectorOfTuple
end

@everywhere function doTheTransit(func, state)
    tempi = state[func[2]]
    return func[1][tempi]
end

function doAllTransitions1(funcis, state)
    map(fun -> doTheTransit(fun, state), funcis)
end
@everywhere function doAllTransitions2(funcis, state)
    pmap(fun -> doTheTransit(fun, state), funcis)
end
function doAllTransitions3(funcis, state)
    Folds.map(fun -> doTheTransit(fun, state), funcis)
end
function doAllTransitions4(funcis, state)
    ThreadsX.map(fun -> doTheTransit(fun, state), funcis)
end

vt = createNecessary(25)
println("Map:")
@btime doAllTransitions1($vt, rand(0:1, 25))
println("Pmap:")
@btime doAllTransitions2($vt, rand(0:1, 25))
println("Folds.map:")
@btime doAllTransitions3($vt, rand(0:1, 25))
println("ThreadsX.map:")
@btime doAllTransitions4($vt, rand(0:1, 25))

and see the result

Map:
  10.000 μs (27 allocations: 5.12 KiB)
Pmap:
  53.200 μs (212 allocations: 10.94 KiB)
Folds.map:
  10.800 μs (33 allocations: 5.53 KiB)
ThreadsX.map:
  21.900 μs (96 allocations: 10.97 KiB)

It looks to me as if the amount of parallelizable work done on the (potentially) big data structure is negligible compared to the overhead of distributing it.
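
One way to check this: inflate the per-element cost artificially and rerun the benchmark; with enough work per call, the threaded versions should pull ahead. A minimal sketch (doHeavyTransit is a made-up stand-in, not part of the original code):

@everywhere function doHeavyTransit(func, state)
    s = 0.0
    for k in 1:100_000    # artificial busywork so each element costs something
        s += sin(k)
    end
    # use s so the loop cannot be optimized away
    return func[1][state[func[2]]] + (isnan(s) ? 1 : 0)
end

st = rand(0:1, 25)
@btime map(f -> doHeavyTransit(f, $st), $vt)
@btime ThreadsX.map(f -> doHeavyTransit(f, $st), $vt)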

Still wondering about:

Might that be due to the unnecessary @everywhere in front of function createNecessary(n)?

This looks way better, but I was hoping to reduce the runtime by parallelizing the task. In my actual code the length of state, and therefore the number of iterations in map, is mostly between 10 and 30 (not very big data structures), but the map function is called up to several hundred thousand times (in some cases even a few million times).

Thank you for that advice; I am still fairly new to parallel programming in Julia. I will remove the unnecessary @everywhere in my code and run it again.

The benchmark might not be the optimal representation of your problem, then?

Instead of trying to parallelize map itself, where there seems to be little room for improvement, maybe you can parallelize over the calls to map (if they are independent), along the lines of the sketch below.
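
Something along these lines, assuming the individual states are known up front and the calls really are independent (transitAll and states are made up for illustration):

# parallel outer loop over the independent calls, sequential inner map
function transitAll(funcis, states)
    ThreadsX.map(st -> map(fun -> doTheTransit(fun, st), funcis), states)
end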


Yes, possibly, but I was very surprised by the outcome of this parallelization and posted it hoping to improve it.

This is what I will try to do now.