Pmap, Folds.map and ThreadsX.map perform worse than map for a seemingly parallel task

I am trying to parallelize a simple function (doTheTransit in the MWE below). I have opened Julia in the terminal with -p 6, since I have a 6-core processor.
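
For reference, the same worker setup can also be created from inside a running session:

using Distributed
addprocs(6)    # same effect as starting Julia with -p 6
nworkers()     # should report 6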
Explanation of my MWE:
With the function createNecessary I construct a vector of tuples, where each tuple consists of a dictionary and a vector of indices. The function doTheTransit takes a state and one such tuple and computes one element of the successor state. Every element has its own transition function, so its own successor element: the i-th element's successor is determined by certain elements of the current state, namely those at the positions idx constructed in createNecessary. The dictionary built in createNecessary maps every possible combination of values at idx (the keys) to a successor value for the i-th element (the values).
I now want to use map to iterate over the vector of tuples and build the successor state by computing each successor element separately with doTheTransit.
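
To make the structure concrete, a single entry of that vector might look like this (positions and values chosen arbitrarily for illustration):

# hypothetical entry: element i depends on the elements at positions 3 and 7;
# the dictionary maps every 0/1 pattern at those positions to a successor value
dicti = Dict([0, 0] => 0, [1, 0] => 1, [0, 1] => 1, [1, 1] => 0)
idx = [3, 7]
entry = (dicti, idx)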

This task seems perfect for parallelization, but with pmap I get far more allocations and need far more time than with the non-parallel map. I have also included Folds.map and ThreadsX.map, since these are also supposed to parallelize the task; both need a lot less time than pmap, but neither is faster than map.

I am very surprised by this outcome, since I was hoping to reduce the runtime of my functions by parallelizing this computation.
Am I doing something wrong?
Or why does the runtime increase?

using Distributed, Folds, ThreadsX

@everywhere function createNecessary(n)
    vectorOfTuple = Vector{Tuple{Dict{Vector{Int64}, Int64}, Vector{Int}}}(undef, n)
    randi = rand(1:(n-1), n)                  # number of inputs of each element
    for i in 1:n
        dicti = Dict{Vector{Int64}, Int64}()
        idx = rand(1:n, randi[i])             # positions the i-th element depends on
        for j in 0:((1 << randi[i]) - 1)      # all 0/1 combinations of the inputs
            dicti[digits(j, base=2, pad=randi[i])] = rand(0:1)
        end
        vectorOfTuple[i] = (dicti, idx)
    end
    return vectorOfTuple
end

@everywhere function doAllTransitions(funcis, state)
    println("Map:")
    @time map(fun -> doTheTransit(fun, state), funcis)
    println("Pmap:")
    @time pmap(fun -> doTheTransit(fun, state), funcis)
    println("Folds.map:")
    @time Folds.map(fun -> doTheTransit(fun, state), funcis)
    println("ThreadsX.map:")
    @time ThreadsX.map(fun -> doTheTransit(fun, state), funcis)
end

@everywhere function doTheTransit(func, state)
    tempi = state[func[2]]    # values of the state at the relevant positions
    return func[1][tempi]     # dictionary lookup yields the successor element
end

doAllTransitions(createNecessary(25), rand(0:1, 25))

The @time results were the following:

Map:
  0.000758 seconds (26 allocations: 4.031 KiB)
Pmap:
810.314207 seconds (46.93 M allocations: 11.876 GiB, 27.62% gc time, 0.01% compilation time)
Folds.map:
  0.076025 seconds (5.96 k allocations: 338.998 KiB, 81.34% compilation time)
ThreadsX.map:
  0.090041 seconds (7.45 k allocations: 413.728 KiB, 97.14% compilation time)

Did you try running things twice? You could be measuring compilation.
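
For example (assuming funcis and state are bound at top level as in your script), BenchmarkTools gives a timing that excludes one-off compilation, since @btime runs the call many times and reports the minimum:

using BenchmarkTools
# repeated evaluation means compilation cost does not distort the result
@btime map(fun -> doTheTransit(fun, $state), $funcis)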

Yes, I ran it twice and the outcome did not change.

Isolating test objects and initialization, we get an MWE like

using Distributed, Folds, ThreadsX, BenchmarkTools

function createNecessary(n)
    vectorOfTuple = Vector{Tuple{Dict{Vector{Int64}, Int64}, Vector{Int}}}(undef, n)
    randi = rand(1:(n-1), n)
    for i in 1:n
        dicti = Dict{Vector{Int64}, Int64}()
        idx = rand(1:n, randi[i])
        for j in 0:((1<<randi[i])-1)
            dicti[digits(j, base=2, pad=randi[i])] = rand(0:1)
        end
        vectorOfTuple[i] = (dicti, idx)
    end
    return vectorOfTuple
end

@everywhere function doTheTransit(func, state)
    tempi = state[func[2]]
    return func[1][tempi]
end

function doAllTransitions1(funcis, state)
    map(fun -> doTheTransit(fun, state), funcis)
end
@everywhere function doAllTransitions2(funcis, state)
    pmap(fun -> doTheTransit(fun, state), funcis)
end
function doAllTransitions3(funcis, state)
    Folds.map(fun -> doTheTransit(fun, state), funcis)
end
function doAllTransitions4(funcis, state)
    ThreadsX.map(fun -> doTheTransit(fun, state), funcis)
end

vt = createNecessary(25)
println("Map:")
@btime doAllTransitions1($vt, rand(0:1, 25))
println("Pmap:")
@btime doAllTransitions2($vt, rand(0:1, 25))
println("Folds.map:")
@btime doAllTransitions3($vt, rand(0:1, 25))
println("ThreadsX.map:")
@btime doAllTransitions4($vt, rand(0:1, 25))

and see the result

Map:
  10.000 μs (27 allocations: 5.12 KiB)
Pmap:
  53.200 μs (212 allocations: 10.94 KiB)
Folds.map:
  10.800 μs (33 allocations: 5.53 KiB)
ThreadsX.map:
  21.900 μs (96 allocations: 10.97 KiB)

It looks to me as if the amount of parallelizable work done on the (potentially) big data structure is negligible compared to the overhead of distributing it.
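
One way to check this: inflate the per-element cost artificially and rerun the benchmark; with enough work per call, the threaded versions should pull ahead. A minimal sketch (doHeavyTransit is a made-up stand-in, not part of the original code):

@everywhere function doHeavyTransit(func, state)
    s = 0.0
    for k in 1:100_000    # artificial busywork so each element costs something
        s += sin(k)
    end
    # use s so the loop cannot be optimized away
    return func[1][state[func[2]]] + (isnan(s) ? 1 : 0)
end

st = rand(0:1, 25)
@btime map(f -> doHeavyTransit(f, $st), $vt)
@btime ThreadsX.map(f -> doHeavyTransit(f, $st), $vt)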

Still wondering about:

Might that be due to the unnecessary @everywhere in front of function createNecessary(n)?

This looks way better, but I was hoping to reduce the runtime by parallelizing the task. In my actual code the length of state, and therefore the number of iterations in map, is mostly between 10 and 30 (not very big data structures), but the map function is called up to several hundred thousand times (in some cases even a few million times).

Thank you for that advice; I am still fairly new to parallel programming in Julia. I will remove the unnecessary @everywhere in my code and run it again.

The benchmark might not be the optimal representation of your problem, then?

Instead of trying to parallelize map itself, where there seems to be little room for improvement, maybe you can parallelize over the calls to map (if they are independent), along the lines of the sketch below.
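
Something along these lines, assuming the individual states are known up front and the calls really are independent (transitAll and states are made up for illustration):

# parallel outer loop over the independent calls, sequential inner map
function transitAll(funcis, states)
    ThreadsX.map(st -> map(fun -> doTheTransit(fun, st), funcis), states)
end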


Yes, possibly, but I was very surprised by the outcome of this parallelization and posted it hoping to improve it.

This is what I will try to do now.