Pmap extremely slow when function returns large object

I would like to parallelize a function, but returning large objects makes pmap extremely slow.

using Distributed
using SharedArrays

addprocs(length(Sys.cpu_info())-1)

@everywhere using Random
@everywhere function foo(i)
	Random.seed!(i)
	big_array = rand(100,100)
	Tuple(big_array)				# SharedArray elements need to be a bits type
end

n = 10
big_shared_array = SharedArray{NTuple{100*100,Float64}}(n); 	# Float64 to match what foo returns

pmap(1:n) do i
    big_shared_array[i] = foo(i);	# this is incredibly slow
end

For the first few minutes, all Julia processes except one simply hang doing nothing, making the code extremely slow (see the figure below, CPU usage on the right).
[Screenshot 2022-01-20 at 13.47.27: CPU usage over time]

Am I doing something wrong? If not, is there another option to parallelize my code? I would prefer not to use multi-threading because, except in this specific case, it is slower than pmap.
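For reference, one alternative I have been considering (a sketch only, assuming the real workload has the same shape as foo) is to let each worker write its result directly into the SharedArray with @distributed, so that no large object is ever returned through pmap and no tuple conversion is needed:

```julia
using Distributed
using SharedArrays

addprocs(length(Sys.cpu_info()) - 1)

@everywhere using Random
@everywhere N = 100
@everywhere function foo(i)
    Random.seed!(i)
    rand(N, N)
end

n = 10
# One N×N slice per task; a SharedArray{Float64} needs no tuple conversion.
results = SharedArray{Float64}(N, N, n)

# @distributed partitions 1:n across the workers; each worker writes its slice
# directly into shared memory, so nothing large is sent back to the master.
@sync @distributed for i in 1:n
    results[:, :, i] = foo(i)
end
```

I have not benchmarked this against my real function, so I cannot say how it compares.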

Here is how I would compare the options:

using BenchmarkTools
using Distributed
using SharedArrays

addprocs(length(Sys.cpu_info())÷2-1)

@everywhere using Random
@everywhere N = 100
@everywhere function foo(i)
	Random.seed!(i)
	rand(N,N)
end

n = 10

@btime array = pmap(1:n) do i
    foo(i)
end

array = Array{Matrix{Float64}}(undef, n);

@btime (for i in 1:n
    array[i] = foo(i)
end)

@btime (Threads.@threads for i in 1:n
    array[i] = foo(i)
end)

yielding

  605.000 μs (703 allocations: 812.53 KiB)
  90.100 μs (91 allocations: 786.59 KiB)
  60.300 μs (112 allocations: 789.94 KiB)

P.S.: In particular, I don't understand why a SharedArray is necessary here.

The function that I want to parallelize is much more demanding than the one in my example. For this reason, parallelizing it is faster than not doing so. Also, I tested my code both with threads and with pmap: for some reason, with multi-threading the CPU is not fully utilized, resulting in slower code compared to distributed parallelism (which uses 100% of the CPU).

In extending this function, I wanted to return a big array (like the one in the example). This makes pmap considerably slower than multi-threading (possibly because of communication overhead?). I know I can solve this problem simply by using threads, but I am interested in whether there is a way to keep using distributed parallelism (which, again, uses 100% of the CPU).

Maybe distributed arrays are a better option than shared arrays in this case. However, I do not understand how they work.
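From what I can tell from the DistributedArrays.jl README, the basic usage would look something like the sketch below (hedged: I have not used the package myself, and it is a separate package that needs to be added). The appealing part is that a DArray's elements do not have to be bits types:

```julia
using Distributed
addprocs(4)
@everywhere using DistributedArrays

# distribute() splits an existing array across the workers; the element type
# does not have to be a bits type, so each element can be a full Matrix.
d = distribute([zeros(100, 100) for _ in 1:10])

# map over a DArray runs on the workers that own each chunk and
# returns a new DArray holding the results.
d2 = map(d) do _
    rand(100, 100)
end

A = Array(d2)   # gather the results back on the master process
```

If that is right, the Tuple conversion in my example would not be needed at all.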

Here is the best I see with large tuples:

using BenchmarkTools
using Distributed
using SharedArrays

addprocs(length(Sys.cpu_info())÷2-1)

@everywhere using Random
@everywhere N = 100
@everywhere function foo(i)
	Random.seed!(i)
	NTuple{N*N,Float64}(rand(N,N))
end

n = 10

@btime (array = pmap(1:n) do i
    foo(i)
end)
println()

yielding

  12.775 ms (200650 allocations: 8.42 MiB)

and that does not include the considerable compilation time, as you already noted. Regarding this, see for example Correct way to dereference large memory? - #4 by jakobnissen
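Part of the cost shows up without pmap at all: just converting a 100×100 matrix into a 10_000-element tuple is expensive on its own, since the compiler specializes on the full tuple type. A quick sketch to see this in isolation (exact timings will vary by machine):

```julia
using BenchmarkTools

N = 100
A = rand(N, N)

# Converting to a tuple allocates and forces specialization
# on the huge type NTuple{10_000,Float64}.
@btime Tuple($A);
@btime NTuple{10_000,Float64}($A);
```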

So, from what I understand, large tuples are the problem. I return a Tuple only because SharedArray requires bits-type elements. Maybe I should look more into distributed arrays, since (if I am not mistaken) they do not require bits-type elements. Thanks for your help.