Does distributed performance depend on the number of workers?

I’m running code that solves a linear system Ax = b for a fixed matrix A and several vectors b_i, where i runs from 1 to N. I set N to match the number of available cores on my system, so I’m solving N = 32 (for example) linear systems in parallel.
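For reference, the serial structure is essentially the following (placeholder sizes and random data just to fix notation; the real A and b_i come from my actual problem and each solve is much more expensive):

```julia
using LinearAlgebra

n = 1000                  # placeholder dimension
N = 32                    # number of right-hand sides, matched to the number of cores

A = rand(n, n) + n * I    # fixed matrix (random placeholder, shifted to stay well-conditioned)
B = rand(n, N)            # column i is the vector b_i

# Serial version: solve A x_i = b_i one right-hand side at a time
X = similar(B)
for i in 1:N
    X[:, i] = A \ B[:, i]
end
```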

I do this by putting @everywhere in front of every required function, defining the matrix of solutions as a SharedArray (which holds every solution vector x_1, x_2, …, x_32), and adding the macros @sync @distributed in front of the for loop that runs over the i = 1:N (= 32) vectors b_i.
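Schematically, the distributed part looks like the sketch below (same placeholder data as above; `solve_one` and `addprocs(16)` are stand-ins for my real per-vector routine and for the worker count provided by Slurm):

```julia
using Distributed
addprocs(16)                       # stand-in for the workers provided by Slurm

@everywhere using LinearAlgebra, SharedArrays

# Every function needed on the workers is defined with @everywhere;
# solve_one is a placeholder for my real per-vector computation.
@everywhere function solve_one(A, b)
    return A \ b
end

n = 1000
N = nworkers()                     # one right-hand side per worker, e.g. 16 or 32
A = rand(n, n) + n * I
B = rand(n, N)

# Matrix of solutions as a SharedArray: column i will hold x_i
X = SharedArray{Float64}(n, N)

# A and B are captured by the loop body and sent to each worker
@sync @distributed for i in 1:N
    X[:, i] = solve_one(A, B[:, i])
end
```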

However, when I submit the job to the cluster with Slurm, the running time is much larger than I expected compared to a single run on my personal machine. On my machine the whole code (for one vector b_1) takes ~500 s, while on the cluster with N = 16 (and hence 16 workers under @sync @distributed) it takes longer than 2700 s. Strangest of all, when I send more tasks than available workers (32 vectors b_i with @sync @distributed over 16 workers), it is already finished after ~2700 s.

What could be going on? Is there a way to avoid this overhead and produce faster results?