How to distribute the solution of a linear system across multiple processes?

I intend to solve a linear system on a cluster and hope to get some parallel speedup.
I tried combining DistributedArrays.jl with IterativeSolvers.jl on my local machine, but the overhead seems far too high to be worth it, at least for this small test problem:

using DistributedArrays, IterativeSolvers

a = rand(128, 128)
a_d = distribute(a)   # distributed copy of a
v = rand(128)

@time gmres(a,v)   # 0.001 seconds
@time gmres(a_d,v) # 124 seconds
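
One plausible source of the overhead is that every solver iteration needs a matrix-vector product, and in a distributed setting each product involves communication between processes. Below is a minimal sketch of that cost, assuming two local worker processes and using only the Distributed standard library rather than DistributedArrays.jl. The names `blocks` and `distributed_matvec` are hypothetical and only for illustration; this is not how either package implements the product.

```julia
using Distributed
addprocs(2)   # assumption: two local worker processes

n = 128
a = rand(n, n)
v = rand(n)

# Split the row indices into one contiguous block per worker.
blocks = collect(Iterators.partition(1:n, cld(n, nworkers())))

# Hypothetical helper: each worker multiplies one block of rows by v,
# and the partial results are concatenated in order.
# Each row block and v are sent to a worker on every call.
function distributed_matvec(a, v, blocks)
    row_blocks = [a[rows, :] for rows in blocks]
    reduce(vcat, pmap(blk -> blk * v, row_blocks))
end

y = distributed_matvec(a, v, blocks)
@assert y ≈ a * v   # same result as the serial product

@time a * v
@time distributed_matvec(a, v, blocks)
```

For a matrix this small, I would expect the time spent shipping data to the workers to dominate the arithmetic, so the comparison is probably only meaningful for much larger systems.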

Are there iterative or direct solvers that can easily make use of parallel resources?