Inefficient parallelization? Need some help optimizing a simple dot product

You forgot an important thing: the arrays must be declared as SharedArrays so that every worker can access them without copying. With that change, even for this simple calculation the parallel version is slightly faster. Here are my results:

function sdot(n, x, y)
    a = 0.0
    @inbounds @fastmath for i=1:n
        a += x[i]*y[i]   
    end
    a
end
addprocs(7)   # workers must exist before @everywhere, or they never see pdot

@everywhere function pdot(n, x, y)
    @parallel (+) for i = 1:n
        @inbounds x[i]*y[i]
    end
end

n = 10^7
x = SharedArray{Float64,1}( ones(n) )
y = SharedArray{Float64,1}( ones(n) )

println("Naive Julia")
println(@sprintf("  %.2f", sdot(n, x, y)))
@time for i=1:3
    sdot(n, x, y)
end

println("\nNative Julia")
println(@sprintf("  %.2f", dot(x, y)))
@time for i=1:3
    dot(x, y)
end

println("\nIdiomatic parallel Julia")
println(@sprintf("  %.2f",  pdot(n, x, y)))
@time for i=1:3
    pdot(n, x, y)
end

The timings:

Naive Julia
  10000000.00
  0.027176 seconds (3 allocations: 48 bytes)

Native Julia
  10000000.00
  0.027156 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
  10000000.00
  0.025830 seconds (4.49 k allocations: 356.578 KiB)
[Finished in 4.5s]

The difference becomes clearer for larger inputs (say n = 10^9):

Naive Julia
  1000000000.00
  4.235259 seconds (3 allocations: 48 bytes)

Native Julia
  1000000000.00
  3.088961 seconds (3 allocations: 48 bytes)

Idiomatic parallel Julia
  1000000000.00
  2.282985 seconds (4.47 k allocations: 355.719 KiB)
[Finished in 29.5s]
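A side note for anyone copying this on current Julia: the code above is for the pre-1.0 API. Since Julia 1.0, addprocs/@everywhere live in the Distributed standard library, SharedArray moved to SharedArrays, @sprintf moved to Printf, and @parallel was renamed @distributed. A minimal sketch of the same reduction under those assumptions (worker count and n chosen arbitrarily):

using Distributed
addprocs(4)                              # spawn 4 local worker processes
@everywhere using Distributed, SharedArrays

# Same parallel reduction as pdot above, with the renamed macro:
@everywhere function pdot(n, x, y)
    @distributed (+) for i = 1:n
        @inbounds x[i] * y[i]
    end
end

n = 10^6
x = SharedArray{Float64}(n); x .= 1.0    # shared memory, visible to all local workers
y = SharedArray{Float64}(n); y .= 1.0
println(pdot(n, x, y))                   # 1.0e6

Note that SharedArray only shares memory between processes on the same machine; for a multi-node setup you would need DistributedArrays or explicit data movement instead.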