You forgot an important thing: the arrays must be declared as SharedArrays so that they are available to every worker process. Even for this simple calculation, the parallel version is slightly faster. Here are my results:
function sdot(n, x, y)
    a = 0.0
    @inbounds @fastmath for i = 1:n
        a += x[i]*y[i]
    end
    a
end
addprocs(7)  # workers must exist before @everywhere and before the SharedArrays are created

@everywhere function pdot(n, x, y)
    @parallel (+) for i = 1:n
        @inbounds @fastmath x[i]*y[i]
    end
end

n = 10^7
x = SharedArray{Float64,1}( ones(n) )
y = SharedArray{Float64,1}( ones(n) )
println("Naive Julia")
println(@sprintf(" %.2f", sdot(n, x, y)))
@time for i = 1:3
    sdot(n, x, y)
end
println("\nNative Julia")
println(@sprintf(" %.2f", dot(x, y)))
@time for i = 1:3
    dot(x, y)
end
println("\nIdiomatic parallel Julia")
println(@sprintf(" %.2f", pdot(n, x, y)))
@time for i = 1:3
    pdot(n, x, y)
end
The timings:
Naive Julia
10000000.00
0.027176 seconds (3 allocations: 48 bytes)
Native Julia
10000000.00
0.027156 seconds (3 allocations: 48 bytes)
Idiomatic parallel Julia
10000000.00
0.025830 seconds (4.49 k allocations: 356.578 KiB)
[Finished in 4.5s]
The difference becomes clearer for larger inputs (say n = 10^9):
Naive Julia
1000000000.00
4.235259 seconds (3 allocations: 48 bytes)
Native Julia
1000000000.00
3.088961 seconds (3 allocations: 48 bytes)
Idiomatic parallel Julia
1000000000.00
2.282985 seconds (4.47 k allocations: 355.719 KiB)
[Finished in 29.5s]