I’m doing some cheap operations on a set of huge SharedArrays, i.e. computing the distance from x,y,z.

```
using Distributed
addprocs(40);
@everywhere using SharedArrays

function compute_distance(result::SharedArray, pos::SharedArray)
    @sync @distributed for i = 1:size(pos, 2)
        result[i] = sqrt(pos[1,i]^2 + pos[2,i]^2 + pos[3,i]^2)
    end
end
```

Is it always better to write this with a local variable, which I assume is copied to each worker process, to avoid allocations inside the loop? For example,

```
function compute_distance(result::SharedArray, pos::SharedArray)
    temp = 0.0
    @sync @distributed for i = 1:size(pos, 2)
        result[i] = pos[1,i]^2
        temp = pos[2,i]^2
        result[i] += temp
        temp = pos[3,i]^2
        result[i] += temp
        result[i] = sqrt(result[i])
    end
end
```

This seems sort of ugly!

Why are you using Distributed for such a cheap calculation? I doubt the benefit exceeds the overhead.

Threads are more appropriate if you really want to run the distance calculation in parallel. (It may be worth trying the 1.3 alpha branch, as there has been a lot of work on multithreading recently.)
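Note that the thread count is fixed when Julia starts (via the `JULIA_NUM_THREADS` environment variable), so `Threads.@threads` silently runs serially in a default single-threaded session. A minimal sketch of checking this, with a toy loop to confirm the iterations are actually split across threads:

```julia
# Launch with e.g. `JULIA_NUM_THREADS=8 julia`, then verify in the session:
using Base.Threads
@show nthreads()   # prints 1 unless the environment variable was set

# Each thread tallies its own iterations in a separate slot,
# so there are no write conflicts between threads.
acc = zeros(Int, nthreads())
@threads for i in 1:1000
    acc[threadid()] += 1
end
@assert sum(acc) == 1000   # all iterations ran, across however many threads
```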

EDIT:

Here’s an example, first using broadcasting to get rid of allocations, and then using 8 threads (no broadcasting needed there, since it uses an explicit loop):

```
julia> using BenchmarkTools

julia> function f1(p, r)
           r .= sqrt.(view(p,:,1).^2 .+ view(p,:,2).^2 .+ view(p,:,3).^2)
       end
f1 (generic function with 1 method)

julia> function f2(p, r)
           Threads.@threads for i in 1:length(r)
               r[i] = sqrt(p[i,1]^2 + p[i,2]^2 + p[i,3]^2)
           end
       end
f2 (generic function with 1 method)

julia> @btime f1(p, r) evals=1 setup=(p=rand(10^8, 3); r=Vector{Float64}(undef, 10^8));
  248.940 ms (3 allocations: 144 bytes)

julia> @btime f2(p, r) evals=1 setup=(p=rand(10^8, 3); r=Vector{Float64}(undef, 10^8));
  117.666 ms (60 allocations: 6.03 KiB)
```
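As a side note, the question stores points as a 3×N matrix while the benchmark above uses N×3. Julia arrays are column-major, so the 3×N layout keeps the three coordinates of each point contiguous in memory. A hedged sketch of the same threaded kernel adapted to that layout (function name is my own, not from the thread):

```julia
# Threaded distance kernel for a 3×N (column-major friendly) layout.
# Assumes Julia was started with JULIA_NUM_THREADS > 1 for actual parallelism;
# the result is identical either way.
function distances_3xN!(r::AbstractVector, p::AbstractMatrix)
    @assert size(p, 1) == 3 && length(r) == size(p, 2)
    Threads.@threads for i in 1:size(p, 2)
        # All three coordinates of point i sit in one contiguous column.
        r[i] = sqrt(p[1,i]^2 + p[2,i]^2 + p[3,i]^2)
    end
    return r
end

p = [3.0 0.0; 4.0 5.0; 0.0 12.0]   # two points: (3,4,0) and (0,5,12)
r = Vector{Float64}(undef, 2)
distances_3xN!(r, p)               # → [5.0, 13.0]
```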