Some attempts… note that I’ve made these all write into an existing array, as the times are otherwise very variable, for me, I guess that allocating a whole GB each run stresses my computer:

```
julia> out = zeros(10^4, 10^4);
julia> @btime pairwise!($out, Euclidean(), $arr, $arr, dims=2); # original, but in-place
675.518 ms (4 allocations: 156.41 KiB)
julia> @btime pairwise!($out, Euclidean(), $arr, $arr2, dims=2); # with a copy of array
311.783 ms (4 allocations: 156.41 KiB)
julia> using Tullio
julia> out ≈ @tullio out2[i,j] := (arr[μ,i] - arr[μ,j])^2 |> sqrt
true
julia> @btime @tullio $out[i,j] = ($arr[μ,i] - $arr[μ,j])^2 |> sqrt;
171.028 ms (50 allocations: 4.81 KiB)
julia> using LoopVectorization
julia> @btime @tullio $out[i,j] = ($arr[μ,i] - $arr[μ,j])^2 |> sqrt;
90.907 ms (50 allocations: 4.81 KiB)
julia> @btime @tullio $out[i,j] = sqrt(($arr[1,i] - $arr[1,j])^2 + ($arr[2,i] - $arr[2,j])^2 + ($arr[3,i] - $arr[3,j])^2);
79.451 ms (44 allocations: 4.86 KiB)
```

This is multi-threaded (2 cores) which might count as cheating.

Also, note that transposing the problem makes it faster, at least when using LoopVectorization. It likes having a dimension longer than 3 to vectorise along:

```
julia> arr_t = permutedims(arr);
julia> @btime @tullio $out[i,j] = ($arr_t[i,μ] - $arr_t[j,μ])^2 |> sqrt;
68.236 ms (50 allocations: 4.81 KiB)
```

What these versions can’t do is iterate only over half the space. For comparison, in-place versions of @lmiq’s SVector code above, on my computer:

```
julia> function mydistmat_self_symm!(out, x)
n = length(x)
for i in 1:n # faster without @inbounds, weirdly?
for j in 1:i
out[j,i] = d3(x[i],x[j])
end
end
Symmetric(out)
end;
julia> d3(x,y) = sqrt((x[1]-y[1])^2+ (x[2]-y[2])^2+ (x[3]-y[3])^2); # called d above
julia> x = rand(SVector{3,Float64},10_000);
julia> mydistmat_self_symm!(out, x) ≈ pairwise(Euclidean(), reduce(hcat,x), reduce(hcat,x), dims=2)
true
julia> @btime mydistmat_self_symm!($out, $x);
87.738 ms (0 allocations: 0 bytes)
julia> @btime mydistmat_self_symm_threads!($out, $x); # version with Threads.@threads, but it's not dividing the triangle well
54.583 ms (24 allocations: 5.22 KiB)
julia> @btime mydistmat_self!($out, $x); # fills the whole array, no Symmetric
176.911 ms (0 allocations: 0 bytes)
julia> @btime mydistmat_self_threads!($out, $x); # here closer to a factor of 2
105.831 ms (21 allocations: 5.12 KiB)
```

It’s very odd that these get slower for me with `@inbounds`

, for example:

```
julia> @btime mydistmat_self_symm!($out, $x);
124.784 ms (0 allocations: 0 bytes)
```