I have adapted some Python code that takes `data`, which is `G x N`, and returns the (strictly upper triangular) matrix `cost_permuted` whose `(i, j)`-th entry is the minimum of `cooscillation_distance(data[i, :], data[j, :], Ψ)` over `Ψ ∈ (0, 2π)`:

```julia
function cooscillation_distance(X, Y, Ψ)
    sum(@. (X^2 + Y^2 - 2cos(Ψ) * X * Y - sin(Ψ)^2)^2)
end
```
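
To make the setup concrete, here is a minimal sketch of how a single entry of the cost matrix is obtained, minimising over `Ψ` with `Optim.optimize` on the bounded interval `(0, 2π)` (the `X` and `Y` vectors are made-up toy data, not from my actual problem):

```julia
using Optim

function cooscillation_distance(X, Y, Ψ)
    sum(@. (X^2 + Y^2 - 2cos(Ψ) * X * Y - sin(Ψ)^2)^2)
end

# Toy stand-ins for data[i, :] and data[j, :]
X = [1.0, 0.5, -0.2]
Y = [0.8, 0.4, 0.1]

# Univariate bounded minimisation over Ψ ∈ (0, 2π)
res = optimize(Ψ -> cooscillation_distance(X, Y, Ψ), 0.0, 2pi)
cost_ij = Optim.minimum(res)  # the (i, j)-th entry of the cost matrix
```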

As an extension to this, I am employing a bootstrap approach where I randomly permute `Y`, `n` times, and compute the minimum cooscillation distance for each permutation:

```julia
using Random, Optim, FLoops

function get_permuted_cost(data, n_permutations::Int)
    @assert n_permutations > 1 "Number of bootstrap permutations must be greater than 1"
    G = first(size(data))
    cost_permuted = Array{Float64}(undef, G, G, n_permutations)
    @inbounds @views begin
        for ix in 1:G
            @floop for iy in (ix + 1):G
                for n in 1:n_permutations
                    y_random = shuffle(data[iy, :])
                    res = optimize(Ψ -> cooscillation_distance(data[ix, :], y_random, Ψ), 0.0, 2pi)
                    cost_permuted[ix, iy, n] = Optim.minimum(res)
                end
            end
        end
    end
    return cost_permuted
end
```

This (probably naive) implementation is quite slow, e.g.:

```julia
Random.seed!(1234)
G = 500
N = 1000
n = 10
@time get_permuted_cost(rand(G, N), n); # 108.399799 seconds (23.69 M allocations: 160.340 GiB, 3.97% gc time)
```

and I would like to speed it up so it runs in a reasonable time (i.e. < 1 min for `G = 500`, `N = 1000`, `n = 100`). Does anyone have any suggestions for how I could do this?