Speedup optimization in nested for loop

I have adapted some python code that takes data which is G x N, and returns the (strictly upper triangular) matrix cost_permuted whose (i,j)-th entry represents the minimum of cooscillation_distance(data[i, :], data[j, :], Ψ) for Ψ ∈ (0, 2π):

function cooscillation_distance(X, Y, Ψ)
    sum(@. (X^2 + Y^2 - 2cos(Ψ) * X * Y - sin(Ψ)^2)^2 )
end

As an extension to this, I am employing a bootstrap approach where I randomly permute Y, n times, and compute the minimum cooscillation distance for each permutation:

using Random, Optim, FLoops

function get_permuted_cost(
    data,
    n_permutations::Int
)
    @assert n_permutations > 1 "Number of bootstrap permutations must be greater than 1"
    cost_permuted = Array{Float64}(
        undef,
        first(size(data)),
        first(size(data)),
        n_permutations
    )

    @inbounds @views begin
        for ix in 1:first(size(data))
            @floop for iy in (ix+1):first(size(data))
                for n in 1:n_permutations
                    y_random = shuffle(data[iy, :])
                    res = optimize(x->cooscillation_distance(data[ix,:],y_random, x), 0.0, 2pi)
                    cost_permuted[ix, iy, n] = Optim.minimum(res)
                end
            end
        end
    end
    
    return cost_permuted
end

This (probably naive) implementation is quite slow, e.g

Random.seed!(1234)
G = 500
N = 1000
n = 10

@time get_permuted_cost(rand(G,N),n); # 108.399799 seconds (23.69 M allocations: 160.340 GiB, 3.97% gc time)

and I would like to speed it up so it runs in a reasonable time (i.e. <1min for G = 500, N = 1000, n = 100). Does anyone have any suggestions for how I could do this?

I think you can spare a lot of cos and sin evaluations by using a variable cospsi varying between -1 and 1 and replace the sin(psi)^2 by 1- cospsi^2 and of course cos(psi) by cospsi.(actually with your formulation I think you should not look in 0 2pi but 0 pi (since the sin is squared anyway)

On my computer I gain a factor 4.

function cooscillation_distance(X, Y, cosΨ)
    sum(@. (X^2 + Y^2 - 2cosΨ * X * Y - 1+cosΨ^2 ))
end
using Random, Optim, FLoops

function get_permuted_cost(
    data,
    n_permutations::Int
)
    @assert n_permutations > 1 "Number of bootstrap permutations must be greater than 1"
    cost_permuted = Array{Float64}(
        undef,
        first(size(data)),
        first(size(data)),
        n_permutations
    )

    
    @inbounds @views begin
        for ix in 1:first(size(data))
            @floop for iy in (ix+1):first(size(data))
                for n in 1:n_permutations
                    y_random = shuffle(data[iy, :])
                    res = optimize(x->cooscillation_distance(data[ix,:],y_random, x), -1.0,1.0)
                    cost_permuted[ix, iy, n] = Optim.minimum(res)
                end
            end
        end
    end
    
    return cost_permuted
end