That seems like a lot of allocations. I think many of them are coming from your `norm(x .- y)` calls — each one allocates a temporary array for `x .- y`. That's where I'm sure you could do significantly better, even on a plain old CPU. Changing those to `Distances.euclidean` from Distances.jl (and nothing else!) gets me down to 18s on my crappy old M1 laptop, with way fewer allocations:
```julia
julia> @time radii, C_r = grassberger_procaccia(X)
250.868078 seconds (2.13 G allocations: 1.692 TiB, 37.04% gc time, 0.28% compilation time)

julia> @time radii, C_r = grassberger_procaccia_euclidean(X)
 18.588317 seconds (862.84 k allocations: 128.668 MiB, 0.02% gc time, 1.61% compilation time)
```
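For concreteness, here's the shape of the change I mean — this is a made-up inner-loop comparison, not your actual code, but it shows why the broadcasted version allocates and the Distances.jl version doesn't:

```julia
using Distances, LinearAlgebra

x = rand(3); y = rand(3)

# Allocating version: `x .- y` materializes a temporary array on every call.
d_alloc = norm(x .- y)

# Allocation-free version: euclidean iterates over the pair directly,
# accumulating the squared differences without building a temporary.
d_free = euclidean(x, y)

d_alloc ≈ d_free  # same value either way
```

In a hot pairwise loop over all points, that temporary is created O(N²) times, which is where the terabytes of allocation (and the GC time) come from.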
You’ll also want to avoid using `threadid()` to divide up your threads’ work; see [PSA: Thread-local state is no longer recommended; Common misconceptions about `threadid()` and `nthreads()`] for more details there.
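The pattern that PSA recommends is to spawn one task per chunk and let each task own its own accumulator, rather than indexing a shared buffer by `threadid()` (tasks can migrate between threads, so that indexing is no longer safe). A minimal sketch — the `sum` here is just a stand-in for whatever your per-chunk work is:

```julia
using Base.Threads

xs = rand(1_000)

# Avoid: `partial = zeros(nthreads()); partial[threadid()] += ...`
# — that assumes a task stays pinned to one thread, which isn't guaranteed.

# Instead: one task per chunk; each task returns its own partial result.
chunks = Iterators.partition(eachindex(xs), cld(length(xs), nthreads()))
tasks = map(chunks) do idxs
    @spawn sum(@view xs[idxs])
end
total = sum(fetch, tasks)

total ≈ sum(xs)  # partial results combine to the serial answer
```

Packages like OhMyThreads.jl or ChunkSplitters.jl wrap this chunk-and-spawn pattern up for you if you'd rather not write it by hand.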