After some slack discussions with @MirekKratochvil, @Daniel_Gonzalez and @lmiq we got to the following solution which is 10x faster and almost does not allocate anything. I tested this in a mac, probably in an intel machine I would add `@simd`

in the for loop over n_features. Could you try this @Adegasel and let us know the improvement in your machine?

As a comment, there is no need to run the clustering for a MWE, just predefine random centroids. Besides, maybe adding in the title of the post something like **β Optimize code of pairwise distance for centroids assigment with parallelization/GPU β** could help people that know about the topic reach/find this post.

```
using Base.Threads
using BenchmarkTools
function transform!(X::Matrix{F}, centers::Matrix{F}, cluster_assignments::Vector{I}) where {F,I}
n_features, n_examples = size(X)
n_clusters = size(centers, 2)
if length(cluster_assignments) != n_examples
error("output size")
end
Threads.@threads :static for n in 1:n_examples
min_dist = typemax(F)
cluster_assignment = zero(I)
for k in 1:n_clusters
dist = zero(F)
@fastmath for d=1:n_features
@inbounds dist += (X[d,n]-centers[d,k])^2
end
if dist < min_dist
min_dist = dist
cluster_assignment = k
end
end
@inbounds cluster_assignments[n] = cluster_assignment
end
nothing
end
n_features = 128
n_examples = 1000_000
n_clusters = 256
X = rand(Float32, n_features, n_examples)
#R = kmeans(X, n_clusters ; maxiter=3, display=:iter) No need to have this in a MWE
centers = rand(Float32, n_features, n_clusters)
assignments = zeros(Int32, n_examples)
println("\nExecution with $(nthreads()) threads\n")
benchmark = @benchmark transform!(X, centers, assignments)
display(benchmark)
```

This prints

```
Execution with 10 threads
Range (min β¦ max): 362.855 ms β¦ 420.420 ms β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 380.187 ms β GC (median): 0.00%
Time (mean Β± Ο): 382.648 ms Β± 16.640 ms β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ β β β βββ β β β β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
363 ms Histogram: frequency by time 420 ms <
Memory estimate: 5.59 KiB, allocs estimate: 61.
```

My previous version gave me

```
Execution with 10 threads
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max): 3.031 s β¦ 3.284 s β GC (min β¦ max): 45.78% β¦ 49.28%
Time (median): 3.158 s β GC (median): 47.60%
Time (mean Β± Ο): 3.158 s Β± 178.783 ms β GC (mean Β± Ο): 47.60% Β± 2.47%
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
3.03 s Histogram: frequency by time 3.28 s <
Memory estimate: 19.08 GiB, allocs estimate: 767869249.
```