Optimize code by parallelization/GPU

I have developed this transform method in order to assign vectors to previously computed centroids. It is currently taking ~20s on my machine, for 1M vectors and 256 clusters. Since I want to apply it to 1B vectors, this version would take 20.000s, i.e., more than 5h. Could it be speeded up by parallelizing the code, or taking advantage of the GPU?

using Clustering
using BenchmarkTools

function euclidean_mat(y, X, j) where T
    res = zero(eltype(y))
    @inbounds @fastmath @simd for k in eachindex(y)
        partial = X[k, j] - y[k]
        res += partial * partial
    end
    return res
end

function transform(X, R::KmeansResult)
    n_features, n_examples = size(X)
    cluster_assignments = Array{Int32, 1}(undef, n_examples)
    for n in 1:n_examples
        min_dist = typemax(eltype(X))
        cluster_assignment = Int32(0)
        for (j,c) in enumerate(eachcol(R.centers))
            dist = euclidean_mat(c, X, n)
            if dist < min_dist
                min_dist = dist
                cluster_assignment = j
            end
        end
        cluster_assignments[n] = cluster_assignment
    end
    return cluster_assignments
end


n_features = 128
n_examples = 1_000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
R = kmeans(X, n_clusters ; maxiter=10, display=:iter)
@time transform(X,R)

Your script runs in my machine with 10.580 s (767869186 allocations: 19.08 GiB) .

The following code with threads runs in 3.8 sec, note that I just added Threads.@threads loop that iterates over n_examples.

It is still allocating a lot, I suspect it could be made faster avoiding all these allocations.

K-means terminated without convergence after 3 iterations (objv = 9.994752e6)
 3.840 s (767869250 allocations: 19.08 GiB)

New script

using Base.Threads
using Clustering
using BenchmarkTools

function euclidean_mat(y, X, j) where T
    res = zero(eltype(y))
    @inbounds @fastmath @simd for k in eachindex(y)
        partial = X[k, j] - y[k]
        res += partial * partial
    end
    return res
end

function transform(X, R::KmeansResult)
    n_features, n_examples = size(X)
    cluster_assignments = Array{Int32, 1}(undef, n_examples)
    Threads.@threads for n in 1:n_examples
        min_dist = typemax(eltype(X))
        cluster_assignment = Int32(0)
        for (j,c) in enumerate(eachcol(R.centers))
            dist = euclidean_mat(c, X, n)
            if dist < min_dist
                min_dist = dist
                cluster_assignment = j
            end
        end
        cluster_assignments[n] = cluster_assignment
    end
    return cluster_assignments
end


println("\nExecution with $(nthreads()) threads")
n_features = 128
n_examples = 1_000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
R = kmeans(X, n_clusters ; maxiter=3, display=:iter)
@btime transform(X,R)

From your description of the problem in Slack, it is possible that this applies to the problem: Basic use Β· MolecularMinimumDistances.jl

Or something of the sort. This package computes all the minimum distances between two sets-of-sets of points (β€œmolecules”, but that is a detail), within a cutoff.

After some slack discussions with @MirekKratochvil, @Daniel_Gonzalez and @lmiq we got to the following solution which is 10x faster and almost does not allocate anything. I tested this in a mac, probably in an intel machine I would add @simd in the for loop over n_features. Could you try this @Adegasel and let us know the improvement in your machine?

As a comment, there is no need to run the clustering for a MWE, just predefine random centroids. Besides, maybe adding in the title of the post something like ’ Optimize code of pairwise distance for centroids assigment with parallelization/GPU ’ could help people that know about the topic reach/find this post.

using Base.Threads
using BenchmarkTools

function transform!(X::Matrix{F}, centers::Matrix{F}, cluster_assignments::Vector{I}) where {F,I}
    n_features, n_examples = size(X)
    n_clusters = size(centers, 2)
    if length(cluster_assignments) != n_examples
        error("output size")
    end
    Threads.@threads :static for n in 1:n_examples
        min_dist = typemax(F)
        cluster_assignment = zero(I)
        for k in 1:n_clusters
            dist = zero(F)
            @fastmath for d=1:n_features
                @inbounds dist += (X[d,n]-centers[d,k])^2
            end
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        @inbounds cluster_assignments[n] = cluster_assignment
    end
    nothing
end

n_features = 128
n_examples = 1000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
#R = kmeans(X, n_clusters ; maxiter=3, display=:iter) No need to have this in a MWE
centers = rand(Float32, n_features, n_clusters)
assignments = zeros(Int32, n_examples)

println("\nExecution with $(nthreads()) threads\n")
benchmark = @benchmark transform!(X, centers, assignments)
display(benchmark)

This prints

Execution with 10 threads

 Range (min … max):  362.855 ms … 420.420 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     380.187 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   382.648 ms Β±  16.640 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁▁   ▁ ▁    ▁   ▁▁▁ β–ˆ    ▁                ▁  ▁              ▁  
  β–ˆβ–ˆβ–β–β–β–ˆβ–β–ˆβ–β–β–β–β–ˆβ–β–β–β–ˆβ–ˆβ–ˆβ–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  363 ms           Histogram: frequency by time          420 ms <

 Memory estimate: 5.59 KiB, allocs estimate: 61.

My previous version gave me

Execution with 10 threads

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.031 s …    3.284 s  β”Š GC (min … max): 45.78% … 49.28%
 Time  (median):     3.158 s               β”Š GC (median):    47.60%
 Time  (mean Β± Οƒ):   3.158 s Β± 178.783 ms  β”Š GC (mean Β± Οƒ):  47.60% Β±  2.47%

  β–ˆ                                                        β–ˆ  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  3.03 s         Histogram: frequency by time         3.28 s <

 Memory estimate: 19.08 GiB, allocs estimate: 767869249.

I was curious to see the performance with the equivalent function that I previously used for this in python


import numpy as np
import scipy
import time

n_features = 128
n_examples = 1000_000
n_clusters = 256

X = np.random.rand(n_examples, n_features).astype(np.float32)
Y = np.random.rand(n_clusters, n_features).astype(np.float32)

t0 = time.time()
aux = scipy.cluster.vq.vq(X, Y, check_finite=False)
print('time taken', time.time()-t0)

Which prints time taken 1.7729411125183105

You can put the type of distance measure outside the main function with this below. As that it is actually slightly faster (because of the @simd):

function norm_sqr(x::AbstractVector{T},y::AbstractVector{T}) where {T}
    dist = zero(T)
    @simd for i in eachindex(x,y)
        @inbounds dist += (x[i] - y[i])^2
    end
    return dist
end

function transform3!(
    by::F,
    X::Matrix{T},
    centers::Matrix{T},
    cluster_assignments::Vector{I};
) where {F<:Function,T,I<:Integer}
    length(cluster_assignments) != size(X,2) && throw(ArgumentError("output size != number of clusters"))
    Threads.@threads for n in axes(X,2)
        min_dist = typemax(T)
        cluster_assignment = 0
        for k in axes(centers, 2)
            dist = by(@view(X[:,n]),@view(centers[:,k]))
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        cluster_assignments[n] = I(cluster_assignment)
    end
    return nothing
end

by the way: I’m not using @fastmath there.

full code
using Base.Threads
using BenchmarkTools

function transform!(X::Matrix{F}, centers::Matrix{F}, cluster_assignments::Vector{I}) where {F,I}
    n_features, n_examples = size(X)
    n_clusters = size(centers, 2)
    if length(cluster_assignments) != n_examples
        error("output size")
    end
    Threads.@threads :static for n in 1:n_examples
        min_dist = typemax(F)
        cluster_assignment = zero(I)
        for k in 1:n_clusters
            dist = zero(F)
            @fastmath for d=1:n_features
                @inbounds dist += (X[d,n]-centers[d,k])^2
            end
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        @inbounds cluster_assignments[n] = cluster_assignment
    end
    nothing
end

function norm_sqr(x::AbstractVector{T},y::AbstractVector{T}) where {T}
    dist = zero(T)
    @simd for i in eachindex(x,y)
        @inbounds dist += (x[i] - y[i])^2
    end
    return dist
end

function transform3!(
    by::F,
    X::Matrix{T}, 
    centers::Matrix{T}, 
    cluster_assignments::Vector{I};
) where {F<:Function,T,I<:Integer}
    length(cluster_assignments) != size(X,2) && throw(ArgumentError("output size != number of clusters"))
    Threads.@threads for n in axes(X,2)
        min_dist = typemax(T)
        cluster_assignment = 0
        for k in axes(centers, 2)
            dist = by(@view(X[:,n]),@view(centers[:,k]))
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        cluster_assignments[n] = I(cluster_assignment)
    end
    return nothing
end

function main()

    n_features = 128
    n_examples = 1000_000
    n_clusters = 256
    
    X = rand(Float32, n_features, n_examples)
    centers = rand(Float32, n_features, n_clusters)
    assignments0 = zeros(Int32, n_examples)
    assignments1 = zeros(Int32, n_examples)
    
    println("\nExecution with $(nthreads()) threads\n")
    
    transform!(X, centers, assignments0)
    transform3!(norm_sqr, X, centers, assignments1)
    println(assignments0 == assignments1)
    
    @btime transform3!($norm_sqr,$X, $centers, $assignments1)
    @btime transform!($X, $centers, $assignments0)

end

(execute with main())

In an apple silicon M1pro the version I posted is faster. I tried in an intel machine and your version is slightly faster (800 ms vs 860ms in a quad core i5 gen 10 CPU).

1 Like

Significantly? Is the @fastmath the difference?

yes, @simd does almost nothing (I think the SIMD insturctions are at most 128 bit), actually it usually decrease a bit the performance or stays the same.