Optimize code by parallelization/GPU

Adegasel · October 11, 2022, 2:39pm

I have developed this transform method in order to assign vectors to previously computed centroids. It is currently taking ~20s on my machine, for 1M vectors and 256 clusters. Since I want to apply it to 1B vectors, this version would take 20.000s, i.e., more than 5h. Could it be speeded up by parallelizing the code, or taking advantage of the GPU?

using Clustering
using BenchmarkTools

function euclidean_mat(y, X, j) where T
    res = zero(eltype(y))
    @inbounds @fastmath @simd for k in eachindex(y)
        partial = X[k, j] - y[k]
        res += partial * partial
    end
    return res
end

function transform(X, R::KmeansResult)
    n_features, n_examples = size(X)
    cluster_assignments = Array{Int32, 1}(undef, n_examples)
    for n in 1:n_examples
        min_dist = typemax(eltype(X))
        cluster_assignment = Int32(0)
        for (j,c) in enumerate(eachcol(R.centers))
            dist = euclidean_mat(c, X, n)
            if dist < min_dist
                min_dist = dist
                cluster_assignment = j
            end
        end
        cluster_assignments[n] = cluster_assignment
    end
    return cluster_assignments
end


n_features = 128
n_examples = 1_000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
R = kmeans(X, n_clusters ; maxiter=10, display=:iter)
@time transform(X,R)

davidbp · October 11, 2022, 3:00pm

Your script runs in my machine with 10.580 s (767869186 allocations: 19.08 GiB) .

The following code with threads runs in 3.8 sec, note that I just added Threads.@threads loop that iterates over n_examples.

It is still allocating a lot, I suspect it could be made faster avoiding all these allocations.

K-means terminated without convergence after 3 iterations (objv = 9.994752e6)
 3.840 s (767869250 allocations: 19.08 GiB)

New script

using Base.Threads
using Clustering
using BenchmarkTools

function euclidean_mat(y, X, j) where T
    res = zero(eltype(y))
    @inbounds @fastmath @simd for k in eachindex(y)
        partial = X[k, j] - y[k]
        res += partial * partial
    end
    return res
end

function transform(X, R::KmeansResult)
    n_features, n_examples = size(X)
    cluster_assignments = Array{Int32, 1}(undef, n_examples)
    Threads.@threads for n in 1:n_examples
        min_dist = typemax(eltype(X))
        cluster_assignment = Int32(0)
        for (j,c) in enumerate(eachcol(R.centers))
            dist = euclidean_mat(c, X, n)
            if dist < min_dist
                min_dist = dist
                cluster_assignment = j
            end
        end
        cluster_assignments[n] = cluster_assignment
    end
    return cluster_assignments
end


println("\nExecution with $(nthreads()) threads")
n_features = 128
n_examples = 1_000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
R = kmeans(X, n_clusters ; maxiter=3, display=:iter)
@btime transform(X,R)

lmiq · October 11, 2022, 11:10pm

From your description of the problem in Slack, it is possible that this applies to the problem: Basic use · MolecularMinimumDistances.jl

Or something of the sort. This package computes all the minimum distances between two sets-of-sets of points (“molecules”, but that is a detail), within a cutoff.

davidbp · October 12, 2022, 4:53pm

After some slack discussions with @MirekKratochvil, @Daniel_Gonzalez and @lmiq we got to the following solution which is 10x faster and almost does not allocate anything. I tested this in a mac, probably in an intel machine I would add @simd in the for loop over n_features. Could you try this @Adegasel and let us know the improvement in your machine?

As a comment, there is no need to run the clustering for a MWE, just predefine random centroids. Besides, maybe adding in the title of the post something like ’ Optimize code of pairwise distance for centroids assigment with parallelization/GPU ’ could help people that know about the topic reach/find this post.

using Base.Threads
using BenchmarkTools

function transform!(X::Matrix{F}, centers::Matrix{F}, cluster_assignments::Vector{I}) where {F,I}
    n_features, n_examples = size(X)
    n_clusters = size(centers, 2)
    if length(cluster_assignments) != n_examples
        error("output size")
    end
    Threads.@threads :static for n in 1:n_examples
        min_dist = typemax(F)
        cluster_assignment = zero(I)
        for k in 1:n_clusters
            dist = zero(F)
            @fastmath for d=1:n_features
                @inbounds dist += (X[d,n]-centers[d,k])^2
            end
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        @inbounds cluster_assignments[n] = cluster_assignment
    end
    nothing
end

n_features = 128
n_examples = 1000_000
n_clusters = 256

X = rand(Float32, n_features, n_examples)
#R = kmeans(X, n_clusters ; maxiter=3, display=:iter) No need to have this in a MWE
centers = rand(Float32, n_features, n_clusters)
assignments = zeros(Int32, n_examples)

println("\nExecution with $(nthreads()) threads\n")
benchmark = @benchmark transform!(X, centers, assignments)
display(benchmark)

This prints

Execution with 10 threads

 Range (min … max):  362.855 ms … 420.420 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     380.187 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   382.648 ms ±  16.640 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▁   ▁ ▁    ▁   ▁▁▁ █    ▁                ▁  ▁              ▁  
  ██▁▁▁█▁█▁▁▁▁█▁▁▁███▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  363 ms           Histogram: frequency by time          420 ms <

 Memory estimate: 5.59 KiB, allocs estimate: 61.

My previous version gave me

Execution with 10 threads

BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.031 s …    3.284 s  ┊ GC (min … max): 45.78% … 49.28%
 Time  (median):     3.158 s               ┊ GC (median):    47.60%
 Time  (mean ± σ):   3.158 s ± 178.783 ms  ┊ GC (mean ± σ):  47.60% ±  2.47%

  █                                                        █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.03 s         Histogram: frequency by time         3.28 s <

 Memory estimate: 19.08 GiB, allocs estimate: 767869249.

davidbp · October 12, 2022, 5:23pm

I was curious to see the performance with the equivalent function that I previously used for this in python


import numpy as np
import scipy
import time

n_features = 128
n_examples = 1000_000
n_clusters = 256

X = np.random.rand(n_examples, n_features).astype(np.float32)
Y = np.random.rand(n_clusters, n_features).astype(np.float32)

t0 = time.time()
aux = scipy.cluster.vq.vq(X, Y, check_finite=False)
print('time taken', time.time()-t0)

Which prints time taken 1.7729411125183105

lmiq · October 12, 2022, 5:35pm

You can put the type of distance measure outside the main function with this below. As that it is actually slightly faster (because of the @simd):

function norm_sqr(x::AbstractVector{T},y::AbstractVector{T}) where {T}
    dist = zero(T)
    @simd for i in eachindex(x,y)
        @inbounds dist += (x[i] - y[i])^2
    end
    return dist
end

function transform3!(
    by::F,
    X::Matrix{T},
    centers::Matrix{T},
    cluster_assignments::Vector{I};
) where {F<:Function,T,I<:Integer}
    length(cluster_assignments) != size(X,2) && throw(ArgumentError("output size != number of clusters"))
    Threads.@threads for n in axes(X,2)
        min_dist = typemax(T)
        cluster_assignment = 0
        for k in axes(centers, 2)
            dist = by(@view(X[:,n]),@view(centers[:,k]))
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        cluster_assignments[n] = I(cluster_assignment)
    end
    return nothing
end

by the way: I’m not using @fastmath there.

full code

using Base.Threads
using BenchmarkTools

function transform!(X::Matrix{F}, centers::Matrix{F}, cluster_assignments::Vector{I}) where {F,I}
    n_features, n_examples = size(X)
    n_clusters = size(centers, 2)
    if length(cluster_assignments) != n_examples
        error("output size")
    end
    Threads.@threads :static for n in 1:n_examples
        min_dist = typemax(F)
        cluster_assignment = zero(I)
        for k in 1:n_clusters
            dist = zero(F)
            @fastmath for d=1:n_features
                @inbounds dist += (X[d,n]-centers[d,k])^2
            end
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        @inbounds cluster_assignments[n] = cluster_assignment
    end
    nothing
end

function norm_sqr(x::AbstractVector{T},y::AbstractVector{T}) where {T}
    dist = zero(T)
    @simd for i in eachindex(x,y)
        @inbounds dist += (x[i] - y[i])^2
    end
    return dist
end

function transform3!(
    by::F,
    X::Matrix{T}, 
    centers::Matrix{T}, 
    cluster_assignments::Vector{I};
) where {F<:Function,T,I<:Integer}
    length(cluster_assignments) != size(X,2) && throw(ArgumentError("output size != number of clusters"))
    Threads.@threads for n in axes(X,2)
        min_dist = typemax(T)
        cluster_assignment = 0
        for k in axes(centers, 2)
            dist = by(@view(X[:,n]),@view(centers[:,k]))
            if dist < min_dist
                min_dist = dist
                cluster_assignment = k
            end
        end
        cluster_assignments[n] = I(cluster_assignment)
    end
    return nothing
end

function main()

    n_features = 128
    n_examples = 1000_000
    n_clusters = 256
    
    X = rand(Float32, n_features, n_examples)
    centers = rand(Float32, n_features, n_clusters)
    assignments0 = zeros(Int32, n_examples)
    assignments1 = zeros(Int32, n_examples)
    
    println("\nExecution with $(nthreads()) threads\n")
    
    transform!(X, centers, assignments0)
    transform3!(norm_sqr, X, centers, assignments1)
    println(assignments0 == assignments1)
    
    @btime transform3!($norm_sqr,$X, $centers, $assignments1)
    @btime transform!($X, $centers, $assignments0)

end

(execute with main())

davidbp · October 12, 2022, 8:27pm

In an apple silicon M1pro the version I posted is faster. I tried in an intel machine and your version is slightly faster (800 ms vs 860ms in a quad core i5 gen 10 CPU).

lmiq · October 12, 2022, 8:35pm

Significantly? Is the @fastmath the difference?

davidbp · October 12, 2022, 8:46pm

yes, @simd does almost nothing (I think the SIMD insturctions are at most 128 bit), actually it usually decrease a bit the performance or stay the same.

Topic		Replies	Views
Optimization tips for my julia code. Can I make it even faster and/or memory efficient? Performance question , python	24	4212	February 15, 2020
Improve performance of matrix computation Performance	10	1146	April 25, 2018
Parallel implementation is not very effective General Usage question , performance , parallel , multithreading	19	790	October 8, 2022
[ANN] ParallelKMeans v1.0.0 - KMeans In Super Sonic Mode Package Announcements	16	1191	May 26, 2021
Thinking through distance matrix calculation GPU gpu , multithreading	5	2756	May 16, 2017

Optimize code by parallelization/GPU

Related topics