Optimization tips for my julia code. Can I make it even faster and/or memory efficient?

Hi everyone,

I have been learning Julia since January by trying to implement classic machine learning algorithms from scratch. My latest project is the KMeans clustering algorithm. I have been successful with this implementation (by comparing results with the implementation sklearn with Python).

The Python implementation (using Python 3.81, sklearn 0.22.1, and IPython 7.12.0 ) is as follows;

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000000, n_features=3, 
                  centers=3, cluster_std=0.9, random_state=80)


def test_speed(x):
    """
    Elbow method selection method for best k in terms of inertia
    """
    for i in range(2, 11):
        
        model = KMeans(n_clusters=i, init='k-means++', max_iter=300).fit(x)
        l, c, s = model.labels_, model.cluster_centers_, model.inertia_


%timeit test_speed(X)

which gives this result:

1min 20s ± 5.57 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

My current Julia implementation is as follows:

using Pkg

# Replace python environment to suit your needs
ENV["PYTHON"] = "YOUR PYTHON ENVIRONMENT WITH NEEDED PACKAGES"
Pkg.build("PyCall")  # Build PyCall to suit the specified Python env

using PyCall
using Plots
using LinearAlgebra
using Statistics
using StatsBase
using BenchmarkTools
using Distances


# import the same data
data = pyimport("sklearn.datasets")

X, y = data.make_blobs(n_samples=1000000, n_features=3, 
                       centers=3, cluster_std=0.9, random_state=80);


""" 
    smart_init(X, k; init="k-means++")

    This function handles the random initialisation of the centroids from the
    design matrix (X) and desired groups (k) that a user supplies. 
"""
function smart_init(X::Array{Float64, 2}, k::Int; init::String="k-means++")
    n_row, n_col = size(X)

    if init == "k-means++"
        
        # randonmly select the first centroid from the data (X)
        centroids = zeros(k, n_col)
        rand_idx = rand(1:n_row)
        centroids[1, :] = X[rand_idx, :]

        # compute distances from the first centroid chosen to all the other data points
        first_centroid_matrix = convert(Matrix, centroids[1, :]')
        
        # flatten distances
        distances = vec(pairwise(Euclidean(), X, first_centroid_matrix, dims = 1))

        for i = 2:k
            # choose the next centroid, the probability for each data point to be chosen
            # is directly proportional to its squared distance from the nearest centroid
            prob = distances .^ 2
            r_idx = sample(1:n_row, ProbabilityWeights(prob))
            centroids[i, :] = X[r_idx, :]
            
            # Ignore setting the last centroid to help the separation of centroids
            if i == (k-1)
                break
            end

            # compute distances from the centroids to all data points
            current_centroid_matrix = convert(Matrix, centroids[i, :]')
            new_distances = vec(pairwise(Euclidean(), X, current_centroid_matrix, dims = 1))
            
            # and update the squared distance as the minimum distance to all centroid
            distances = minimum([distances, new_distances])

        end

    else
        # randomly select points from the design matrix as the initial centroids
        rand_indices = rand(1:n_row, k)
        centroids = X[rand_indices, :]

    end

    return centroids, n_row, n_col
end


""" 
    sum_of_squares(x, labels, centre, k)
    
    This function computes the total sum of squares
"""
function sum_of_squares(x::Array{Float64,2}, labels::Array{Int64,1}, centre::Array, k::Int)
    ss = 0

    for j = 1:k
        group_data = x[findall(labels .== j), :]
        group_centroid_matrix  = convert(Matrix, centre[j, :]')
        group_distance = pairwise(Euclidean(), group_data, group_centroid_matrix, dims=1)

        ss += sum(group_distance .^ 2)
    end

    return ss
end


""" 
    Kmeans(design_matrix, k; k_init="k-means++", max_iters=300, tol=1e-4, verbose=true)

    This main function employs the K-means algorithm to cluster all examples 
    in the training data (design_matrix) into k groups using either the 
    `k-means++` or random initialisation.
"""
function Kmeans(design_matrix::Array{Float64, 2}, k::Int64; k_init::String="k-means++",
    max_iters::Int64=300, tol=1e-4, verbose::Bool=true)

    centroids, n_row, n_col = smart_init(design_matrix, k, init=k_init)

    labels = rand(1:k, n_row)
    distances = zeros(n_row)

    J_previous = Inf64

    # Update centroids & labels with closest members until convergence
    for iter = 1:max_iters
        nearest_neighbour = pairwise(Euclidean(), design_matrix, centroids, dims=1)

        min_val_idx = findmin.(eachrow(nearest_neighbour))

        distances = [x[1] for x in min_val_idx]
        labels = [x[2] for x in min_val_idx]

        centroids = [ mean( X[findall(labels .== j), : ], dims = 1) for j = 1:k]
        centroids = reduce(vcat, centroids)
        
        # Cost objective
        J = (norm(distances) ^ 2) / n_row

        if verbose
            # Show progress and terminate if J stopped decreasing.
            println("Iteration ", iter, ": Jclust = ", J, ".")
        end;

        # Final Step: Check for convergence
        if iter > 1 && abs(J - J_previous) < (tol * J)

            sum_squares = sum_of_squares(design_matrix, labels, centroids, k)
            
            # Terminate algorithm with the assumption that K-means has converged
            if verbose
                println("Successfully terminated with convergence.")
            end

            return labels, centroids, sum_squares

        elseif iter == max_iters && abs(J - J_previous) > (tol * J)
            throw(error("Failed to converge Check data and/or implementation or increase max_iter."))
            
        end;

        J_previous = J
    end

end

# Test the speed of the implementation
function test_speed(x)
    for i = 2:10
        l, c, s = Kmeans(X, i, k_init="k-means++", verbose=false)
    end
end

# Benchmark results
r = @benchmark test_speed(X) samples=7 seconds=300

The benchmarking results are impressive as my-not-so-optimized implementation is already almost 5x faster than a heavily optimized Python library like sklearn;

BenchmarkTools.Trial: 
  memory estimate:  14.43 GiB
  allocs estimate:  76011811
  --------------
  minimum time:     15.255 s (14.98% GC)
  median time:      16.433 s (14.41% GC)
  mean time:        16.422 s (14.80% GC)
  maximum time:     17.681 s (17.37% GC)
  --------------
  samples:          7
  evals/sample:     1

Are there any optimization tips that can even squeeze more performance from Julia? Thanks in advance for even reading this long post! I love Julia!

2 Likes

There is a small typo in your gist, in test_speed:

l, c, s = Kmeans(X, i, k_init="k-means++", verbose=false)

should be

l, c, s = Kmeans(x, i, k_init="k-means++", verbose=false)

As for the optimization tips, you can profile your function

julia> using Profile

julia> @profile test_speed(X)
julia> Profile.print(format = :flat, sortedby = :count)

which shows that most of the time code spends here

nearest_neighbour = pairwise(Euclidean(), design_matrix, centroids, dims=1)

min_val_idx = findmin.(eachrow(nearest_neighbour))

One has to think how to optimize it, one way for example to use mutable version of pairwise, for example like this

# ... omitting previous lines
    nearest_neighbour = Array{Float64, 2}(undef, size(design_matrix, 1), size(centroids, 1))

    # Update centroids & labels with closest members until convergence
    for iter = 1:max_iters
        pairwise!(nearest_neighbour, Euclidean(), design_matrix, centroids, dims=1)

It reduces allocations from 16Gib to 12Gib

1 Like

I see quite a bit of inefficient things going on in your code. You seem to create a lot of temporary arrays by slicing and copying. Also, it appears that the code favours row-major, while Julia is column-major.

I’m not so familiar with k-means, so I’ll just look at your sum-of-squares function. Lots of searching and slicing, operating along rows, temporary arrays that are not needed, calculating distances, which takes the square root, then squaring the numbers again afterwards.

Here I’ve just written a mostly mock-up of a more efficient design. I’m not confident that it’s correct, and I’m assuming that you change the orientation of your arrays so that things work along the columns:

function sumsq(x, labels, centre, k)
    # remember to switch dims to columnwise
    ss = 0.0 # maybe use some clever `zero(promote_type(...))` thing here
    for i in eachindex(labels)
        ind = labels[i]
        !(1 <= ind <= k) && continue  # is this necessary, or do you use all labels?
        for j in axes(x, 1) # inner loop along columns
            ss += (x[j, i] - centre[j, ind])^2
        end
    end
    return ss
end

This may not work correctly, but it should give an idea for a direction. See that all operations are scalar, and just accumulating square differences everywhere. I did not test this, and it may not do exactly the same as your code, but it can give an idea for how to write fast code. This function should have zero allocation.

Hoping it’s not too far off. I tried to be clever about the labels, I think it ends up as the same thing as your search procedure :grimacing:

1 Like

Problem of min_val_idx can be solved by manual unrolling of this calculation (it’s fine, since it is bottleneck)

    labels = Vector{Int}(undef, n_row)
    distances = Vector{Float64}(undef, n_row)

    J_previous = Inf64

    nearest_neighbour = Array{Float64, 2}(undef, size(design_matrix, 1), size(centroids, 1))

    # Update centroids & labels with closest members until convergence
    for iter = 1:max_iters
        pairwise!(nearest_neighbour, Euclidean(), design_matrix, centroids, dims=1)

        for i in axes(nearest_neighbour, 1)
            labels[i] = 1
            distances[i] = nearest_neighbour[i, 1]
            for j in 2:size(nearest_neighbour, 2)
                if distances[i] > nearest_neighbour[i, j]
                    labels[i] = j
                    distances[i] = nearest_neighbour[i, j]
                end
            end
        end

which gives

julia> @time test_speed(X)
12.490450 seconds (10.14 k allocations: 5.014 GiB, 3.18% gc time)

It still allocates too much, yet we achieved 20% time decrease.

1 Like

By the way, if Euclidean returns square root of the distance, it would make sense to use squared version of this metric, because you will remove unnecessary calculation of square root (which takes it’s time) and also you can simplify calculation of the cost objective, instead of

 J = (norm(distances) ^ 2) / n_row

you will get simply

J = mean(distances)

Edit: yes, changing Euclidean to SqEuclidean and changing cost function, calculation time decreased on 1.5s.

Edit2: by optimizing centroids = [ mean( X[findall(labels .== j), : ], dims = 1) for j = 1:k] you can gain additional x2 speedup with average execution time ~ 5s.

1 Like

So many amazing suggestions. I’m currently going over them and updating the code. I will post the optimized code as well.

Edit it to what? It’s the same as the first one or am I missing something?

I am sorry, discourse told me, that I shouldn’t write so many posts in sequence and it’s better to edit previous. I didn’t write the changed code, left it as an exercise :slight_smile: To optimize it you should use the same ideas - less allocations, less unnecessary runs through this huge matrix.

Of course if there will be any problems, I can show the solution.

1 Like

Unfortunately, my programming skills isn’t that advanced yet (hence all the temporary assignments for debugging). Please do share your more optimized solution.

To be honest, this optimizations quite the opposite of advanced in this case. Instead of writing sophisticated code, you go to more and more low level. This is rather sad, may be someone knows better way (and it should be, for sure)

So, somewhere at the beginning of your function you define

    centroids_cnt = Vector{Float64}(undef, size(centroids, 1))

and instead of the comprehensions you use this

        centroids .= 0.0
        centroids_cnt .= 0.0
        for i in axes(design_matrix, 1)
            for j in axes(design_matrix, 2)
                centroids[labels[i], j] += design_matrix[i, j]
                centroids_cnt[labels[i]] += 1.0
            end
        end
        for i in axes(centroids, 1)
            for j in axes(centroids, 2)
                centroids[i, j] /= centroids_cnt[i]
            end
        end

It may be a good idea, to pack into something like update_centroids!(centroids, X, labels, centroids_cnt).
Re-use of centroids_cnt is overkill of course, I just eliminate all allocations.

I think, that it is possible to get something like ~1s of execution time in fully optimized code.

1 Like

At that point, the smart_init function has already selected the initial centroids. Wouldn’t centroids .= 0.0 overwrite this at that point?

This can be replaced with centroids ./= centroids_cnt, with similar performance and no allocations. (Well, faster, I think, since you are iterating along the rows, while broadcast will know to iterate over columns.)

BTW, shouldn’t centroid_cnt be Int? Or do I misunderstand.

Furthermore, switching the dimensions should really help for the rest of the code.

1 Like

You right, I got carried away with looping.

centroids_cnt should be Int of course, but since it appears in division operation, I thought that it would be safer to use Float from the beginning. But it doesn’t affect performance at all, and it can be Int, since it’s more readable.

At this point you already used information from smart_init, so data that was stored in centroids is no longer needed. So, yes, you overwrite it. Main idea that original centroids = [...] did the same, but also allocate new memory region, which takes it’s time. On the other hand centroids .= 0 and consequent loops use the same memory, so its faster.

Full version here: https://gist.github.com/Arkoniak/f8abdc8dc89e4a3b501c781eac4e739d

That shouldn’t be an issue.

That makes to me now. I have so much to learn! It’s faster and the memory footprint is leaner. I’m so used to assignments but it’s clear that most of them are not needed especially at the sacrifice of speed.

I’m writing the blog post about my experiences with this implementation. I will definitely cite this thread for the optimization suggested. Thanks a lot for your time! :love_you_gesture:

2 Likes

You are welcome, but to be honest, all I did was just using tips from Performance Tips, it’s all there.

2 Likes

Bit of an unrelated question but how can I convert scientific notation numbers (like 1.837790990638948e7) to a normal float? convert(Float64, 1.837790990638948e7) seems not to be working.

julia> x
1.837790990638948e7

julia> typeof(x)
Float64

julia> using Printf

julia> @printf("%f", x)
18377909.906389

It’s already a normal number. They just print using scientific notation by default, but you can use Printf if you’d like to print with different formatting.

2 Likes

There seems to be a bug which has been introduced by this optimized version. Try Kmeans(X, 7) to reproduce.