# Optimization tips for my julia code. Can I make it even faster and/or memory efficient?

Hi everyone,

I have been learning Julia since January by trying to implement classic machine learning algorithms from scratch. My latest project is the KMeans clustering algorithm. I have been successful with this implementation (by comparing results with the implementation sklearn with Python).

The Python implementation (using `Python 3.81`, `sklearn 0.22.1`, and `IPython 7.12.0` ) is as follows;

``````from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000000, n_features=3,
centers=3, cluster_std=0.9, random_state=80)

def test_speed(x):
"""
Elbow method selection method for best k in terms of inertia
"""
for i in range(2, 11):

model = KMeans(n_clusters=i, init='k-means++', max_iter=300).fit(x)
l, c, s = model.labels_, model.cluster_centers_, model.inertia_

%timeit test_speed(X)
``````

which gives this result:

``````1min 20s ± 5.57 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

My current Julia implementation is as follows:

``````using Pkg

# Replace python environment to suit your needs
ENV["PYTHON"] = "YOUR PYTHON ENVIRONMENT WITH NEEDED PACKAGES"
Pkg.build("PyCall")  # Build PyCall to suit the specified Python env

using PyCall
using Plots
using LinearAlgebra
using Statistics
using StatsBase
using BenchmarkTools
using Distances

# import the same data
data = pyimport("sklearn.datasets")

X, y = data.make_blobs(n_samples=1000000, n_features=3,
centers=3, cluster_std=0.9, random_state=80);

"""
smart_init(X, k; init="k-means++")

This function handles the random initialisation of the centroids from the
design matrix (X) and desired groups (k) that a user supplies.
"""
function smart_init(X::Array{Float64, 2}, k::Int; init::String="k-means++")
n_row, n_col = size(X)

if init == "k-means++"

# randonmly select the first centroid from the data (X)
centroids = zeros(k, n_col)
rand_idx = rand(1:n_row)
centroids[1, :] = X[rand_idx, :]

# compute distances from the first centroid chosen to all the other data points
first_centroid_matrix = convert(Matrix, centroids[1, :]')

# flatten distances
distances = vec(pairwise(Euclidean(), X, first_centroid_matrix, dims = 1))

for i = 2:k
# choose the next centroid, the probability for each data point to be chosen
# is directly proportional to its squared distance from the nearest centroid
prob = distances .^ 2
r_idx = sample(1:n_row, ProbabilityWeights(prob))
centroids[i, :] = X[r_idx, :]

# Ignore setting the last centroid to help the separation of centroids
if i == (k-1)
break
end

# compute distances from the centroids to all data points
current_centroid_matrix = convert(Matrix, centroids[i, :]')
new_distances = vec(pairwise(Euclidean(), X, current_centroid_matrix, dims = 1))

# and update the squared distance as the minimum distance to all centroid
distances = minimum([distances, new_distances])

end

else
# randomly select points from the design matrix as the initial centroids
rand_indices = rand(1:n_row, k)
centroids = X[rand_indices, :]

end

return centroids, n_row, n_col
end

"""
sum_of_squares(x, labels, centre, k)

This function computes the total sum of squares
"""
function sum_of_squares(x::Array{Float64,2}, labels::Array{Int64,1}, centre::Array, k::Int)
ss = 0

for j = 1:k
group_data = x[findall(labels .== j), :]
group_centroid_matrix  = convert(Matrix, centre[j, :]')
group_distance = pairwise(Euclidean(), group_data, group_centroid_matrix, dims=1)

ss += sum(group_distance .^ 2)
end

return ss
end

"""
Kmeans(design_matrix, k; k_init="k-means++", max_iters=300, tol=1e-4, verbose=true)

This main function employs the K-means algorithm to cluster all examples
in the training data (design_matrix) into k groups using either the
`k-means++` or random initialisation.
"""
function Kmeans(design_matrix::Array{Float64, 2}, k::Int64; k_init::String="k-means++",
max_iters::Int64=300, tol=1e-4, verbose::Bool=true)

centroids, n_row, n_col = smart_init(design_matrix, k, init=k_init)

labels = rand(1:k, n_row)
distances = zeros(n_row)

J_previous = Inf64

# Update centroids & labels with closest members until convergence
for iter = 1:max_iters
nearest_neighbour = pairwise(Euclidean(), design_matrix, centroids, dims=1)

min_val_idx = findmin.(eachrow(nearest_neighbour))

distances = [x for x in min_val_idx]
labels = [x for x in min_val_idx]

centroids = [ mean( X[findall(labels .== j), : ], dims = 1) for j = 1:k]
centroids = reduce(vcat, centroids)

# Cost objective
J = (norm(distances) ^ 2) / n_row

if verbose
# Show progress and terminate if J stopped decreasing.
println("Iteration ", iter, ": Jclust = ", J, ".")
end;

# Final Step: Check for convergence
if iter > 1 && abs(J - J_previous) < (tol * J)

sum_squares = sum_of_squares(design_matrix, labels, centroids, k)

# Terminate algorithm with the assumption that K-means has converged
if verbose
println("Successfully terminated with convergence.")
end

return labels, centroids, sum_squares

elseif iter == max_iters && abs(J - J_previous) > (tol * J)
throw(error("Failed to converge Check data and/or implementation or increase max_iter."))

end;

J_previous = J
end

end

# Test the speed of the implementation
function test_speed(x)
for i = 2:10
l, c, s = Kmeans(X, i, k_init="k-means++", verbose=false)
end
end

# Benchmark results
r = @benchmark test_speed(X) samples=7 seconds=300

``````

The benchmarking results are impressive as my-not-so-optimized implementation is already almost 5x faster than a heavily optimized Python library like sklearn;

``````BenchmarkTools.Trial:
memory estimate:  14.43 GiB
allocs estimate:  76011811
--------------
minimum time:     15.255 s (14.98% GC)
median time:      16.433 s (14.41% GC)
mean time:        16.422 s (14.80% GC)
maximum time:     17.681 s (17.37% GC)
--------------
samples:          7
evals/sample:     1
``````

Are there any optimization tips that can even squeeze more performance from Julia? Thanks in advance for even reading this long post! I love Julia!

2 Likes

There is a small typo in your gist, in `test_speed`:

``````l, c, s = Kmeans(X, i, k_init="k-means++", verbose=false)
``````

should be

``````l, c, s = Kmeans(x, i, k_init="k-means++", verbose=false)
``````

As for the optimization tips, you can profile your function

``````julia> using Profile

julia> @profile test_speed(X)
julia> Profile.print(format = :flat, sortedby = :count)
``````

which shows that most of the time code spends here

``````nearest_neighbour = pairwise(Euclidean(), design_matrix, centroids, dims=1)

min_val_idx = findmin.(eachrow(nearest_neighbour))
``````

One has to think how to optimize it, one way for example to use mutable version of `pairwise`, for example like this

``````# ... omitting previous lines
nearest_neighbour = Array{Float64, 2}(undef, size(design_matrix, 1), size(centroids, 1))

# Update centroids & labels with closest members until convergence
for iter = 1:max_iters
pairwise!(nearest_neighbour, Euclidean(), design_matrix, centroids, dims=1)
``````

It reduces allocations from 16Gib to 12Gib

1 Like

I see quite a bit of inefficient things going on in your code. You seem to create a lot of temporary arrays by slicing and copying. Also, it appears that the code favours row-major, while Julia is column-major.

I’m not so familiar with k-means, so I’ll just look at your sum-of-squares function. Lots of searching and slicing, operating along rows, temporary arrays that are not needed, calculating distances, which takes the square root, then squaring the numbers again afterwards.

Here I’ve just written a mostly mock-up of a more efficient design. I’m not confident that it’s correct, and I’m assuming that you change the orientation of your arrays so that things work along the columns:

``````function sumsq(x, labels, centre, k)
# remember to switch dims to columnwise
ss = 0.0 # maybe use some clever `zero(promote_type(...))` thing here
for i in eachindex(labels)
ind = labels[i]
!(1 <= ind <= k) && continue  # is this necessary, or do you use all labels?
for j in axes(x, 1) # inner loop along columns
ss += (x[j, i] - centre[j, ind])^2
end
end
return ss
end
``````

This may not work correctly, but it should give an idea for a direction. See that all operations are scalar, and just accumulating square differences everywhere. I did not test this, and it may not do exactly the same as your code, but it can give an idea for how to write fast code. This function should have zero allocation.

Hoping it’s not too far off. I tried to be clever about the labels, I think it ends up as the same thing as your search procedure 1 Like

Problem of `min_val_idx` can be solved by manual unrolling of this calculation (it’s fine, since it is bottleneck)

``````    labels = Vector{Int}(undef, n_row)
distances = Vector{Float64}(undef, n_row)

J_previous = Inf64

nearest_neighbour = Array{Float64, 2}(undef, size(design_matrix, 1), size(centroids, 1))

# Update centroids & labels with closest members until convergence
for iter = 1:max_iters
pairwise!(nearest_neighbour, Euclidean(), design_matrix, centroids, dims=1)

for i in axes(nearest_neighbour, 1)
labels[i] = 1
distances[i] = nearest_neighbour[i, 1]
for j in 2:size(nearest_neighbour, 2)
if distances[i] > nearest_neighbour[i, j]
labels[i] = j
distances[i] = nearest_neighbour[i, j]
end
end
end
``````

which gives

``````julia> @time test_speed(X)
12.490450 seconds (10.14 k allocations: 5.014 GiB, 3.18% gc time)
``````

It still allocates too much, yet we achieved 20% time decrease.

1 Like

By the way, if `Euclidean` returns square root of the distance, it would make sense to use squared version of this metric, because you will remove unnecessary calculation of square root (which takes it’s time) and also you can simplify calculation of the cost objective, instead of

`````` J = (norm(distances) ^ 2) / n_row
``````

you will get simply

``````J = mean(distances)
``````

Edit: yes, changing `Euclidean` to `SqEuclidean` and changing cost function, calculation time decreased on 1.5s.

Edit2: by optimizing `centroids = [ mean( X[findall(labels .== j), : ], dims = 1) for j = 1:k]` you can gain additional x2 speedup with average execution time ~ 5s.

1 Like

So many amazing suggestions. I’m currently going over them and updating the code. I will post the optimized code as well.

Edit it to what? It’s the same as the first one or am I missing something?

I am sorry, discourse told me, that I shouldn’t write so many posts in sequence and it’s better to edit previous. I didn’t write the changed code, left it as an exercise To optimize it you should use the same ideas - less allocations, less unnecessary runs through this huge matrix.

Of course if there will be any problems, I can show the solution.

1 Like

Unfortunately, my programming skills isn’t that advanced yet (hence all the temporary assignments for debugging). Please do share your more optimized solution.

To be honest, this optimizations quite the opposite of advanced in this case. Instead of writing sophisticated code, you go to more and more low level. This is rather sad, may be someone knows better way (and it should be, for sure)

So, somewhere at the beginning of your function you define

``````    centroids_cnt = Vector{Float64}(undef, size(centroids, 1))
``````

and instead of the comprehensions you use this

``````        centroids .= 0.0
centroids_cnt .= 0.0
for i in axes(design_matrix, 1)
for j in axes(design_matrix, 2)
centroids[labels[i], j] += design_matrix[i, j]
centroids_cnt[labels[i]] += 1.0
end
end
for i in axes(centroids, 1)
for j in axes(centroids, 2)
centroids[i, j] /= centroids_cnt[i]
end
end
``````

It may be a good idea, to pack into something like `update_centroids!(centroids, X, labels, centroids_cnt)`.
Re-use of `centroids_cnt` is overkill of course, I just eliminate all allocations.

I think, that it is possible to get something like ~1s of execution time in fully optimized code.

1 Like

At that point, the smart_init function has already selected the initial centroids. Wouldn’t `centroids .= 0.0` overwrite this at that point?

This can be replaced with `centroids ./= centroids_cnt`, with similar performance and no allocations. (Well, faster, I think, since you are iterating along the rows, while broadcast will know to iterate over columns.)

BTW, shouldn’t `centroid_cnt` be `Int`? Or do I misunderstand.

Furthermore, switching the dimensions should really help for the rest of the code.

1 Like

You right, I got carried away with looping.

`centroids_cnt` should be `Int` of course, but since it appears in division operation, I thought that it would be safer to use `Float` from the beginning. But it doesn’t affect performance at all, and it can be `Int`, since it’s more readable.

At this point you already used information from `smart_init`, so data that was stored in `centroids` is no longer needed. So, yes, you overwrite it. Main idea that original `centroids = [...]` did the same, but also allocate new memory region, which takes it’s time. On the other hand `centroids .= 0` and consequent loops use the same memory, so its faster.

That shouldn’t be an issue.

That makes to me now. I have so much to learn! It’s faster and the memory footprint is leaner. I’m so used to assignments but it’s clear that most of them are not needed especially at the sacrifice of speed.

I’m writing the blog post about my experiences with this implementation. I will definitely cite this thread for the optimization suggested. Thanks a lot for your time! 2 Likes

You are welcome, but to be honest, all I did was just using tips from Performance Tips, it’s all there.

2 Likes

Bit of an unrelated question but how can I convert scientific notation numbers (like `1.837790990638948e7`) to a normal float? `convert(Float64, 1.837790990638948e7)` seems not to be working.

``````julia> x
1.837790990638948e7

julia> typeof(x)
Float64

julia> using Printf

julia> @printf("%f", x)
18377909.906389
``````

It’s already a normal number. They just print using scientific notation by default, but you can use Printf if you’d like to print with different formatting.

2 Likes

There seems to be a bug which has been introduced by this optimized version. Try `Kmeans(X, 7)` to reproduce.