# HELP: Implementing K-means from scratch with Julia

Hi everyone,

I am currently stuck trying to port an existing Julia code. Basically a code that builds a simple implementation of the K-means clustering algorithm. Here’s the testing data generated from sklearn:

``````# Replace python environment to suit your needs
ENV["PYTHON"] = "YOUR-PYTHON-ENV"
Pkg.build("PyCall")  # Build PyCall to suit the specified Python env

using PyCall
using Statistics
using LinearAlgebra
using Plots

# import sklearn datasets
data = pyimport("sklearn.datasets")|

X, y = data.make_blobs(n_samples=100000, n_features=3, centers=3, cluster_std=0.9, random_state=80)
``````

The original Julia (which works well) is as follows:

``````function kmeans(x, k; maxiters = 100, tol = 1e-5)
x = collect(eachrow(x))
N = length(x)
n = length(x)
distances = zeros(N) # used to store the distance of each
# point to the nearest representative.
reps = [zeros(n) for j=1:k] # used to store representatives.

# ’assignment’ is an array of N integers between 1 and k.
# The initial assignment is chosen randomly.
assignment = [ rand(1:k) for i in 1:N ]

Jprevious = Inf # used in stopping condition

for iter = 1:maxiters

# Cluster j representative is average of points in cluster j.
for j = 1:k
group = [i for i=1:N if assignment[i] == j]
reps[j] = sum(x[group]) / length(group);
end;

# For each x[i], find distance to the nearest reprepresentative
# and its group index

for i = 1:N
(distances[i], assignment[i]) = findmin([norm(x[i] - reps[j]) for j = 1:k])
end;

# Compute clustering objective.
J = norm(distances)^2 / N

# Show progress and terminate if J stopped decreasing.
println("Iteration ", iter, ": Jclust = ", J, ".")

if iter > 1 && abs(J - Jprevious) < tol * J
return assignment, reps
end

Jprevious = J
end

end

kmeans(X, 3)
``````

I am trying to learn port this with some minor tweaks. Here’s my current implementation which is giving me an `ERROR: MethodError: Cannot `convert` an object of type Float64 to an object of type Array{Float64,1}`:

``````function Kmeans(X, k; max_iters = 300, tol = 1e-5)
# Reshape 2D array to a 1D array with length of all training examples
# where each example is of size (n, ) ie the new array is just a list of example array
X_array_list = collect(eachrow(X))

# Save some info on the incoming data
N = length(X_array_list)  # Length of all training examples
n = length(X_array_list)  # Length of a single training example
distances = zeros(N)  # Empty vector for all training examples. Useful later

# Step 1: Random initialization
reps_centroids = [zeros(n) for grp = 1:k]  # Initiate centroids for each
labels = rand(1:k, N)  # Randomly assign labels (between 1 to k) to all training examples

J_previous = Inf

for iter = 1:max_iters

# Step 2: Update the representative centroids for each group
for j = 1:k
# get group indices for each group
group_idx = [i for i = 1:N if labels[i] == j]

# use group indices to locate each group
reps_centroids[j] = mean(X_array_list[group_idx]);
end;

# Step 3: Update the group labels
for i = 1:N
# compute the distance between each example and the updated representative centroid
nearest_rep_distance = [norm(X_array_list[i] - reps_centroids[x]) for x = 1:k]

# update distances and label arrays with value and index of closest neighbour
# findmin returns the min value & index location
distances[i], labels[i] = findmin(nearest_rep_distance)
end;

# Step 4: Compute the clustering cost
J = (norm(distances)^ 2) / N

# Show progress and terminate if J stopped decreasing.
println("Iteration ", iter, ": Jclust = ", J, ".")

# Final Step 5: Check for convergence
if iter > 1 && abs(J - J_previous) < (tol * J)
# TODO: Calculate the sum of squares

# Terminate algorithm with the assumption that K-means has converged
return labels, reps_centroids

end

J_previous = J
end

end

Kmeans(X, 3)
``````

I have been scratching my head for 2 hours now as I can see where the error is coming from. What am I missing? Thanks in advance.

UPDATE: Bug has been fixed now Does the second version prints anything? Could you point the line of the function in which your code breaks? Providing an easier testable version that generates random data instead of loading with pycall might help others to test your code.

It breaks on `reps_centroids[j] = mean(X[group_idx]);`

I guess I failed to change the variable names It looks like `reps_centroids` is a vector of vectors. But you are setting elements of it to `mean(...)`, which is a scalar.

I’m not sure exactly the goals of your code so it’s difficult to give you more concrete help. But I think that’s your problem. Solutions might be to initialize `reps_centroids = [0 for grop = 1:k]`, or maybe you want your call to `mean` to actually return a vector of means. In which case you have to broadcast.

It’s been fixed now. The problem actually is that I failed to update calls to X_array_list after I changed the that variable name. And yes, the centroids is a vector of vectors and mean correctly assigns a vector for each group.