HELP: Implementing K-means from scratch with Julia

Hi everyone,

I am currently stuck trying to port some existing Julia code: a simple implementation of the K-means clustering algorithm. Here’s the test data, generated with sklearn:

# Replace python environment to suit your needs
ENV["PYTHON"] = "YOUR-PYTHON-ENV"
using Pkg
Pkg.build("PyCall")  # Build PyCall to suit the specified Python env

using PyCall
using Statistics
using LinearAlgebra
using Plots

# import sklearn datasets
data = pyimport("sklearn.datasets")

X, y = data.make_blobs(n_samples=100000, n_features=3, centers=3, cluster_std=0.9, random_state=80)
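If you would rather avoid the PyCall dependency entirely, a rough stand-in for make_blobs can be sketched in pure Julia. Note this is only an approximation: the center positions, their spread, and the function name make_blobs_julia below are my own choices, not sklearn's.

```julia
using Random

# Hypothetical stand-in for sklearn's make_blobs: sample n_samples points
# around `centers` randomly placed cluster centers, with Gaussian noise
# of standard deviation cluster_std around each center.
function make_blobs_julia(n_samples, n_features, centers; cluster_std = 0.9, seed = 80)
    rng = MersenneTwister(seed)
    C = [10 .* rand(rng, n_features) for _ in 1:centers]  # random cluster centers
    y = rand(rng, 1:centers, n_samples)                   # true label of each point
    X = zeros(n_samples, n_features)
    for i in 1:n_samples
        X[i, :] = C[y[i]] .+ cluster_std .* randn(rng, n_features)
    end
    return X, y
end

X, y = make_blobs_julia(1000, 3, 3)
```

This returns X as an n_samples-by-n_features matrix, so it can be fed to the kmeans functions below in the same way as the sklearn output.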

The original Julia (which works well) is as follows:

function kmeans(x, k; maxiters = 100, tol = 1e-5)
    x = collect(eachrow(x))
    N = length(x)
    n = length(x[1])
    distances = zeros(N) # used to store the distance of each
    # point to the nearest representative.
    reps = [zeros(n) for j=1:k] # used to store representatives.

    # 'assignment' is an array of N integers between 1 and k.
    # The initial assignment is chosen randomly.
    assignment = [ rand(1:k) for i in 1:N ]

    Jprevious = Inf # used in stopping condition

    for iter = 1:maxiters

        # Cluster j representative is average of points in cluster j.
        for j = 1:k
            group = [i for i=1:N if assignment[i] == j]
            reps[j] = sum(x[group]) / length(group)
        end

        # For each x[i], find distance to the nearest representative
        # and its group index.
        for i = 1:N
            (distances[i], assignment[i]) = findmin([norm(x[i] - reps[j]) for j = 1:k])
        end

        # Compute clustering objective.
        J = norm(distances)^2 / N

        # Show progress and terminate if J stopped decreasing.
        println("Iteration ", iter, ": Jclust = ", J, ".")

        if iter > 1 && abs(J - Jprevious) < tol * J
            return assignment, reps
        end

        Jprevious = J
    end
end


kmeans(X, 3)

I am trying to port this with some minor tweaks as a learning exercise. Here’s my current implementation, which gives me ERROR: MethodError: Cannot convert an object of type Float64 to an object of type Array{Float64,1}:

function Kmeans(X, k; max_iters = 300, tol = 1e-5)
    # Reshape 2D array to a 1D array with length of all training examples
    # where each example is of size (n, ) ie the new array is just a list of example array
    X_array_list = collect(eachrow(X))

    # Save some info on the incoming data
    N = length(X_array_list)  # Length of all training examples
    n = length(X_array_list[1])  # Length of a single training example
    distances = zeros(N)  # Empty vector for all training examples. Useful later

    # Step 1: Random initialization
    reps_centroids = [zeros(n) for grp = 1:k]  # Initiate centroids for each
    labels = rand(1:k, N)  # Randomly assign labels (between 1 to k) to all training examples

    J_previous = Inf

    for iter = 1:max_iters

        # Step 2: Update the representative centroids for each group
        for j = 1:k
            # get group indices for each group
            group_idx = [i for i = 1:N if labels[i] == j]

            # use group indices to locate each group
            reps_centroids[j] = mean(X[group_idx])
        end

        # Step 3: Update the group labels
        for i = 1:N
            # compute the distance between each example and the updated representative centroid
            nearest_rep_distance = [norm(X_array_list[i] - reps_centroids[x]) for x = 1:k]

            # update distances and label arrays with value and index of closest neighbour
            # findmin returns the min value & index location
            distances[i], labels[i] = findmin(nearest_rep_distance)
        end

        # Step 4: Compute the clustering cost
        J = (norm(distances)^2) / N

        # Show progress and terminate if J stopped decreasing.
        println("Iteration ", iter, ": Jclust = ", J, ".")

        # Final Step 5: Check for convergence
        if iter > 1 && abs(J - J_previous) < (tol * J)
            # TODO: Calculate the sum of squares

            # Terminate algorithm with the assumption that K-means has converged
            return labels, reps_centroids
        end

        J_previous = J
    end
end


Kmeans(X, 3)

I have been scratching my head for 2 hours now as I can’t see where the error is coming from. What am I missing? Thanks in advance.

UPDATE: Bug has been fixed now :slight_smile:

Does the second version print anything? Could you point to the line of the function where your code breaks? Providing an easily testable version that generates random data instead of loading it with PyCall might help others test your code.

It breaks on reps_centroids[j] = mean(X[group_idx]);

I guess I failed to change the variable names :frowning:

It looks like reps_centroids is a vector of vectors. But you are setting elements of it to mean(...), which is a scalar.

I’m not sure exactly what the goals of your code are, so it’s difficult to give you more concrete help. But I think that’s your problem. Solutions might be to initialize reps_centroids = [0 for grp = 1:k], or maybe you want your call to mean to actually return a vector of means, in which case you have to broadcast.
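To illustrate the distinction with some toy vectors (these are just for demonstration): mean over a vector of vectors returns the element-wise mean vector, while mean over a vector of scalars returns a single number, which is what triggers the conversion error when it is assigned into a slot expecting a vector.

```julia
using Statistics

vv = [[1.0, 2.0], [3.0, 4.0]]  # a vector of vectors
mean(vv)                       # element-wise mean: a vector, [2.0, 3.0]

vs = [1.0, 3.0]                # a vector of scalars
mean(vs)                       # a single Float64: 2.0
```

So if group_idx indexes into a vector of row-vectors (as X_array_list is), mean returns a centroid vector as intended; if it indexes into something that yields scalars, the assignment fails with the MethodError above.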

It’s been fixed now. The problem actually is that I failed to update calls to use X_array_list after I changed that variable name. And yes, the centroids variable is a vector of vectors, and mean correctly assigns a vector to each group.
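For anyone landing here later, here is a minimal self-contained sketch of the fixed version, using random toy data instead of PyCall. The two-blob data below is my own stand-in, not the original sklearn call, and like the original code this sketch assumes no cluster becomes empty during iteration.

```julia
using Statistics, LinearAlgebra, Random

function Kmeans(X, k; max_iters = 300, tol = 1e-5)
    X_array_list = collect(eachrow(X))          # list of row-vectors
    N = length(X_array_list)                    # number of examples
    n = length(X_array_list[1])                 # features per example
    distances = zeros(N)
    reps_centroids = [zeros(n) for grp = 1:k]
    labels = rand(1:k, N)                       # random initial assignment
    J_previous = Inf
    for iter = 1:max_iters
        for j = 1:k
            group_idx = [i for i = 1:N if labels[i] == j]
            reps_centroids[j] = mean(X_array_list[group_idx])  # the fixed line
        end
        for i = 1:N
            nearest = [norm(X_array_list[i] - reps_centroids[j]) for j = 1:k]
            distances[i], labels[i] = findmin(nearest)
        end
        J = norm(distances)^2 / N
        if iter > 1 && abs(J - J_previous) < tol * J
            return labels, reps_centroids
        end
        J_previous = J
    end
    return labels, reps_centroids
end

# Toy data: two well-separated 2-D blobs
Random.seed!(1)
X = vcat(randn(50, 2), randn(50, 2) .+ 8)
labels, centroids = Kmeans(X, 2)
```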