Hi everyone,
I am currently stuck trying to port some existing Julia code: a simple implementation of the K-means clustering algorithm. Here's the test data, generated via sklearn:
```julia
# Replace the Python environment to suit your needs
using Pkg
ENV["PYTHON"] = "YOUR-PYTHON-ENV"
Pkg.build("PyCall") # Build PyCall against the specified Python env

using PyCall
using Statistics
using LinearAlgebra
using Plots

# Import the sklearn datasets module
data = pyimport("sklearn.datasets")
X, y = data.make_blobs(n_samples=100000, n_features=3, centers=3, cluster_std=0.9, random_state=80)
```
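As a side note, if you only want test data and would rather avoid the Python dependency, blobs of the same shape can be generated in pure Julia. A rough sketch (`make_blobs_julia` is my own hypothetical helper, not a drop-in replacement for sklearn's `make_blobs` — the center placement and seeding here are arbitrary choices):

```julia
using Random

# Hypothetical helper: draw n_samples points from `centers` isotropic
# Gaussian blobs in n_features dimensions, returning the data matrix
# and the true cluster label of each point.
function make_blobs_julia(n_samples, n_features, centers; cluster_std=0.9, seed=80)
    rng = MersenneTwister(seed)
    # Random cluster centers inside a [-10, 10] box
    C = [20 .* rand(rng, n_features) .- 10 for _ in 1:centers]
    X = zeros(n_samples, n_features)
    y = zeros(Int, n_samples)
    for i in 1:n_samples
        j = rand(rng, 1:centers)                           # pick a blob
        X[i, :] = C[j] .+ cluster_std .* randn(rng, n_features)  # sample around its center
        y[i] = j
    end
    return X, y
end
```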
The original Julia code (which works well) is as follows:
```julia
function kmeans(x, k; maxiters = 100, tol = 1e-5)
    x = collect(eachrow(x))
    N = length(x)
    n = length(x[1])
    distances = zeros(N)  # used to store the distance of each
                          # point to the nearest representative.
    reps = [zeros(n) for j = 1:k]  # used to store representatives.
    # 'assignment' is an array of N integers between 1 and k.
    # The initial assignment is chosen randomly.
    assignment = [rand(1:k) for i in 1:N]
    Jprevious = Inf  # used in stopping condition
    for iter = 1:maxiters
        # Cluster j representative is average of points in cluster j.
        for j = 1:k
            group = [i for i = 1:N if assignment[i] == j]
            reps[j] = sum(x[group]) / length(group)
        end
        # For each x[i], find the distance to the nearest representative
        # and its group index.
        for i = 1:N
            (distances[i], assignment[i]) = findmin([norm(x[i] - reps[j]) for j = 1:k])
        end
        # Compute clustering objective.
        J = norm(distances)^2 / N
        # Show progress and terminate if J stopped decreasing.
        println("Iteration ", iter, ": Jclust = ", J, ".")
        if iter > 1 && abs(J - Jprevious) < tol * J
            return assignment, reps
        end
        Jprevious = J
    end
end

kmeans(X, 3)
```
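(In case it helps anyone reading: the reason a single `findmin` call can update both the distance and the assignment arrays is that it returns a `(value, index)` tuple. A quick self-contained check:)

```julia
using LinearAlgebra

xi = [1.0, 2.0]                              # one data point
reps = [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]]  # k = 3 representatives
dists = [norm(xi - r) for r in reps]         # distance to each representative
d, j = findmin(dists)
# xi coincides with reps[2], so d == 0.0 and j == 2
```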
I am trying to port this with some minor tweaks as a learning exercise. Here's my current implementation, which is giving me `ERROR: MethodError: Cannot convert an object of type Float64 to an object of type Array{Float64,1}`:
```julia
function Kmeans(X, k; max_iters = 300, tol = 1e-5)
    # Reshape the 2D array into a 1D array of all training examples,
    # where each example has size (n,), i.e. the new array is just a list of example arrays
    X_array_list = collect(eachrow(X))
    # Save some info on the incoming data
    N = length(X_array_list)     # Number of training examples
    n = length(X_array_list[1])  # Length of a single training example
    distances = zeros(N)         # Empty vector for all training examples. Useful later.
    # Step 1: Random initialization
    reps_centroids = [zeros(n) for grp = 1:k]  # Initialize a centroid for each group
    labels = rand(1:k, N)  # Randomly assign labels (between 1 and k) to all training examples
    J_previous = Inf
    for iter = 1:max_iters
        # Step 2: Update the representative centroid of each group
        for j = 1:k
            # Get the indices of the examples in group j
            group_idx = [i for i = 1:N if labels[i] == j]
            # Use the group indices to locate each group
            reps_centroids[j] = mean(X_array_list[group_idx])
        end
        # Step 3: Update the group labels
        for i = 1:N
            # Compute the distance between each example and the updated representative centroids
            nearest_rep_distance = [norm(X_array_list[i] - reps_centroids[x]) for x = 1:k]
            # Update the distance and label arrays with the value and index of the closest
            # centroid; findmin returns the min value & its index location
            distances[i], labels[i] = findmin(nearest_rep_distance)
        end
        # Step 4: Compute the clustering cost
        J = (norm(distances)^2) / N
        # Show progress and terminate if J stopped decreasing.
        println("Iteration ", iter, ": Jclust = ", J, ".")
        # Final Step 5: Check for convergence
        if iter > 1 && abs(J - J_previous) < (tol * J)
            # TODO: Calculate the sum of squares
            # Terminate the algorithm with the assumption that K-means has converged
            return labels, reps_centroids
        end
        J_previous = J
    end
end

Kmeans(X, 3)
```
I have been scratching my head for two hours now, as I can't see where the error is coming from. What am I missing? Thanks in advance.
UPDATE: The bug has been fixed now.
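For anyone who lands here with the same message: the error itself just means a scalar `Float64` was assigned into a slot typed `Array{Float64,1}` (for example, an element of the centroid list). A minimal, self-contained reproduction, not tied to any particular line above:

```julia
reps = [zeros(3) for _ in 1:2]  # a Vector{Vector{Float64}}
try
    reps[1] = 1.0  # scalar assigned into a vector-typed slot
catch err
    # MethodError: Cannot `convert` an object of type Float64
    # to an object of type Vector{Float64}
    println(err)
end
```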