Ok, so let’s imagine that I’m a newcomer to the field of machine learning, and the very first task I want to do in Julia is something like a kNN classifier. Nice! So let’s try to do it step by step, using all those cool libraries!
using RDatasets
data = dataset("datasets", "iris")
X = data[:, 1:4]
y = data[:, 5]
So far so good! Now I want to plot my data: something like a scatter plot, where points are colored according to species.
Of course I can do one scatter plot using the Plots package… Can I?
The first intention must be something like:
plot(X[:, 1:2], color = y)
Oh no: my X is a DataFrame, which is not accepted by the plot function. OK then, maybe I should convert it to a Matrix first (why should I do that? Is it absolutely necessary? Why?)
plot(convert(Matrix, X[:, 1 : 2]), color = y)
It does not work either, because for some reason a categorical variable is not allowed as color.
OK then, you try to make numeric labels out of this y, and you end up with something like:
using MLDataUtils
ylabs = convertlabel(1 : nlabel(y),y)
Xmat = Matrix(X[:, 1 : 2])
scatter(Xmat, color = ylabs)
It works!.. But not as expected. One last thing!
scatter(Xmat[:, 1], Xmat[:, 2], color = ylabs)
Now it works as expected!
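(A side note I only discovered much later: Plots.jl has a `group` attribute that splits the data into series by a label vector, so, if I’m not mistaken, the categorical y can be passed directly and the whole labeling dance skipped:)

```julia
# same plot, letting Plots split the series by the categorical labels
scatter(Xmat[:, 1], Xmat[:, 2], group = y)
```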
OK, now you can say: well, you should use recipes for these things and all these nice macros!
Fine! Can I produce pair plots like in R in just one simple command? No! Only corrplot comes somewhat near the desired output, but you cannot use different colors for different groups. So go read all the docs and manuals, man, master those scary macros, and eventually write your own recipe.
But right now you just want to see the data; is it really so hard? So maybe you end up with something like this:
function pairplot(X, y)
    colnames = String.(names(X))
    classes = nlabel(y)
    n = size(X, 2)
    ylab = convertlabel(1:classes, y)
    plotter = Matrix{Any}(undef, n, n)
    # first row and first column: the plots that carry the labels
    plotter[1, 1] = histogram(X[:, 1],
                              ylabel = colnames[1],
                              title = colnames[1])
    for j in 2:n
        plotter[1, j] = scatter(X[:, 1], X[:, j],
                                markercolor = ylab,
                                ylabel = colnames[j])
        plotter[j, 1] = plot(title = colnames[j])
    end
    # diagonal: a histogram of each column
    for i in 2:n
        plotter[i, i] = histogram(X[:, i])
    end
    # one triangle stays empty
    for i in 1:n
        for j in 2:(i - 1)
            plotter[i, j] = plot()
        end
    end
    # the other triangle holds the pairwise scatter plots
    for i in 2:n
        for j in (i + 1):n
            plotter[i, j] = scatter(X[:, i], X[:, j],
                                    markercolor = ylab)
        end
    end
    plot(plotter...,
         layout = grid(n, n),
         legend = false)
end
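Calling it on the iris data from above is then a one-liner (assuming the `using` lines from earlier in the post):

```julia
pairplot(X, y)
```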
which produces roughly the pair plot I was after.
I know this is an ugly ad-hoc solution to the problem. But it works as expected and produces the desired output.
Now you just try to implement a kNN classifier, using for example NearestNeighbors as a basis. First you should split your data into two parts… Wait, is there any package for train-test splitting? Such a basic thing to do, I’m 100% sure there must be one… MLDataUtils looks nice! Let’s try it out!
using MLDataUtils
splitobs((X, y), at = 0.8)
This is the obvious use case: X is a DataFrame, y is the labels. But… it’s not working! It only works in this form:
splitobs((Matrix(X)', y), at = 0.8)
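With the transposed form accepted, a bare-bones kNN classifier on top of NearestNeighbors is only a few lines. This is just a sketch (the majority vote via `StatsBase.mode` is my own shortcut, not something either package provides):

```julia
using NearestNeighbors, MLDataUtils, StatsBase

# observations must be columns, hence the transposed Matrix
(Xtrain, ytrain), (Xtest, ytest) = splitobs((Matrix(X)', y), at = 0.8)

tree = KDTree(Matrix(Xtrain))          # build the index on the training points
idxs, _ = knn(tree, Matrix(Xtest), 5)  # 5 nearest training points per test point
ypred = [mode(ytrain[i]) for i in idxs]
accuracy = mean(ypred .== ytest)
```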
And so on… each function has its own distracting “properties”, and 99% of the workflow consists of endless conversion between DataFrames, Matrices, and transposed matrices. Support for categorical variables is very weak: every single machine learning package for some reason reimplements some form of one-hot (or other) encoding. Even the basic “describe” function is meaningless for categorical variables. For instance, it returns min and max for a categorical column. What sense does that make? More useful information would be, for example, the number of observations in each category.
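(For the record, those per-category counts can be squeezed out of DataFrames itself with a split-apply-combine call; with a reasonably recent DataFrames this should do it:)

```julia
using DataFrames

# observations per category, i.e. what describe could show instead of min/max
combine(groupby(data, :Species), nrow => :count)
```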
Visualisation tools for trained models are also somewhat raw. For example, how can one inspect the decision regions of a classifier? You have to implement it yourself, using other packages.
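For completeness, here is the kind of thing you end up writing yourself: evaluate the classifier on a dense grid and draw the result as a heatmap under the data points. `predict_one` is a hypothetical function mapping a 2-element feature vector to an integer label; everything else is plain Plots.jl:

```julia
using Plots

# paint decision regions by classifying every point of a dense grid
function decision_regions(predict_one, x1, x2; steps = 200)
    xs = range(minimum(x1), maximum(x1), length = steps)
    ys = range(minimum(x2), maximum(x2), length = steps)
    Z = [predict_one([a, b]) for b in ys, a in xs]  # rows follow ys, cols follow xs
    heatmap(xs, ys, Z, alpha = 0.3, legend = false)
    scatter!(x1, x2)
end
```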