Dear all,
I would like to announce the availability of “BetaML”, the Beta Machine Learning toolkit, a package of Machine Learning algorithms and related utilities.
The toolkit is currently made of 4 modules. Perceptron includes the classical perceptron linear classifier, but also the non-linear kernel perceptron and the gradient-based Pegasos classifier. Nn implements easy-to-model Artificial Neural Networks (simple feed-forward only for the moment, but we plan to add support for convolutional, Recurrent Neural Network and LSTM layers). Note that automatic differentiation with Zygote is optional: you can pass your own derivative of the activation function if you wish (common ones are provided). Clustering has algorithms such as kmeans, kmedoids and Expectation-Maximisation based on Gaussian Mixture Models (GMM). As the EM algorithm supports partially missing observations (observations with missing data only on some dimensions), it is used as the backbone algorithm for collaborative filtering (recommendation systems). Finally, Utils is a module implementing common functions such as scaling, one-hot encoding, various kernels and distance metrics.
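As a quick taste of the Clustering module, here is a minimal sketch of partitioning a toy dataset with kmeans. The return values (a tuple with the per-record cluster ids and the cluster centres) are my reading of the API and should be treated as an assumption, not a verbatim excerpt:

using BetaML.Clustering
# A toy dataset: 6 records in 2 dimensions forming two well-separated groups
X = [1.0 1.1;
     0.9 1.0;
     1.1 0.9;
     10.0 10.2;
     9.8 10.1;
     10.1 9.9]
# Partition the records in 2 clusters
# (assumption: kmeans(X,K) returns the cluster id of each record and the centroids)
clusterIds, centroids = kmeans(X, 2)
clusterIds  # e.g. [1,1,1,2,2,2] (which group gets which id may vary between runs)
centroids   # a 2×2 matrix with the coordinates of the two cluster centres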
BetaML most likely has value only didactically, as the approaches are the “vanilla” ones, i.e. the simplest possible ones, and GPU is not supported. For “serious” machine learning work in Julia I would suggest using either Flux or Knet.
As the focus is mainly didactic, functions have longer but more explicit names than usual… for example the Dense layer is a DenseLayer, the RBF kernel is radialKernel, etc.
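A small sketch of what this naming looks like in practice (DenseLayer and relu also appear in the full example further below; that radialKernel behaves as a two-vector similarity function is my assumption):

using BetaML.Nn, BetaML.Utils
l = DenseLayer(4,10,f=relu)         # the "Dense" layer, spelled out
radialKernel([1.0,2.0],[1.0,2.0])   # the RBF kernel, spelled out: 1.0 for identical vectors
radialKernel([1.0,2.0],[3.0,4.0])   # a value below 1.0 for different vectors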
That said, Julia is a relatively fast language and most of the hard work is done in multithreaded functions or in matrix operations whose underlying libraries may be multithreaded, so the toolkit is reasonably fast for small exploratory tasks. It is also already very flexible: for example, one can implement their own layer as a subtype of the abstract type Layer, their own optimisation algorithm as a subtype of OptimisationAlgorithm, or even specify their own distance metric in the kmedoids algorithm…
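For instance, a minimal sketch of the last point, a user-supplied distance in kmedoids, could look like the following (the `dist` keyword name and the returned tuple are assumptions on my side):

using BetaML.Clustering
X = [1.0 1.0; 1.2 0.9; 0.9 1.1; 9.0 9.5; 9.2 9.1; 9.1 9.3]
manhattanDistance(x,y) = sum(abs.(x .- y))                    # a custom (L1) metric
clusterIds, medoids = kmedoids(X, 2, dist=manhattanDistance)  # assumption: the keyword is `dist`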
This repository started as an implementation in the Julia language of the concepts taught in the MITx 6.86x - Machine Learning with Python: from Linear Models to Deep Learning course, and theoretical notes describing most of these algorithms are available in the companion repository GitHub - sylvaticus/MITx_6.86x: Notes of MITx 6.86x - Machine Learning with Python: from Linear Models to Deep Learning.
Cheers,
Antonello Lobianco, Bureau d’Economie Théorique et Appliquée of Nancy & AgroParisTech
References:
- main repository: GitHub - sylvaticus/BetaML.jl: Beta Machine Learning Toolkit
- documentation: Index · BetaML.jl Documentation
- online runnable notebooks: https://sylvaticus.github.io/BetaML.jl/dev/Notebooks.html
(yep, the logo is inspired by a popular superhero… the wish is that whenever we have a numerical problem, the Beta Machine Learning toolkit could come to the rescue with its superpowers!)
This is a full example of multi-class classification of the Sepal dataset:
# Load Modules
using BetaML.Nn, BetaML.Utils, DelimitedFiles, Random, StatsPlots # Load the BetaML modules and auxiliary packages
Random.seed!(123); # Fix the random seed (to obtain reproducible results)
# Load the data
iris = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
iris = iris[shuffle(axes(iris, 1)), :] # Shuffle the records, as they aren't shuffled by default
x = convert(Array{Float64,2}, iris[:,1:4])
y = map(x->Dict("setosa" => 1, "versicolor" => 2, "virginica" =>3)[x],iris[:, 5]) # Convert the target column to numbers
y_oh = oneHotEncoder(y) # Convert to One-hot representation (e.g. 2 => [0 1 0], 3 => [0 0 1])
# Split the data in training/testing sets
ntrain = Int64(round(size(x,1)*0.8))
xtrain = x[1:ntrain,:]
ytrain = y[1:ntrain]
ytrain_oh = y_oh[1:ntrain,:]
xtest = x[ntrain+1:end,:]
ytest = y[ntrain+1:end]
# Define the Artificial Neural Network model
l1 = DenseLayer(4,10,f=relu) # Activation function is ReLU
l2 = DenseLayer(10,3) # Activation function is identity by default
l3 = VectorFunctionLayer(3,3,f=softMax) # Add a (parameterless) layer whose activation function (softMax in this case) is applied to all its nodes at once
mynn = buildNetwork([l1,l2,l3],squaredCost,name="Multinomial logistic regression Model Sepal") # Build the NN and use the squared cost (aka MSE) as error function
# Train it (defaults to SGD)
res = train!(mynn,scale(xtrain),ytrain_oh,epochs=100,batchSize=6) # Use optAlg=SGD (Stochastic Gradient Descent) by default
# Test it
ŷtrain = predict(mynn,scale(xtrain)) # Note the scaling function
ŷtest = predict(mynn,scale(xtest))
trainAccuracy = accuracy(ŷtrain,ytrain,tol=1) # 0.983
testAccuracy = accuracy(ŷtest,ytest,tol=1) # 1.0
# Visualise results
testSize = size(ŷtest,1)
ŷtestChosen = [argmax(ŷtest[i,:]) for i in 1:testSize]
groupedbar([ytest ŷtestChosen], label=["ytest" "ŷtest (est)"], title="True vs estimated categories") # All records correctly labelled !
plot(0:res.epochs,res.ϵ_epochs, xlabel="epochs",ylabel="error",legend=nothing,title="Avg. error per epoch on the Sepal dataset")
PS: thanks to @kevbonham on topic 37198:
It turned out that writing tests and documentation and getting CI and registration set up was almost as time-consuming as writing the library itself, but it was a very rewarding experience!