 # Code to train a Support Vector Machine?

Hello,
In the MITx 6.86x “Machine Learning with Python: From Linear Models to Deep Learning” course we were presented with a recitation that uses scikit-learn to train an SVM model (code follows).

In particular, the key functions are `model = linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=alpha[i])`, which builds the model, and `score = ms.cross_val_score(model, X, y, cv=5)`, which automatically partitions the training set into 5 folds, each time training the SVM on 4 of them (using stochastic gradient descent, or so they told us) and using the remaining one, different each time, as a validation set, returning a vector of scores.
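
To make sure I understand the mechanics before porting, here is my (untested) plain-Julia sketch of what I believe `cross_val_score` does under the hood; `fitscore` is a hypothetical placeholder for “train on the 4 training folds, return the accuracy on the validation fold”:

```julia
using Random

# Hand-rolled k-fold cross-validation (sketch, untested).
# fitscore(Xtrain, ytrain, Xval, yval) is a hypothetical callback that
# trains a model and returns its validation accuracy.
function cross_val_score(fitscore, X, y; k = 5)
    n = size(X, 1)
    idx = shuffle(1:n)                  # random permutation of the row indices
    folds = [idx[f:k:n] for f in 1:k]   # k roughly equal-sized folds
    scores = Float64[]
    for fold in folds
        train = setdiff(idx, fold)      # the remaining k-1 folds
        push!(scores, fitscore(X[train, :], y[train], X[fold, :], y[fold]))
    end
    return scores                       # one score per fold, like sklearn's version
end
```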

I would like to port it to Julia, but before coding inefficient gradient descent and training methods myself, I wonder whether efficient versions of these algorithms already exist.
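
For example, I stumbled on LIBSVM.jl; if I read its README correctly, something like the sketch below should train a linear SVM (untested; note that LIBSVM solves the SVM dual with SMO rather than running SGD, so it is not a literal port of `SGDClassifier`, and the dummy `X`/`y` are only there so the snippet runs):

```julia
using LIBSVM

# Dummy stand-in data just so the snippet runs: 200 points, 30 features
X = randn(200, 30)
y = rand([0, 1], 200)

# LIBSVM.jl expects a (nfeatures × ninstances) matrix, i.e. observations
# in columns, hence the transpose. cost is LIBSVM's C parameter (roughly
# the inverse of sklearn's alpha regularization strength).
model = svmtrain(Matrix(X'), y; kernel = Kernel.Linear, cost = 1.0)

# svmpredict returns the predicted labels and the decision values
ŷ, _ = svmpredict(model, Matrix(X'))
println("training accuracy = ", sum(ŷ .== y) / length(y))
```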

Also, how could I access the dataset from Julia? Through PyCall?
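
Something like this (untested) is what I have in mind, assuming scikit-learn is installed in the Python environment that PyCall uses and a recent PyCall with the `obj.attr` syntax:

```julia
using PyCall

skdatasets = pyimport("sklearn.datasets")  # needs scikit-learn in PyCall's Python
cancer = skdatasets.load_breast_cancer()

X = cancer.data    # NumPy array, auto-converted to a 569×30 Julia matrix
y = cancer.target  # labels: 0 = malignant, 1 = benign
```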

Please consider that I know nothing about Machine Learning (this is just the first unit…).

Here is the Python code from the recitation:

```python
#!/usr/bin/env python
# coding: utf-8

### Scikit-Learn: https://scikit-learn.org/stable/

# Imports
import numpy as np
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from sklearn import datasets
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import model_selection as ms

# Load the Wisconsin breast cancer dataset bundled with scikit-learn
cancer_data = datasets.load_breast_cancer()
y = cancer_data.target # Training labels ('malignant' = 0, 'benign' = 1)
X = cancer_data.data # 30 attributes; https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
X = preprocessing.scale(X) # scale each attribute to zero mean and unit variance

# Plot the first 2 attributes of the training points
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)
plt.xlabel('Tumor Radius')
plt.ylabel('Tumor Texture')
plt.grid(True)
plt.show()

alpha = np.arange(1e-15,1,0.005) # Range of hyperparameter values 1E-15 to 1 by 0.005
val_scores = np.zeros((len(alpha),1)) # Initialize validation score for each alpha value

for i in range(len(alpha)): # for each alpha value
    # Set up SVM with hinge loss and l2 norm regularization
    model = linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=alpha[i])
    # Calculate cross validation scores for 5-fold cross-validation
    score = ms.cross_val_score(model, X, y, cv=5)
    val_scores[i] = score.mean() # Calculate mean of the 5 fold scores

# Plot how cross-validation score changes with alpha
plt.plot(alpha,val_scores)
plt.xlim(0,1)
plt.xlabel('alpha')
plt.ylabel('Mean Cross-Validation Accuracy')
plt.grid(True)
plt.show()

# Determine the alpha that maximizes the cross-validation score
ind = np.argmax(val_scores)
alpha_star = alpha[ind]
print('alpha_star =', alpha_star)

plt.plot(alpha,val_scores)
plt.plot(np.ones(11)*alpha_star,np.arange(0,1.1,0.1),'--r')
plt.xlim(0,1)
plt.ylim(0.94,0.98)
plt.xlabel('alpha')
plt.ylabel('Mean Cross-Validation Accuracy')
plt.grid(True)
plt.show()

# Train model with alpha_star
model_star = linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=alpha_star)
model_trained = model_star.fit(X,y)
print('Training Accuracy =', model_trained.score(X,y))
# Training Accuracy = 0.9806678383128296

# Plot the decision boundary of the trained model
# The boundary satisfies w0*x0 + w1*x1 + b = 0, so in the (x0, x1) plane its
# slope is -w0/w1 (the intercept is ignored here; the data were scaled to zero mean)
slope = -model_trained.coef_[0,0]/model_trained.coef_[0,1]
x1 = np.arange(-10,10,0.5)
y1 = slope*x1
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)
plt.plot(x1,y1,'--k')
plt.xlim(-4,4)
plt.ylim(-6,6)
plt.show()
```