regularization parameter and iteration of sgdclassifier in scikit-learn

SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations. The model is fit by minimizing a regularized training error of the form E(w, b) = 1/n * sum_i L(y_i, f(x_i)) + alpha * R(w), where L is a loss function that measures model (mis)fit, R is a regularization term (aka penalty) that penalizes model complexity, and alpha > 0 is a non-negative hyperparameter that controls the regularization strength.

The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon; this parameter depends on the scale of the target variables.

Empirically, SGD tends to converge after observing approximately 10^6 training samples, so a reasonable first guess for the number of iterations is n_iter = np.ceil(10**6 / n), where n is the size of the training set.
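As a rough sketch of that heuristic (the training-set size n below is made up purely for illustration):

import numpy as np

n = 50_000                        # assumed number of training samples (illustrative)
n_iter = int(np.ceil(10**6 / n))  # heuristic first guess: observe roughly 10^6 samples in total
# In current scikit-learn versions this value would be passed as max_iter,
# e.g. SGDClassifier(max_iter=n_iter)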

>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2")
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.coef_
array([[9.9..., 9.9...]])
>>> clf.intercept_
array([-9.9...])
>>> clf.decision_function([[2., 2.]])
array([29.6...])
>>> clf = SGDClassifier(loss="log").fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([[0.00..., 0.99...]])
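The Huber and epsilon-insensitive losses mentioned above apply to regression rather than classification. A minimal sketch using SGDRegressor (the toy data and the epsilon value are illustrative, not taken from the snippet above):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_reg = rng.rand(200, 1)
y_reg = 3 * X_reg.ravel() + 0.1 * rng.randn(200)

# epsilon sets the width of the insensitive region and should reflect the scale of y_reg
robust_reg = SGDRegressor(loss="huber", epsilon=0.1, max_iter=1000, tol=1e-3)
robust_reg.fit(X_reg, y_reg)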

Suggestion : 2

Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model; its default value is 0.0001. The Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification, and Scikit-learn provides the SGDClassifier module for it. Similarly, the SGD regressor implements a plain SGD learning routine supporting various loss functions and penalties for fitting linear regression models, available through the SGDRegressor module.
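To illustrate how alpha is typically tuned, here is a hedged sketch using GridSearchCV over a small, arbitrary grid of candidate values (the grid itself is illustrative; the toy data matches the script below):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

# Search a few candidate regularization strengths; cv=2 because the toy set is tiny
param_grid = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SGDClassifier(max_iter=1000, tol=1e-3), param_grid, cv=2)
search.fit(X, Y)
print(search.best_params_)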

The following Python script uses the SGDClassifier linear model −

import numpy as np
from sklearn import linear_model

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
SGDClf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet")
SGDClf.fit(X, Y)

Output

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
              n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

Now, once fitted, the model can predict new values as follows −

SGDClf.predict([[2., 2.]])

Similarly, we can get the value of the intercept with the help of the following Python script −

SGDClf.intercept_

We can get the signed distance to the hyperplane by using SGDClassifier.decision_function, as in the following Python script −

SGDClf.decision_function([[2., 2.]])
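The value returned above is just the linear score w · x + b, so it can be reproduced from the fitted coef_ and intercept_ attributes. A quick sanity check (reusing the SGDClf fitted above; x_new is only an illustrative name):

import numpy as np

x_new = np.array([[2., 2.]])
manual_score = (x_new.dot(SGDClf.coef_.T) + SGDClf.intercept_).ravel()
# Should match SGDClf.decision_function(x_new) up to floating-point error
print(manual_score, SGDClf.decision_function(x_new))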

Suggestion : 3

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Let’s generate some linear-looking data to test this equation on (Figure 4-1):

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

Now let’s compute the parameter vector θ̂ using the Normal Equation. We will use the inv() function from NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for matrix multiplication:

X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise. Let’s see what the equation found:

>>> theta_best
array([[4.21509616],
       [2.77011339]])
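For reference, the same least-squares solution can also be obtained with a numerically more robust routine (a small aside, reusing X_b and y from above; this line is not part of the original code):

# Pseudoinverse (computed via SVD) handles singular or ill-conditioned cases gracefully
theta_best_svd = np.linalg.pinv(X_b).dot(y)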

Let’s plot this model’s predictions (Figure 4-2):

import matplotlib.pyplot as plt
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

Performing linear regression using Scikit-Learn is quite simple:

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> lin_reg.predict(X_new)
array([[4.21509616],
       [9.75532293]])

Let’s look at a quick implementation of this algorithm:

eta = 0.1 # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2, 1) # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

That wasn’t too hard! Let’s look at the resulting theta:

>>> theta
array([[4.21509616],
       [2.77011339]])

This code implements Stochastic Gradient Descent using a simple learning schedule:

n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters

def learning_schedule(t):
   return t0 / (t + t1)

theta = np.random.randn(2, 1) # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

By convention we iterate by rounds of m iterations; each round is called an epoch. While the Batch Gradient Descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a fairly good solution:

>>> theta
array([[4.21076011],
       [2.74856079]])

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the preceding one), and it does not use any regularization (penalty=None; more details on this shortly):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
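Once fitted, the learned bias and weight can be inspected just as with LinearRegression; they should land close to the Normal Equation solution found earlier (exact values vary from run to run because of the stochastic updates, so none are shown here):

print(sgd_reg.intercept_, sgd_reg.coef_)  # compare with theta_best from the Normal Equation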

Let’s look at an example. First, let’s generate some nonlinear data, based on a simple quadratic equation (plus some noise; see Figure 4-12):

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)

Clearly, a straight line will never fit this data properly. So let’s use Scikit-Learn’s PolynomialFeatures class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as new features (in this case there is just one feature):

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])

X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data (Figure 4-13):

>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))
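To predict with the fitted polynomial model, new inputs have to be passed through the same PolynomialFeatures transform first. A minimal sketch (X_query is only an illustrative name and value):

X_query = np.array([[1.5]])                      # illustrative query point
X_query_poly = poly_features.transform(X_query)  # add the squared feature, as during training
lin_reg.predict(X_query_poly)                    # roughly 0.565 * 1.5**2 + 0.934 * 1.5 + 1.781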

Another way is to look at the learning curves: these are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration). To generate the plots, simply train the model several times on different sized subsets of the training set. The following code defines a function that plots the learning curves of a model given some training data:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")

Let’s look at the learning curves of the plain Linear Regression model (a straight line; Figure 4-15):

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

Now let’s look at the learning curves of a 10th-degree polynomial model on the same data (Figure 4-16):

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])

plot_learning_curves(polynomial_regression, X, y)

Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution (a variant of Equation 4-9 using a matrix factorization technique by André-Louis Cholesky):

>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[1.55071465]])

And using Stochastic Gradient Descent:

>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([1.13500145])

Suggestion : 4

Modifying your code quick and dirty, I get:

# Added n_iter here
params = [{}, {"loss": "log", "penalty": "l2", "n_iter": 1000}]
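Note that n_iter only exists in older scikit-learn releases; it has since been replaced by max_iter (together with tol), and the "log" loss has been renamed to "log_loss". A roughly equivalent parameter dictionary for current versions would be:

# Modern equivalent of the dictionary above (recent scikit-learn parameter names)
params_modern = [{}, {"loss": "log_loss", "penalty": "l2", "max_iter": 1000}]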

for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X[train_indices, :]; train_Y = Y[train_indices]
        test_X = X[test_indices, :]; test_Y = Y[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)

    accuracy = total / numFolds
    print("Accuracy score of {0}: {1}".format(Model.__name__, accuracy))

Accuracy score of LogisticRegression: 0.96
Accuracy score of SGDClassifier: 0.96