Logistic regression and GridSearchCV using Python sklearn

Suggestion : 1

You end up with the precision error because some of the penalization values are too strong for this model; if you check the results, you get an F1 score of 0 when C = 0.001 and C = 0.01.
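The checks below assume a fitted GridSearchCV object named logreg_cv trained on TF-IDF features (X_train_vectors_tfidf, y_train) with F1 scoring and 3 folds. A minimal sketch of how such an object might be set up, using a hypothetical toy corpus in place of the original data (the grid, scoring and variable names are assumptions, not the original code):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus standing in for the question's training data
texts = ["good product", "great quality", "bad service", "terrible item",
   "good value", "awful experience", "great support", "bad quality"]
y_train = np.array([1, 1, 0, 0, 1, 0, 1, 0])
X_train_vectors_tfidf = TfidfVectorizer().fit_transform(texts)

# Grid over C and the l2 penalty, scored with F1 over 3 folds (assumed values)
param_grid = {
   'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
   'penalty': ['l2']
}
logreg_cv = GridSearchCV(LogisticRegression(solver = 'lbfgs', max_iter = 1000),
   param_grid, scoring = 'f1', cv = 3)
logreg_cv.fit(X_train_vectors_tfidf, y_train)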

# keep only the per-split test scores and the parameter combinations
res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:, res.columns.str.contains("split[0-9]_test_score|params", regex = True)]

   params                          split0_test_score  split1_test_score  split2_test_score
0  {'C': 0.001, 'penalty': 'l2'}            0.000000           0.000000           0.000000
1  {'C': 0.01, 'penalty': 'l2'}             0.000000           0.000000           0.000000
2  {'C': 0.1, 'penalty': 'l2'}              0.973568           0.952607           0.952174
3  {'C': 1.0, 'penalty': 'l2'}              0.863934           0.851064           0.836449
4  {'C': 10.0, 'penalty': 'l2'}             0.811634           0.769547           0.787838
5  {'C': 100.0, 'penalty': 'l2'}            0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871

You can check this:

# with C = 0.01 the heavily regularized model predicts only class 0
lr = LogisticRegression(C = 0.01).fit(X_train_vectors_tfidf, y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])

And the predicted probabilities collapse towards the intercept-only baseline:

# expected probability of the positive class implied by the intercept alone
np.exp(lr.intercept_) / (1 + np.exp(lr.intercept_))
array([0.41764462])

lr.predict_proba(X_train_vectors_tfidf)

array([
   [0.58732636, 0.41267364],
   [0.57074279, 0.42925721],
   [0.57219143, 0.42780857],
   ...,
   [0.57215605, 0.42784395],
   [0.56988186, 0.43011814],
   [0.58966184, 0.41033816]
])
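If you want to keep the strongly penalized values of C in the grid without triggering the undefined-metric warnings, one option (an assumption about your setup, not part of the original answer) is to build the F1 scorer with an explicit zero_division, so folds where a class is never predicted simply score 0. A minimal sketch, reusing param_grid and the TF-IDF features from the sketch above:

from sklearn.metrics import f1_score, make_scorer

# F1 scorer that returns 0 instead of warning when a class is never predicted
f1_no_warn = make_scorer(f1_score, zero_division = 0)
logreg_cv = GridSearchCV(LogisticRegression(solver = 'lbfgs', max_iter = 1000),
   param_grid, scoring = f1_no_warn, cv = 3)
logreg_cv.fit(X_train_vectors_tfidf, y_train)

The scores themselves do not change; the degenerate C values still get an F1 of 0 and the stronger C values still win, but the warnings are no longer raised.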

Suggestion : 2

This recipe shows how to optimize the hyperparameters of a Logistic Regression model using grid search in Python. It is a short example of how to use GridSearchCV to get the best set of hyperparameters. The estimator argument is where you pass the model (or pipeline) on which you want to run GridSearchCV.

Make a GridSearchCV object clf_GS and fit it on the dataset, i.e. X and y:

clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(X, y)

Printing the results then gives the values of the best hyperparameters. As output we get the following (a sketch of what pipe and parameters might look like follows the output below):

Best Penalty: l1
Best C: 109.85411419875572
Best Number Of Components: 13

LogisticRegression(C = 109.85411419875572, class_weight = None, dual = False,
   fit_intercept = True, intercept_scaling = 1, max_iter = 100,
   multi_class = 'warn', n_jobs = None, penalty = 'l1', random_state = None,
   solver = 'warn', tol = 0.0001, verbose = 0, warm_start = False)
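The recipe prints the best penalty, C, and number of components, so pipe is presumably a pipeline of StandardScaler, PCA, and LogisticRegression searched over those three parameters. A minimal sketch of what pipe and parameters might look like; the dataset, step names, and grid ranges here are assumptions, not the recipe's exact code:

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset; the recipe's own data is not shown here
X, y = datasets.load_wine(return_X_y = True)

# Pipeline: scale, reduce dimensionality with PCA, then logistic regression.
# 'liblinear' is chosen because it supports both the 'l1' and 'l2' penalties.
pipe = Pipeline([
   ('std', StandardScaler()),
   ('pca', PCA()),
   ('logistic', LogisticRegression(solver = 'liblinear', max_iter = 1000))
])

# Grid over the number of PCA components, the penalty type and the C values
parameters = {
   'pca__n_components': list(range(1, X.shape[1] + 1)),
   'logistic__penalty': ['l1', 'l2'],
   'logistic__C': np.logspace(-4, 4, 50)
}

clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(X, y)

print('Best Penalty:', clf_GS.best_params_['logistic__penalty'])
print('Best C:', clf_GS.best_params_['logistic__C'])
print('Best Number Of Components:', clf_GS.best_params_['pca__n_components'])
print(clf_GS.best_estimator_.named_steps['logistic'])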

Suggestion : 3

In this section, you will see Python sklearn code examples of grid search applied to different estimators such as RandomForestClassifier, LogisticRegression and SVC. Pay attention to some of the following in the code given below:

LogisticRegression (logistic regression): grid search is applied to select the most appropriate value of the inverse regularization parameter, C. For this case, you could as well have used validation_curve (sklearn.model_selection) to select the most appropriate value of C; a sketch of that alternative appears after the code below.

SVC (support vector classifier): grid search is applied to select the most appropriate parameters such as the kernel (linear, rbf) and the values of gamma and C.

RandomForestClassifier (random forest): grid search is applied to RandomForestClassifier to select the most appropriate values of hyperparameters such as max_depth and max_features.

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1, stratify = y)
pipelineRFC = make_pipeline(StandardScaler(), RandomForestClassifier(criterion = 'gini', random_state = 1))
#
# Create the parameter grid
#
param_grid_rfc = [{
   'randomforestclassifier__max_depth': [2, 3, 4],
   'randomforestclassifier__max_features': [2, 3, 4, 5, 6]
}]
#
# Create an instance of the GridSearchCV cross-validation estimator
#
gsRFC = GridSearchCV(estimator = pipelineRFC,
   param_grid = param_grid_rfc,
   scoring = 'accuracy',
   cv = 10,
   refit = True,
   n_jobs = 1)
#
# Train the RandomForestClassifier
#
gsRFC = gsRFC.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsRFC.best_score_)
#
# Print the model parameters of the best model
#
print(gsRFC.best_params_)
#
# Print the test score of the best model
#
clfRFC = gsRFC.best_estimator_
print('Test accuracy: %.3f' % clfRFC.score(X_test, y_test))
pipelineSVC = make_pipeline(StandardScaler(), SVC(random_state = 1))
#
# Create the parameter grid
#
param_grid_svc = [{
      'svc__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
      'svc__kernel': ['linear']
   },
   {
      'svc__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
      'svc__gamma': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
      'svc__kernel': ['rbf']
   }
]
#
# Create an instance of the GridSearchCV cross-validation estimator
#
gsSVC = GridSearchCV(estimator = pipelineSVC,
   param_grid = param_grid_svc,
   scoring = 'accuracy',
   cv = 10,
   refit = True,
   n_jobs = 1)
#
# Train the SVM classifier
#
gsSVC.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsSVC.best_score_)
#
# Print the model parameters of the best model
#
print(gsSVC.best_params_)
#
# Print the model score on the test data using GridSearchCV score method
#
print('Test accuracy: %.3f' % gsSVC.score(X_test, y_test))
#
# Print the model score on the test data using Best estimator instance
#
clfSVC = gsSVC.best_estimator_
print('Test accuracy: %.3f' % clfSVC.score(X_test, y_test))
pipelineLR = make_pipeline(StandardScaler(), LogisticRegression(random_state = 1, penalty = 'l2', solver = 'lbfgs'))
#
# Create the parameter grid
#
param_grid_lr = [{
   'logisticregression__C': [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0],
}]
#
# Create an instance of the GridSearchCV cross-validation estimator
#
gsLR = GridSearchCV(estimator = pipelineLR,
   param_grid = param_grid_lr,
   scoring = 'accuracy',
   cv = 10,
   refit = True,
   n_jobs = 1)
#
# Train the LogisticRegression Classifier
#
gsLR = gsLR.fit(X_train, y_train)
#
# Print the training score of the best model
#
print(gsLR.best_score_)
#
# Print the model parameters of the best model
#
print(gsLR.best_params_)
#
# Print the test score of the best model
#
clfLR = gsLR.best_estimator_
print('Test accuracy: %.3f' % clfLR.score(X_test, y_test))
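As noted above, for the single LogisticRegression hyperparameter C you could also have used validation_curve instead of GridSearchCV. A minimal sketch, reusing pipelineLR, X_train and y_train from the code above together with the same C range; the exact usage shown here is an illustration, not the only way to do it:

import numpy as np
from sklearn.model_selection import validation_curve
#
# Evaluate the pipeline for each candidate C with 10-fold cross-validation
#
param_range = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0]
train_scores, test_scores = validation_curve(pipelineLR, X_train, y_train,
   param_name = 'logisticregression__C',
   param_range = param_range,
   cv = 10,
   scoring = 'accuracy')
#
# Pick the C with the highest mean cross-validated accuracy
#
mean_test_scores = test_scores.mean(axis = 1)
best_C = param_range[int(np.argmax(mean_test_scores))]
print('Best C: %s' % best_C)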