Randomized stratified k-fold cross-validation in scikit-learn?


This cross-validation object is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class, and the object provides train/test indices to split the data into train and test sets. Its shuffle parameter controls whether each stratification of the data is shuffled before splitting into batches.

>>> import numpy as np
>>> from sklearn import cross_validation
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2)
>>> len(skf)
2
>>> print(skf)
sklearn.cross_validation.StratifiedKFold(labels=[0 0 1 1], n_folds=2, shuffle=False, random_state=None)
>>> for train_index, test_index in skf:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
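
Note: the sklearn.cross_validation module used above was removed in scikit-learn 0.20; StratifiedKFold now lives in sklearn.model_selection, and its shuffle parameter gives exactly the randomized stratified folds the question asks about. A minimal sketch of the modern equivalent (the random_state value is arbitrary):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# shuffle=True randomizes the order of samples within each class before
# the folds are built; random_state makes the shuffling reproducible
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)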

Suggestion : 2

I thought I would post my solution in case it is useful to anyone else.

from collections import defaultdict
import random

import numpy as np

def strat_map(y):
    """Return permuted indices such that each position keeps its class label."""
    smap = defaultdict(list)
    for i, v in enumerate(y):
        smap[v].append(i)
    # shuffle the indices within each class
    for values in smap.values():
        random.shuffle(values)
    y_map = np.zeros_like(y)
    for i, v in enumerate(y):
        y_map[i] = smap[v].pop()
    return y_map

##########
# Example use
##########
skf = StratifiedKFold(y, nfolds)
sm = strat_map(y)
for test, train in skf:
    test, train = sm[test], sm[train]
    # then cv as usual

#########
# Tests
#########
import numpy.random as rnd

for _ in range(100):
    y = np.array([0] * 10 + [1] * 20 + [3] * 10)
    rnd.shuffle(y)
    sm = strat_map(y)
    shuffled = y[sm]
    assert (sm != np.arange(len(y))).any(), "did not shuffle"
    assert (shuffled == y).all(), "classes not in right position"
    assert set(sm) == set(range(len(y))), "missing indices"

for _ in range(100):
    nfolds = 10
    skf = StratifiedKFold(y, nfolds)
    sm = strat_map(y)
    for test, train in skf:
        assert (sm[test] != test).any(), "did not shuffle"
        assert (y[sm[test]] == y[test]).all(), "classes not in right position"
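
This strat_map workaround predates shuffle support in StratifiedKFold; on scikit-learn 0.20 or newer the shuffling is built in. A minimal sketch, assuming a feature matrix X to go with the labels y used above:

from sklearn.model_selection import StratifiedKFold

# shuffle=True randomizes within each class, replacing the strat_map trick
skf = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    pass  # then cv as usual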

Here is my implementation of a stratified shuffle split into training and testing sets:

import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generate indices making a random stratified split into training and
    testing sets, with proportions train_proportion and (1 - train_proportion)
    of the initial sample. y is any iterable indicating the class of each
    observation in the sample. The initial proportions of classes inside the
    training and test sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds

y = np.array([1, 1, 2, 2, 3, 3])
train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
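
For a one-off stratified train/test split like this, current scikit-learn also has the job built in: train_test_split accepts a stratify argument that preserves class proportions, matching what get_train_test_inds does. A minimal sketch (the toy X is hypothetical, just to have features to split):

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1, 1, 2, 2, 3, 3])
X = np.arange(len(y)).reshape(-1, 1)  # hypothetical feature matrix

# stratify=y preserves the class proportions in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
print(y_train)  # one sample of each class
print(y_test)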

Suggestion : 3

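The code block for this suggestion did not survive extraction; below is a minimal sketch that produces output of the shape shown, assuming logistic regression on scikit-learn's built-in breast-cancer dataset (569 samples, consistent with the fold sizes implied by the accuracies):

from statistics import mean, stdev
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer.data)  # scale features so the solver converges
y = cancer.target

skf = StratifiedKFold(n_splits=10, shuffle=True)
model = LogisticRegression()

# collect the out-of-fold accuracy for each of the 10 folds
lst_accu_stratified = []
for train_index, test_index in skf.split(X, y):
    model.fit(X[train_index], y[train_index])
    lst_accu_stratified.append(model.score(X[test_index], y[test_index]))

print('List of possible accuracy:', lst_accu_stratified)
print('Maximum Accuracy That can be obtained from this model is:',
      max(lst_accu_stratified) * 100, '%')
print('Minimum Accuracy That can be obtained from this model is:',
      min(lst_accu_stratified) * 100, '%')
print('The overall Accuracy of this model is:',
      mean(lst_accu_stratified) * 100, '%')
print('The Standard Deviation is:', stdev(lst_accu_stratified))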

Output:

List of possible accuracy: [0.9298245614035088, 0.9649122807017544, 0.9824561403508771, 1.0, 0.9649122807017544, 0.9649122807017544, 0.9824561403508771, 0.9473684210526315, 0.9473684210526315, 0.9821428571428571]
Maximum Accuracy That can be obtained from this model is: 100.0 %
Minimum Accuracy That can be obtained from this model is: 92.98245614035088 %
The overall Accuracy of this model is: 96.66353383458647 %
The Standard Deviation is: 0.02097789213195869

Suggestion : 4

Next, we will implement Stratified K-Fold cross-validation and analyze its importance on several parameters. The Python code below shows how one can use Stratified K-Fold cross-validation for a classification problem; once our classifier is trained on the different folds, we will check the performance of the model against the test data and see why the following metrics matter for classification problems. The F-score (F-measure or F1 score) is a measure of the test's accuracy, calculated as the weighted average of precision and recall; its value varies between 0 and 1, and the best value is 1. The confusion matrix is a performance measurement for ML models on classification problems where the output can be two or more classes: a table of the four combinations of predicted and actual values, extremely useful for measuring recall, precision, specificity, accuracy and, most importantly, curves like the ROC AUC.

Let’s start by importing all dependencies:

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

Load and take a look at the dataset:

dataset = pd.read_csv('/content/diabetes.csv')
dataset.head(10)

Here we use logistic regression with the newton-cg solver to avoid convergence issues, and separate the target variable from the dataset as below:

model = LogisticRegression(solver='newton-cg')
x = dataset
y = dataset.Outcome

Now try the complete code together:

dataset = pd.read_csv('/content/diabetes.csv')
skf = StratifiedKFold(n_splits=10)
model = LogisticRegression(solver='newton-cg')
x = dataset
y = dataset.Outcome

def training(train, test, fold_no):
    x_train = train.drop(['Outcome'], axis=1)
    y_train = train.Outcome
    x_test = test.drop(['Outcome'], axis=1)
    y_test = test.Outcome
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    print('For Fold {} the accuracy is {}'.format(str(fold_no), score))

fold_no = 1
for train_index, test_index in skf.split(x, y):
    train = dataset.iloc[train_index, :]
    test = dataset.iloc[test_index, :]
    training(train, test, fold_no)
    fold_no += 1
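
As an aside, the explicit fold loop above can be condensed with cross_val_score, which accepts the StratifiedKFold object directly. A sketch reusing the variables defined above (note that x still contains the Outcome column, so it is dropped here):

from sklearn.model_selection import cross_val_score

# cross_val_score fits a fresh clone of the model on each fold
scores = cross_val_score(model, x.drop(['Outcome'], axis=1), y, cv=skf)
print(scores, scores.mean())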

Let’s plot the confusion matrix for our model:

from sklearn.metrics import plot_confusion_matrix

# train/test from the last fold are still in scope after the loop above
plot_confusion_matrix(model, train.drop(['Outcome'], axis=1), train.Outcome)
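
plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2; on current versions the equivalent call is ConfusionMatrixDisplay.from_estimator. A sketch using the same last-fold variables as above:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, train.drop(['Outcome'], axis=1), train.Outcome)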

Suggestion : 5

This tutorial explains how to generate K folds for cross-validation with scikit-learn, to evaluate machine learning models on out-of-sample data using stratified sampling. With stratified sampling, the relative proportions of the classes in the overall dataset are maintained in each fold: scikit-learn's StratifiedKFold randomly samples data from each class into N folds (default of 5) that can be used to perform cross-validation during machine learning training. In this tutorial you will work with an OpenML dataset, predicting who pays for internet access, with 10108 observations and 69 columns.

from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import StratifiedKFold

data = fetch_openml(name='kdd_internet_usage')
df = data.frame
df.info()

target = 'Who_Pays_for_Access_Work'
y = df[target]
X = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
                            'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
                            'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])

skf = StratifiedKFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train = X.iloc[train_index, :]
    y_train = y[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y[test_index]
Train: [0 1 2...10105 10106 10107] Test: [6 12 17...10031 10066 10097]
Train: [0 1 2...10104 10106 10107] Test: [8 34 35...10090 10099 10105]
Train: [0 2 3...10105 10106 10107] Test: [1 30 31...10045 10057 10060]
Train: [0 1 2...10105 10106 10107] Test: [15 22 23...10080 10087 10092]
Train: [1 2 3...10105 10106 10107] Test: [0 4 9...10069 10076 10088]
Train: [0 1 2...10104 10105 10106] Test: [5 11 14...10089 10095 10107]
Train: [0 1 2...10105 10106 10107] Test: [18 28 36...10054 10094 10101]
Train: [0 1 2...10104 10105 10107] Test: [3 7 19...10096 10102 10106]
Train: [0 1 2...10105 10106 10107] Test: [10 41 54...10098 10100 10104]
Train: [0 1 3...10105 10106 10107] Test: [2 46 57...10067 10081 10103]
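
To verify the claim that the relative class proportions are maintained in each fold, one can compare per-fold label frequencies against the full dataset. A short sketch reusing skf, X, and y from above:

# compare the label distribution of each test fold to the overall one
overall = y.value_counts(normalize=True)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    fold = y.iloc[test_index].value_counts(normalize=True)
    print(f"fold {i}: max deviation from overall proportions = "
          f"{(fold - overall).abs().max():.4f}")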