How to add an oversampling/undersampling procedure to scikit-learn's pipeline?

Suggestion : 1

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

plt.style.use('ggplot')
np.random.seed(37)

X, y = make_classification(**{
    'n_samples': 5000,
    'n_features': 5,
    'n_classes': 2,
    'random_state': 37
})

columns = [f'x{i}' for i in range(X.shape[1])] + ['y']
df = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]), columns=columns)

print(df.shape)
(5000, 6)
[2]:
df.head()
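The snippet above only builds and inspects the dataset; the resampling step still has to go into a pipeline. A minimal sketch of how that could look on this DataFrame, assuming imbalanced-learn is installed (RandomOverSampler and LogisticRegression are my choices here, not part of the original answer):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# sklearn.pipeline.Pipeline rejects sampler steps; imblearn's Pipeline accepts
# them and applies the resampling during fit() only, never at predict() time
pipe = Pipeline([
    ('oversample', RandomOverSampler(random_state=37)),
    ('clf', LogisticRegression()),
])
pipe.fit(df[columns[:-1]], df['y'])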

Suggestion : 2

Let’s first create an imbalanced dataset and split it into two sets. We can then create a pipeline that specifies the order in which the different transformers and samplers should be executed before the data is handed to the final classifier.

# Authors: Christos Aridas
#          Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_classes=2,
    class_sep=1.25,
    weights=[0.3, 0.7],
    n_informative=3,
    n_redundant=1,
    flip_y=0,
    n_features=5,
    n_clusters_per_class=1,
    n_samples=5000,
    random_state=10,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE

pca = PCA(n_components=2)
enn = EditedNearestNeighbours()
smote = SMOTE(random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)

from imblearn.pipeline import make_pipeline

model = make_pipeline(pca, enn, smote, knn)

from sklearn.metrics import classification_report

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
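Because the samplers in imbalanced-learn's pipeline are applied only during fitting, the pipeline can be passed straight to scikit-learn's model-selection utilities. A small sketch (my addition, not part of the original answer), reusing the model defined above:

from sklearn.model_selection import cross_val_score

# each training fold is resampled internally; the validation fold stays untouched
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='balanced_accuracy')
print(scores.mean())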

Suggestion : 3

We’re going to use imbalanced-learn, an extension of the scikit-learn API that lets us resample. It extends scikit-learn with sampler objects: a sampler has a fit_resample method (sample in older versions) that returns the resampled data and the resampled targets. The samplers most commonly used in practice are random oversampling and random undersampling; undersampling in particular is great because it makes everything much faster. Random oversampling is the complement of random undersampling: we resample the training dataset so that the minority class ends up with the same number of samples as the majority class.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)

lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.91      0.92      0.92        53
           1       0.96      0.94      0.95        90

 avg / total       0.94      0.94      0.94       143
y_pred = lr.predict_proba(X_test)[:, 1] > 0.85

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      1.00      0.91        53
           1       1.00      0.89      0.94        90

 avg / total       0.94      0.93      0.93       143
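The text above describes random oversampling, but the snippet never actually performs it. A minimal sketch on the same breast-cancer split, assuming imbalanced-learn is installed:

from imblearn.over_sampling import RandomOverSampler

# duplicate minority-class samples until both classes have equal counts
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

lr = LogisticRegression(max_iter=5000).fit(X_resampled, y_resampled)
print(classification_report(y_test, lr.predict(X_test)))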
from sklearn.datasets import fetch_openml

# mammography dataset: https://www.openml.org/d/310
data = fetch_openml('mammography', as_frame=True)
X, y = data.data, data.target
X.shape
y.value_counts()
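To connect this back to the original question, here is a sketch of random undersampling inside a pipeline on this mammography data; RandomUnderSampler and LogisticRegression are my choices, not from the original text:

from imblearn.pipeline import make_pipeline as make_imb_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# the majority class is subsampled during fit() only; the test set is untouched
undersample_pipe = make_imb_pipeline(
    RandomUnderSampler(random_state=0),
    LogisticRegression(max_iter=1000))
undersample_pipe.fit(X_train, y_train)
print(classification_report(y_test, undersample_pipe.predict(X_test)))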