How to override sklearn's TSNE so it can be used in a pipeline?


Generally, for PCA, I would do the following:

make_pipeline(PCA(),
   LinearRegression())

However, when I tried this:

make_pipeline(TSNE(),
   LinearRegression())

I get an error saying that TSNE does not have a transform() method and that the fit_transform() method cannot be used in its place. So now I'm trying to add a custom transform() method like this:

class TSNE_wrapper(TSNE):
    def transform(self, X):
        return TSNE().fit_transform(X)
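
For reference, a minimal working sketch of that idea (the random data and the LinearRegression step are only illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

class TSNE_wrapper(TSNE):
    def transform(self, X):
        # t-SNE learns no parametric mapping, so the only option is to
        # re-run the embedding on whatever data transform() receives.
        return self.fit_transform(X)

X = np.random.rand(100, 10)
y = np.random.rand(100)

pipe = make_pipeline(TSNE_wrapper(n_components=2), LinearRegression())
pipe.fit(X, y)  # now passes the pipeline's construction-time check

Be aware that calling pipe.predict on new data re-embeds that data from scratch, independently of the training data, so the coordinates are not comparable to those the regressor was fit on; this is exactly why scikit-learn deliberately leaves transform() off TSNE.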

Suggestion : 2

Calling a pipeline with a nonparametric function causes an error since the function transform() is missing. The pipeline itself calls the function fit_transform() if it's present. For nonparametric functions (the most prominent being t-SNE) a regular transform() method does not exist, since there is no projection or mapping that is learned. It could still be used for dimensionality reduction.

In that case, what should the behavior be? Should it check the transform property of the transformer and raise a TypeError before trying to call the function here? Or should it warn at construction time that only fit_transform is available (or that this pipeline can only be called once)?

Oh, I didn't note that you made no attempt to run any methods on the pipeline. I agree that it should be possible to construct the pipeline and run fit_transform on it, even if it is a somewhat degenerate pipeline. Thanks for reporting!

It should be fine to fit, but raise an error when transform or predict etc. are called. It might make it a bit of a nuisance for people who try putting t-SNE in a pipeline and only get an error after fitting, so I wonder if it's worth warning in the case that an intermediate step does not support transform.

Example:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

make_pipeline(TSNE(), PCA())

Output:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough'
'TSNE(angle=0.5,...

Editing this https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/pipeline.py#L179 to the following would reflect the change (note the and: a step should be rejected only when it offers neither fit_transform nor the fit/transform pair):

if (not hasattr(t, "fit_transform")) and not (hasattr(t, "fit") and hasattr(t, "transform")):
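
For context, a simplified sketch of the surrounding validation loop (paraphrased, not the verbatim scikit-learn source):

# Paraphrased step validation: transformers here stands for every
# pipeline step except the final estimator. With the relaxed condition,
# a step offering only fit_transform (such as TSNE) is no longer
# rejected at construction time.
for t in transformers:
    if t is None or t == 'passthrough':
        continue
    if (not hasattr(t, "fit_transform")) and not (
            hasattr(t, "fit") and hasattr(t, "transform")):
        raise TypeError(
            "All intermediate steps should be transformers and "
            "implement fit and transform or be the string "
            "'passthrough'. '%s' doesn't." % (t,))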

Suggestion : 3

One very popular method for visualizing document similarity is t-distributed stochastic neighbor embedding, t-SNE. Scikit-learn implements this decomposition method as the sklearn.manifold.TSNE transformer. By decomposing high-dimensional document vectors into 2 dimensions using probability distributions from both the original dimensionality and the decomposed dimensionality, t-SNE is able to effectively cluster similar documents. By decomposing to 2 or 3 dimensions, the documents can be visualized with a scatter plot.

Yellowbrick's TSNEVisualizer wraps this process. It creates an internal transformer pipeline to project the data set into 2D space using TSNE, applying a pre-decomposition technique ahead of the embedding if necessary; this resets the transformer on the class and can be used to explore different decompositions. Called from the fit method, the draw step renders the t-SNE scatter plot from the decomposed points in 2 dimensions. It also accepts a target, which is used to specify the colors of the points; if the target is not specified, the points are plotted as a single cloud to show similar documents. You can also specify the number of components for the preliminary decomposition, 50 by default; the more components, the slower t-SNE will be.

from sklearn.feature_extraction.text import TfidfVectorizer

from yellowbrick.text import TSNEVisualizer
from yellowbrick.datasets import load_hobbies

# Load the data and create document vectors
corpus = load_hobbies()
tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus.data)
y = corpus.target

# Create the visualizer and draw the vectors
tsne = TSNEVisualizer()
tsne.fit(X, y)
tsne.show()

# Color the points by the corpus category labels instead
labels = corpus.labels
tsne = TSNEVisualizer(labels=labels)
tsne.fit(X, y)
tsne.show()
The same plot can be produced with the tsne quick method:

from yellowbrick.text.tsne import tsne
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.datasets import load_hobbies

# Load the data and create document vectors
corpus = load_hobbies()
tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus.data)
y = corpus.target

tsne(X, y)
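
If the embedding is slow, one knob worth trying is the preliminary decomposition mentioned above; a small sketch, assuming TSNEVisualizer's decompose_by parameter (whose documented default is 50):

from yellowbrick.text import TSNEVisualizer

# Reduce the preliminary SVD to 25 components (default 50) before
# t-SNE runs; fewer components means a faster embedding.
tsne = TSNEVisualizer(decompose_by=25)
tsne.fit(X, y)
tsne.show()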

Suggestion : 4

In this chapter, you'll learn about two unsupervised learning techniques for visualization: t-SNE and hierarchical clustering. Among the exercise instructions: import the make_pipeline function from sklearn.pipeline; after performing the hierarchical clustering of the Eurovision data, import the fcluster function; and import TruncatedSVD from sklearn.decomposition, KMeans from sklearn.cluster, and make_pipeline from sklearn.pipeline.

import pandas as pd
from pprint import pprint as pp
from itertools import combinations
from zipfile import ZipFile
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
from pathlib import Path
import requests
import sys

from scipy.sparse import csr_matrix
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.stats import pearsonr

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, Normalizer, normalize, MaxAbsScaler
from sklearn.pipeline import make_pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings

warnings.simplefilter(action="ignore", category=UserWarning)
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
plt.rcParams["patch.force_edgecolor"] = True


def create_dir_save_file(dir_path: Path, url: str):
    """
    Check if the path exists and create it if it does not.
    Check if the file exists and download it if it does not.
    """
    if not dir_path.parents[0].exists():
        dir_path.parents[0].mkdir(parents=True)
        print(f'Directory Created: {dir_path.parents[0]}')
    else:
        print('Directory Exists')

    if not dir_path.exists():
        r = requests.get(url, allow_redirects=True)
        open(dir_path, 'wb').write(r.content)
        print(f'File Created: {dir_path.name}')
    else:
        print('File Exists')


data_dir = Path('data/2021-03-29_unsupervised_learning_python')
images_dir = Path('Images/2021-03-29_unsupervised_learning_python')

# csv files
base = 'https://assets.datacamp.com/production/repositories/655/datasets'
file_spm = base + '/1304e66b1f9799e1a5eac046ef75cf57bb1dd630/company-stock-movements-2010-2015-incl.csv'
file_ev = base + '/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv'
file_fish = base + '/fee715f8cf2e7aad9308462fea5a26b791eb96c4/fish.csv'
file_lcd = base + '/effd1557b8146ab6e620a18d50c9ed82df990dce/lcd-digits.csv'
file_wine = base + '/2b27d4c4bdd65801a3b5c09442be3cb0beb9eae0/wine.csv'
file_artists_sparse = 'https://raw.githubusercontent.com/trenton3983/DataCamp/master/data/2021-03-29_unsupervised_learning_python/artists_sparse.csv'

# zip files
file_grain = base + '/bb87f0bee2ac131042a01307f7d7e3d4a38d21ec/Grains.zip'
file_musicians = base + '/c974f2f2c4834958cbe5d239557fbaf4547dc8a3/Musical%20artists.zip'
file_wiki = base + '/8e2fbb5b8240c06602336f2148f3c42e317d1fdb/Wikipedia%20articles.zip'
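
To show where these imports are headed, a minimal sketch (with a toy corpus of my own, not the course's actual exercise) of the TruncatedSVD + KMeans pipeline the chapter builds:

# Hypothetical toy corpus standing in for the course data
documents = ['cats and dogs', 'dogs chase cats',
             'stocks and bonds', 'bonds beat stocks']
tfidf = TfidfVectorizer()
X_docs = tfidf.fit_transform(documents)  # sparse tf-idf matrix

# Reduce the sparse vectors with TruncatedSVD (PCA cannot take sparse
# input), then cluster the reduced vectors with KMeans.
pipeline = make_pipeline(TruncatedSVD(n_components=2), KMeans(n_clusters=2))
pipeline.fit(X_docs)
print(pipeline.predict(X_docs))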

Suggestion : 5

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.

The details of the underlying mathematics can be found in: McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018. The important thing is that you don't need to worry about that: you can use UMAP right now for dimension reduction and visualisation as easily as a drop-in replacement for scikit-learn's t-SNE.

Conda install, via the excellent work of the conda-forge team:

conda install -c conda-forge umap-learn

PyPI install, presuming you have numba and sklearn and all its requirements (numpy and scipy) installed:

pip install umap-learn

If you wish to use the plotting functionality, you can use:

pip install umap-learn[plot]

If pip is having difficulties pulling the dependencies then we’d suggest installing the dependencies manually using anaconda followed by pulling umap from pip:

conda install numpy scipy
conda install scikit-learn
conda install numba
pip install umap-learn

For a manual install get this package:

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master

The umap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)
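
Because UMAP implements a genuine transform(), it also sidesteps the pipeline problem from the original question; a sketch (the LogisticRegression step is my illustrative choice):

import umap
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

digits = load_digits()

# Unlike TSNE, a fitted UMAP can embed unseen points, so it satisfies
# the fit/transform contract that make_pipeline checks for.
pipe = make_pipeline(umap.UMAP(n_components=2),
                     LogisticRegression(max_iter=1000))
pipe.fit(digits.data, digits.target)
print(pipe.predict(digits.data[:5]))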

An example of making use of the main options (n_neighbors, min_dist, and metric):

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(n_neighbors=5,
                      min_dist=0.3,
                      metric='correlation').fit_transform(digits.data)

UMAP includes a subpackage umap.plot for plotting the results of UMAP embeddings. This package needs to be imported separately since it has extra requirements (matplotlib, datashader and holoviews). It allows for fast and simple plotting and attempts to make sensible decisions to avoid overplotting and other pitfalls. An example of use:

import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

mapper = umap.UMAP().fit(digits.data)
umap.plot.points(mapper, labels=digits.target)

The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure captured by UMAP. One can easily run densMAP using the umap package by setting the densmap input flag:

embedding = umap.UMAP(densmap=True).fit_transform(data)

An example of making use of these options (based on a subsample of the mnist_784 dataset):

import umap
from sklearn.datasets import fetch_openml
from sklearn.utils import resample

digits = fetch_openml(name='mnist_784')
subsample, subsample_labels = resample(digits.data, digits.target, n_samples=7000,
                                       stratify=digits.target, random_state=1)

embedding, r_orig, r_emb = umap.UMAP(densmap=True, dens_lambda=2.0, n_neighbors=30,
                                     output_dens=True).fit_transform(subsample)
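
With output_dens=True, fit_transform returns the local radii in the original space (r_orig) and in the embedding (r_emb) alongside the coordinates, which is why three values are unpacked above.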

If you make use of this software for your work we would appreciate it if you would cite the paper from the Journal of Open Source Software:

@article{mcinnes2018umap-software,
  title   = {UMAP: Uniform Manifold Approximation and Projection},
  author  = {McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal = {The Journal of Open Source Software},
  volume  = {3},
  number  = {29},
  pages   = {861},
  year    = {2018}
}

If you would like to cite this algorithm in your work the ArXiv paper is the current reference:

@article{2018arXivUMAP,
  author        = {{McInnes}, L. and {Healy}, J. and {Melville}, J.},
  title         = "{UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1802.03426},
  primaryClass  = "stat.ML",
  keywords      = {Statistics - Machine Learning, Computer Science - Computational Geometry, Computer Science - Learning},
  year          = 2018,
  month         = feb,
}

Additionally, if you use the densMAP algorithm in your work please cite the following reference:

@article{NBC2020,
  author    = {Narayan, Ashwin and Berger, Bonnie and Cho, Hyunghoon},
  title     = {Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability},
  journal   = {bioRxiv},
  year      = {2020},
  doi       = {10.1101/2020.05.12.077776},
  publisher = {Cold Spring Harbor Laboratory},
  URL       = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776},
  eprint    = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776.full.pdf}
}

Suggestion : 6

We'll take a look at two very simple machine learning tasks here. The first is a classification task: the figure shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm may be used to draw a dividing boundary between the two clusters of points.

sklearn.manifold.TSNE separates the different classes of digits quite well, even though it had no access to the class information (see the sketch below).

Machine Learning can be considered a subfield of Artificial Intelligence, since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing, rather than just storing and retrieving data items like a database system would do.

There exist many different cross-validation strategies in scikit-learn. They are often useful to take into account non-IID datasets.
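
A minimal sketch of the digits embedding mentioned above (n_components and random_state are my illustrative choices):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# Embed the 64-dimensional digit images into 2D. Similar digits land
# close together even though digits.target is never passed to TSNE.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(digits.data)
print(X_2d.shape)  # (1797, 2)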

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> print(iris.data.shape)
(150, 4)
>>> n_samples, n_features = iris.data.shape
>>> print(n_samples)
150
>>> print(n_features)
4
>>> print(iris.data[0])
[5.1 3.5 1.4 0.2]
>>> print(iris.target.shape)
(150,)
>>> print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression(n_jobs=1, normalize=True)
>>> print(model.normalize)
True
>>> print(model)
LinearRegression(n_jobs=1, normalize=True)
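
Note that the normalize parameter shown in this older example has since been deprecated and removed from LinearRegression; on current scikit-learn releases, scale the features explicitly instead, e.g. make_pipeline(StandardScaler(), LinearRegression()).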