Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality

Title and Description are both free text, so I am passing them through TfidfVectorizer. Customer name is a category, so for this I am using OneHotEncoder. I want these to work within a pipeline, joined with a ColumnTransformer, so that I can pass in an entire dataframe and have it be processed.

import pandas as pd
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

file = "train_data.csv"
train_data = pd.read_csv(file)

string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer())])

categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder())])

preprocessor = ColumnTransformer(transformers=[
    ('str', string_transformer, string_features),
    ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])

X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)

Suggestion : 2

I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description and Customer name to predict which team a ticket should be assigned to, but I have been stuck for the past few days. However, I get an error: all the input array dimensions except for the concatenation axis must match exactly. After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own returns an error: Expected 2D array, got 1D array instead.

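The dimensionality error comes from how the columns are handed to the transformers. As a minimal sketch of one way to make the pipeline fit (assuming the same column names as above; this is not the original answer's code): TfidfVectorizer expects a one-dimensional sequence of documents, so each text column gets its own vectorizer selected by a plain string, while OneHotEncoder is selected with a list of columns so it receives the 2D input it expects.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train_data = pd.read_csv("train_data.csv")

# A string selector ('Title') hands the transformer a 1D series of documents,
# which is what TfidfVectorizer expects; a list selector (['Customer']) hands
# OneHotEncoder a 2D frame.
preprocessor = ColumnTransformer(transformers=[
    ('title_tfidf', TfidfVectorizer(), 'Title'),
    ('desc_tfidf', TfidfVectorizer(), 'Description'),
    ('customer_ohe', OneHotEncoder(handle_unknown='ignore'), ['Customer']),
])

clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])

X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)

Selecting a column as 'Title' (a string) rather than ['Title'] (a list) is what switches ColumnTransformer between passing a 1D Series and a 2D frame, which is why applying a single TfidfVectorizer to the two-column selection fails.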

Suggestion : 3

In this chapter, we will demonstrate how to use the vectorization process to combine linguistic techniques from NLTK with machine learning techniques in Scikit-Learn and Gensim, creating custom transformers that can be used inside repeatable and reusable pipelines. By the end of this chapter, we will be ready to engage our preprocessed corpus, transforming documents to model space so that we can begin making predictions.

In this chapter, we conducted a whirlwind overview of vectorization techniques and began to consider their use cases for different kinds of data and different machine learning algorithms. In practice, it is best to select an encoding scheme based on the problem at hand; certain methods substantially outperform others for certain tasks.

Finally, we must add the Transformer interface, allowing us to add this class to a Scikit-Learn pipeline, which we'll explore in the next section.

Extending Scikit-Learn's BaseEstimator automatically gives the Estimator a fit_predict method, which allows you to combine fit and predict in one simple call.
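The custom transformer class the last two excerpts refer to is not reproduced in this summary. As a rough sketch of the pattern (the class name and body below are illustrative stand-ins, not the book's implementation), extending BaseEstimator and TransformerMixin and defining fit and transform is enough to drop a class into a Scikit-Learn Pipeline:

from sklearn.base import BaseEstimator, TransformerMixin

class TextNormalizer(BaseEstimator, TransformerMixin):
    # Hypothetical stand-in for the custom transformer discussed in the chapter.

    def fit(self, X, y=None):
        # Nothing to learn for a stateless normalizer; returning self lets the
        # transformer be chained inside a Pipeline.
        return self

    def transform(self, X):
        # Lowercase each document; TransformerMixin supplies fit_transform.
        return [doc.lower() for doc in X]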

The defaultdict object allows us to specify what the dictionary will return for a key that hasn’t been assigned to it yet. By setting defaultdict(int) we are specifying that a 0 should be returned, thus creating a simple counting dictionary. We can map this function to every item in the corpus using the last line of code, creating an iterable of vectorized documents.

from collections import defaultdict

def vectorize(doc):
    features = defaultdict(int)
    for token in tokenize(doc):
        features[token] += 1
    return features

vectors = map(vectorize, corpus)

The CountVectorizer transformer from the sklearn.feature_extraction module has its own internal tokenization and normalization methods. The fit method of the vectorizer expects an iterable or list of strings or file objects, and creates a dictionary of the vocabulary on the corpus. When transform is called, each individual document is transformed into a sparse array whose index tuple is the row (the document ID) and the token ID from the dictionary, and whose value is the count:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

Gensim’s frequency encoder is called doc2bow. To use doc2bow, we first create a Gensim Dictionary that maps tokens to indices based on observed order (eliminating the overhead of lexicographic sorting). The dictionary object can be loaded or saved to disk, and implements a doc2bow method that accepts a pretokenized document and returns a sparse representation of (id, count) tuples, where the id is the token's id in the dictionary. Because the doc2bow method only takes a single document instance, we use a list comprehension over the entire corpus, loading the tokenized documents into memory so we don't exhaust our generator:

import gensim

corpus = [tokenize(doc) for doc in corpus]
id2word = gensim.corpora.Dictionary(corpus)
vectors = [
   id2word.doc2bow(doc) for doc in corpus
]

The NLTK implementation of one-hot encoding is a dictionary whose keys are tokens and whose value is True:

def vectorize(doc):
   return {
      token: True
      for token in doc
   }

vectors = map(vectorize, corpus)

In Scikit-Learn, one-hot encoding is implemented with the Binarizer transformer in the preprocessing module. The Binarizer takes only numeric data, so the text data must be transformed into a numeric space using the CountVectorizer ahead of one-hot encoding. The Binarizer class uses a threshold value (0 by default) such that all values of the vector that are less than or equal to the threshold are set to zero, while those that are greater than the threshold are set to 1. Therefore, by default, the Binarizer converts all frequency values to 1 while maintaining the zero-valued frequencies.

from sklearn.preprocessing import Binarizer

freq = CountVectorizer()
corpus = freq.fit_transform(corpus)

onehot = Binarizer()
corpus = onehot.fit_transform(corpus.toarray())

While Gensim does not have a specific one-hot encoder, its doc2bow method returns a list of tuples that we can manage on the fly. Extending the code from the Gensim frequency vectorization example in the previous section, we can one-hot encode our vectors with our id2word dictionary. To get our vectors, an inner list comprehension converts the list of tuples returned from the doc2bow method into a list of (token_id, 1) tuples and the outer comprehension applies that converter to all documents in the corpus:

corpus = [tokenize(doc) for doc in corpus]
id2word = gensim.corpora.Dictionary(corpus)
vectors = [
   [(token[0], 1) for token in id2word.doc2bow(doc)]
   for doc in corpus
]

Suggestion : 4

Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality
How to vectorize a data frame with several text columns in scikit learn without losing track of the origin columns
How to use Scikit Learn dictvectorizer to get encoded dataframe from dense dataframe in Python?
Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)

i.e.

df2['predictions'] = df2['values'].apply(kmeans.predict)

Suggestion : 5

OneHotEncoder can transform multiple columns, returning a one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using VectorAssembler (a minimal sketch of this follows below).

Interaction is a Transformer which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column. For example, if you have 2 vector-type columns each of which has 3 dimensions as input columns, then you'll get a 9-dimensional vector as the output column.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features.
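A minimal PySpark sketch of the first pattern above (the column names and toy data are illustrative, and an active SparkSession named spark is assumed, as in the examples that follow): StringIndexer maps the categorical column to indices, OneHotEncoder expands the indices into vectors, and VectorAssembler merges the encoded vector with a numeric column into a single feature vector.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

df = spark.createDataFrame([
    ("cust_a", 1.0),
    ("cust_b", 0.0),
    ("cust_a", 3.0)
], ["customer", "amount"])

# Index the string category, one-hot encode the index, then merge the encoded
# vector with the numeric column into a single "features" vector column.
indexer = StringIndexer(inputCol="customer", outputCol="customer_idx")
encoder = OneHotEncoder(inputCols=["customer_idx"], outputCols=["customer_vec"])
assembler = VectorAssembler(inputCols=["customer_vec", "amount"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
model = pipeline.fit(df)
model.transform(df).select("customer", "features").show(truncate=False)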

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
   (0.0, "Hi I heard about Spark"),
   (0.0, "I wish Java could use case classes"),
   (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
   .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
  RowFactory.create(0.0, "Hi I heard about Spark"),
  RowFactory.create(0.0, "I wish Java could use case classes"),
  RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
  new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);

int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(numFeatures);

Dataset<Row> featurizedData = hashingTF.transform(wordsData);
// alternatively, CountVectorizer can also be used to get term frequency vectors

IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);

Dataset<Row> rescaledData = idfModel.transform(featurizedData);
rescaledData.select("label", "features").show();
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
   (0.0, "Hi I heard about Spark"),
   (0.0, "I wish Java could use case classes"),
   (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol = "sentence", outputCol = "words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol = "words", outputCol = "rawFeatures", numFeatures = 20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors

idf = IDF(inputCol = "rawFeatures", outputCol = "features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
   "Hi I heard about Spark".split(" "),
   "I wish Java could use case classes".split(" "),
   "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
   .setInputCol("text")
   .setOutputCol("result")
   .setVectorSize(3)
   .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach {
   case Row(text: Seq[_], features: Vector) =>
      println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

// Input data: Each row is a bag of words from a sentence or document.
List<Row> data = Arrays.asList(
  RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
  RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
  RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
  new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);

// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0);

Word2VecModel model = word2Vec.fit(documentDF);
Dataset<Row> result = model.transform(documentDF);

for (Row row : result.collectAsList()) {
  List<String> text = row.getList(0);
  Vector vector = (Vector) row.get(1);
  System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
}
from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
   ("Hi I heard about Spark".split(" "), ),
   ("I wish Java could use case classes".split(" "), ),
   ("Logistic regression models are neat".split(" "), )
], ["text"])

# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize = 3, minCount = 0, inputCol = "text", outputCol = "result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))