NMF as a clustering method in Python scikit-learn


Code of main script below:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
from preprocess import *

# loading data
raw_text_data = loading_bbc_datasets(5)

# preprocessing and TF-IDF vectorization
text_data = text_preparing(raw_text_data)
tf_vectorizer = TfidfVectorizer()
Y = tf_vectorizer.fit_transform(text_data)
Y_norm = normalize(Y)

# NMF factorization: Y_norm ~ A (documents x components) * X (components x terms)
# Note: in scikit-learn >= 1.2 the `alpha` parameter is replaced by `alpha_W` / `alpha_H`,
# and `get_feature_names()` by `get_feature_names_out()`.
nmf = NMF(n_components=5, random_state=1, alpha=.1, l1_ratio=0.5)
A = nmf.fit_transform(Y_norm)
X = nmf.components_
features = tf_vectorizer.get_feature_names()
print(features)

# export the TF-IDF matrix and both NMF factors to CSV
AF = pd.DataFrame(Y_norm.toarray())
WF = pd.DataFrame(A)
HF = pd.DataFrame(X)

AF.to_csv('Y.csv', sep=',', header=features)
WF.to_csv('A.csv', sep=',', header=['C1', 'C2', 'C3', 'C4', 'C5'])
HF.to_csv('X.csv', sep=',', header=features)

Suggestion : 2

The question: I am working on implementing a Python script for NMF text-data clustering. I am using the scikit-learn NMF implementation, but as I understand it, NMF in scikit-learn looks more like a classification method than a clustering method.

The answer: NMF is not a classification method, it is a dimensionality-reduction method. When you process your texts with CountVectorizer (or TfidfVectorizer), you get a very high-dimensional representation, and NMF reduces it by factorizing the document-term matrix into W (documents × components) and H (components × terms). To answer the question: you can cluster the documents into topics using W and represent each topic in a human-friendly manner by giving its most related words. Similarly, for a given topic, you can get the words highly related to it using H.
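
Building on the script above, here is a minimal sketch (not part of the original answer) of how to read cluster assignments out of A (the W factor) and the most related words per topic out of X (the H factor):

import numpy as np

# assign each document to the component with the largest weight in its row of A
cluster_labels = np.argmax(A, axis=1)
print(cluster_labels[:20])

# for each topic, print the 10 words with the largest weights in the corresponding row of X
n_top_words = 10
for topic_idx, topic in enumerate(X):
   top_indices = topic.argsort()[::-1][:n_top_words]
   print('Topic %d: %s' % (topic_idx, ', '.join(features[i] for i in top_indices)))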

Suggestion : 3

Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction; see the scikit-learn example "Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation".

References:

  • Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, 23(9).
  • Cichocki, Andrzej, and P. H. A. N. Anh-Huy. "Fast local algorithms for large scale nonnegative matrix and tensor factorizations." IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92.3: 708-721, 2009.

The objective function is:

0.5 * ||X - WH||_Fro^2
   + alpha * l1_ratio * ||vec(W)||_1
   + alpha * l1_ratio * ||vec(H)||_1
   + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
   + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2

Where:

||A||_Fro^2 = sum_{i,j} A_{ij}^2        (Frobenius norm)
||vec(A)||_1 = sum_{i,j} abs(A_{ij})    (elementwise L1 norm)
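
As a worked illustration (not part of the scikit-learn documentation), the regularized objective above can be evaluated directly for given X, W, and H:

import numpy as np

def nmf_objective(X, W, H, alpha=0.1, l1_ratio=0.5):
   # 0.5 * ||X - WH||_Fro^2 : reconstruction error
   residual = X - W @ H
   value = 0.5 * np.sum(residual ** 2)
   # alpha * l1_ratio * (||vec(W)||_1 + ||vec(H)||_1) : elementwise L1 penalty
   value += alpha * l1_ratio * (np.abs(W).sum() + np.abs(H).sum())
   # 0.5 * alpha * (1 - l1_ratio) * (||W||_Fro^2 + ||H||_Fro^2) : Frobenius penalty
   value += 0.5 * alpha * (1 - l1_ratio) * (np.sum(W ** 2) + np.sum(H ** 2))
   return value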

Examples

>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import NMF
>>> model = NMF(n_components=2, init='random', random_state=0)
>>> W = model.fit_transform(X)
>>> H = model.components_
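
The product of the two factors approximates the input: W @ H is close to X, and the fitted model also stores the reconstruction error. A quick check (not part of the docstring excerpt above):

>>> np.linalg.norm(X - W @ H)   # small residual
>>> model.reconstruction_err_   # reconstruction error stored by the fitted model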

Suggestion : 4


TF-IDF is calculated by multiplying term frequency (TF) and inverse document frequency (IDF):

TF-IDF = TF * IDF

For example, if a term appears 10 times in a document of 100 words, and 30 of the 200 documents in the corpus contain that term, the formula gives:

TF-IDF = (10 / 100) * log(200 / 30)
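
A minimal sketch of this arithmetic in Python (the counts 10, 100, 200, and 30 come from the example above; the natural logarithm is assumed, as in scikit-learn's TfidfTransformer):

import math

tf = 10 / 100             # the term appears 10 times in a 100-word document
idf = math.log(200 / 30)  # 30 of the 200 documents contain the term

tfidf = tf * idf
print(round(tfidf, 3))    # 0.19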

Here is how to query the Wikipedia API using the Python requests library.

import requests

main_subject = 'Machine learning'

url = 'https://en.wikipedia.org/w/api.php'
params = {
   'action': 'query',
   'format': 'json',
   'generator': 'links',
   'titles': main_subject,
   'prop': 'pageprops',
   'ppprop': 'wikibase_item',
   'gpllimit': 1000,
   'redirects': 1
}

r = requests.get(url, params = params)
r_json = r.json()
linked_pages = r_json['query']['pages']

page_titles = [p['title'] for p in linked_pages.values()]

Note that you may need to install the tqdm and lxml packages for the next snippet, which downloads and parses the text of each linked page.

import requests
from lxml import html
from tqdm.notebook import tqdm

# 'pages' is the list of page titles collected in the previous step
pages = page_titles

text_db = []
for page in tqdm(pages):
   # fetch the parsed HTML of the page from the MediaWiki API
   response = requests.get(
      'https://en.wikipedia.org/w/api.php',
      params={
         'action': 'parse',
         'page': page,
         'format': 'json',
         'prop': 'text',
         'redirects': ''
      }
   ).json()

   # keep only the plain text of the paragraph elements
   raw_html = response['parse']['text']['*']
   document = html.document_fromstring(raw_html)
   text = ''
   for p in document.xpath('//p'):
      text += p.text_content()
   text_db.append(text)
print('Done')

These queries return a list, text_db, in which each element represents the text of the corresponding Wikipedia page.

# Print number of articles
print('Number of articles extracted: ', len(text_db))
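
From here, the articles in text_db can be fed into the same TF-IDF + NMF pipeline discussed above; a hedged sketch (the parameter choices are illustrative, not from the original post):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# TF-IDF representation of the downloaded Wikipedia articles
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf = vectorizer.fit_transform(text_db)

# factorize into 5 topics and assign each article to its strongest one
nmf = NMF(n_components=5, random_state=1)
W = nmf.fit_transform(tfidf)
topic_of_article = np.argmax(W, axis=1)
print(topic_of_article[:10])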