Code of the main script below:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
from preprocess import *

# loading data
raw_text_data = loading_bbc_datasets(5)
text_data = text_preparing(raw_text_data)

tf_vectorizer = TfidfVectorizer()
Y = tf_vectorizer.fit_transform(text_data)
Y_norm = normalize(Y)

nmf = NMF(n_components=5, random_state=1, alpha=.1, l1_ratio=0.5)
A = nmf.fit_transform(Y_norm)
X = nmf.components_
features = tf_vectorizer.get_feature_names()
print(features)

AF = pd.DataFrame(Y_norm.toarray())
WF = pd.DataFrame(A)
HF = pd.DataFrame(X)
AF.to_csv('Y.csv', sep=',', header=features)
WF.to_csv('A.csv', sep=',', header=['C1', 'C2', 'C3', 'C4', 'C5'])
HF.to_csv('X.csv', sep=',', header=features)
I am working on implementing a Python script for NMF text data clustering. In my work I am using the Scikit NMF implementation; however, as I understand it, in Scikit NMF is more of a classification method than a clustering method.

NMF is not a classification method, it is a dimensionality reduction method. When you process your texts with CountVectorizer, you have a high number of dimensions and NMF allows you to reduce it. To answer your question, you can cluster the documents into topics and represent each topic in a human-friendly manner by giving its most related words. Similarly, for a given topic, you can get the words highly related to it using H.
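As a minimal sketch of that idea, using the variable names from the script above (A holds the document-topic weights W, X holds the topic-word matrix H, and features is the vocabulary), each document can be assigned to its strongest topic with an argmax over A, and each topic can be summarized by its highest-weighted words in X. The choice of 10 top words is arbitrary.

import numpy as np

# Assign each document to the topic with the largest weight in A (the W matrix)
doc_topics = np.argmax(A, axis=1)

# For each topic, list the 10 words with the highest weights in X (the H matrix)
n_top_words = 10
for topic_idx, topic in enumerate(X):
    top_word_ids = topic.argsort()[::-1][:n_top_words]
    top_words = [features[i] for i in top_word_ids]
    print(f'Topic {topic_idx}: ' + ', '.join(top_words))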
Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation or topic extraction.

See also: Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation.

References:
Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, 23(9).
Cichocki, Andrzej, and P. H. A. N. Anh-Huy. "Fast local algorithms for large scale nonnegative matrix and tensor factorizations." IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 92.3 (2009): 708-721.
The objective function is:
0.5 * ||X - WH||_Fro^2
+ alpha * l1_ratio * ||vec(W)||_1
+ alpha * l1_ratio * ||vec(H)||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
+ 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2
Where:
||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm)
||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (elementwise L1 norm)
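As a rough NumPy sketch of this objective (assuming dense arrays W and H, and the same alpha and l1_ratio as in the formula above):

import numpy as np

def nmf_objective(X, W, H, alpha=0.1, l1_ratio=0.5):
    # 0.5 * ||X - WH||_Fro^2  (reconstruction error)
    residual = X - W @ H
    loss = 0.5 * np.sum(residual ** 2)
    # alpha * l1_ratio * (||vec(W)||_1 + ||vec(H)||_1)  (L1 penalty)
    loss += alpha * l1_ratio * (np.abs(W).sum() + np.abs(H).sum())
    # 0.5 * alpha * (1 - l1_ratio) * (||W||_Fro^2 + ||H||_Fro^2)  (L2 penalty)
    loss += 0.5 * alpha * (1 - l1_ratio) * (np.sum(W ** 2) + np.sum(H ** 2))
    return loss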
Examples
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import NMF
>>> model = NMF(n_components=2, init='random', random_state=0)
>>> W = model.fit_transform(X)
>>> H = model.components_
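As a small follow-up (not part of the original docstring example), the product of the two factors gives a non-negative approximation of X:

>>> X_approx = np.dot(W, H)  # reconstructs an approximation of the original matrix X
>>> X_approx.shape
(6, 2)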
TF-IDF is calculated by multiplying term frequency and inverse document frequency.

TF-IDF = TF * IDF

For example, if a term appears 10 times in a 100-word document and occurs in 30 of the 200 documents in the corpus, the formula would be:

TF-IDF = (10 / 100) * log(200 / 30)
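A quick way to check this arithmetic in Python (the counts 10, 100, 30 and 200 come from the example above; the natural logarithm is assumed, since no base is specified):

import math

tf = 10 / 100             # term frequency: term count / document length
idf = math.log(200 / 30)  # inverse document frequency: log(total docs / docs containing the term)
tfidf = tf * idf
print(tfidf)              # ~0.19 with the natural logarithm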
Here is how to query the Wikipedia API using the Python requests library.
import requests

main_subject = 'Machine learning'

url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'generator': 'links',
    'titles': main_subject,
    'prop': 'pageprops',
    'ppprop': 'wikibase_item',
    'gpllimit': 1000,
    'redirects': 1
}

r = requests.get(url, params=params)
r_json = r.json()

# Collect the titles of the pages linked from the main subject
linked_pages = r_json['query']['pages']
page_titles = [p['title'] for p in linked_pages.values()]
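As a quick sanity check (a hypothetical follow-up, not part of the original snippet), you can look at what the query returned before fetching page contents:

# Inspect the number of linked pages and a few of their titles
print(f'{len(page_titles)} linked pages found for "{main_subject}"')
print(page_titles[:5])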
Note that you may need to install the tqdm and lxml packages to run the next snippet.
import requests
from lxml import html
from tqdm.notebook import tqdm

text_db = []
# Iterate over the page titles collected above and fetch the rendered HTML of each page
for page in tqdm(page_titles):
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'parse',
            'page': page,
            'format': 'json',
            'prop': 'text',
            'redirects': ''
        }
    ).json()
    raw_html = response['parse']['text']['*']
    document = html.document_fromstring(raw_html)
    # Keep only the text content of the <p> elements
    text = ''
    for p in document.xpath('//p'):
        text += p.text_content()
    text_db.append(text)
print('Done')
This loop builds a list in which each element represents the text of the corresponding Wikipedia page.
# Print the number of articles
print('Number of articles extracted:', len(text_db))
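To connect this back to the NMF pipeline from the beginning of the post, a minimal sketch would look like the following (assuming text_db holds the raw page texts; the choice of 5 topics and English stop words is arbitrary, for illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Turn the Wikipedia texts into a TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(text_db)

# Factorize into 5 topics (arbitrary choice for illustration)
nmf = NMF(n_components=5, random_state=0)
doc_topic = nmf.fit_transform(tfidf)  # document-topic weights (W)
topic_word = nmf.components_          # topic-word weights (H)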