IIUC, use value_counts():
In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64
Or,
pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
For the top N, for example N = 3:
In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64
You need str.cat with lower first to concatenate all values into one string, then word_tokenize, and finally your solution:
import nltk

top_N = 4
# lowercase first; omit if case-sensitive counts are wanted
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>
rslt = pd.DataFrame(word_dist.most_common(top_N),
columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 society 3
1 ltd 2
2 the 1
3 co 1
It is also possible to remove lower if case should be preserved:
top_N = 4
a = data['Firm_Name'].str.cat(sep = ' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
columns = ['Word', 'Frequency'])
print(rslt)
Word Frequency
0 Society 3
1 Ltd 2
2 MMV 1
3 Kensington 1
This answer can also be used - Count distinct words from a Pandas Data Frame. It uses the Counter class and applies it to each row.
from collections import Counter
c = Counter()
df = pd.DataFrame(
[
[104472, "R.X. Yah & Co"],
[104873, "Big Building Society"],
[109986, "St James's Society"],
[114058, "The Kensington Society Ltd"],
[113438, "MMV Oil Associates Ltd"]
], columns = ["URN", "Firm_Name"])
df.Firm_Name.str.split().apply(c.update)
Counter({
'R.X.': 1,
'Yah': 1,
'&': 1,
'Co': 1,
'Big': 1,
'Building': 1,
'Society': 3,
'St': 1,
"James's": 1,
'The': 1,
'Kensington': 1,
'Ltd': 2,
'MMV': 1,
'Oil': 1,
'Associates': 1
})
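To get the top N words from the running counter, Counter.most_common can be used directly; a small follow-up on the snippet above:

# two most frequent words from the counter built above
print(c.most_common(2))
# [('Society', 3), ('Ltd', 2)]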
Input: str = "Apple Mango Orange Mango Guava Guava Mango"
Output: frequency of Apple is: 1
frequency of Mango is: 3
frequency of Orange is: 1
frequency of Guava is: 2
Input: str = "Train Bus Bus Train Taxi Aeroplane Taxi Bus"
Output: frequency of Train is: 2
frequency of Bus is: 3
frequency of Taxi is: 2
frequency of Aeroplane is: 1
1. Split the string into a list containing the words by using the split function (i.e. string.split()) in Python with a space as the delimiter.
Note:
The string_name.split(separator) method is used to split the string
into a list by the specified separator (delimiter).
If the separator is not provided, then whitespace is used.
For example:
CODE: str = 'This is my book'
str.split()
OUTPUT: ['This', 'is', 'my', 'book']
2. Iterate over the new list and use the count function (i.e. string.count(newstring[iteration])) to find the frequency of each word at each iteration; see the sketch after the note below.
Note:
string_name.count(substring) is used to find the number of occurrences of
a substring in a given string.
For example:
CODE: str = 'Apple Mango Apple'
str.count('Apple')
str2 = 'Apple'
str.count(str2)
OUTPUT: 2
2
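Putting the two steps together; a minimal sketch of the approach described above (variable names are illustrative):

# split the string into words, then count each distinct word
str1 = "Apple Mango Orange Mango Guava Guava Mango"
words = str1.split()
# dict.fromkeys keeps the first-seen order of the distinct words
for word in dict.fromkeys(words):
    print("frequency of", word, "is:", words.count(word))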
In this tutorial, we'll look at how to count the frequency of each word in a string corpus in Python, and compare the frequencies with visualizations like bar charts. To count the frequency of each word in a string, you'll first have to tokenize the string into individual words, for example with the string split() function. Then, you can use collections.Counter to count each element in the list, which returns an object that is essentially a dictionary of word-to-frequency mappings. The following is the syntax:
import collections
s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))
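Printing the counter shows the word-to-frequency mapping (Counter displays entries from most to least common):

print(s_counts)
# Counter({'the': 2, 'cat': 1, 'and': 1, 'dog': 1, 'are': 1, 'fighting': 1})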
First we load the data as a pandas dataframe using the read_csv() function.
import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())
Output:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
If we look at the entries in the "review" column, we can find that the reviews contain a number of unwanted elements, such as HTML tags, punctuation, and inconsistent use of lower and upper case, that could hinder our analysis. For example,
print(reviews_df['review'][1])
You can see that the above review contains HTML tags, quotes, punctuation, etc. that could be cleaned. Let's write a function to clean the text in the reviews.
import re
import string

def clean_text(text):
    """
    Function to clean the text.
    Parameters:
        text: the raw text as a string value that needs to be cleaned
    Returns:
        cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuation
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    return cleaned_text.strip()
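With the cleaning function in place, one way to get word frequencies for the corpus is to clean every review and feed the tokens to collections.Counter; a minimal sketch, assuming the reviews_df loaded above (the review_cleaned column name is illustrative):

import collections

# clean each review, then count words across the whole corpus
reviews_df['review_cleaned'] = reviews_df['review'].apply(clean_text)
word_counts = collections.Counter(' '.join(reviews_df['review_cleaned']).split())
print(word_counts.most_common(10))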
This tutorial will show you how to leverage NLTK to create word frequency counts and use these to create a word cloud. Let's review the code below.
# loading in all the essentials for data manipulation
import pandas as pd
import numpy as np
# load in the NLTK stopwords to remove articles, prepositions and other words that are not actionable
from nltk.corpus import stopwords
# this allows us to create individual objects from a bag of words
from nltk.tokenize import word_tokenize
# the lemmatizer helps to reduce words to their base form
from nltk.stem import WordNetLemmatizer
# ngrams allows us to group words in common pairs or trigrams, etc.
from nltk import ngrams
# we can use Counter to count the objects
from collections import Counter
# these are our visual libraries
import seaborn as sns
import matplotlib.pyplot as plt
Now that you have the basic libraries, you can review the function below, which cleans the text, lowercases it, removes numbers, and creates data frames for word counts.
def word_frequency(sentence):
    # joins all the sentences
    sentence = " ".join(sentence)
    # creates tokens, lowercases them, removes stopwords and numbers, and lemmatizes the words
    new_tokens = word_tokenize(sentence)
    new_tokens = [t.lower() for t in new_tokens]
    new_tokens = [t for t in new_tokens if t not in stopwords.words('english')]
    new_tokens = [t for t in new_tokens if t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    new_tokens = [lemmatizer.lemmatize(t) for t in new_tokens]
    # counts the words, pairs and trigrams
    counted = Counter(new_tokens)
    counted_2 = Counter(ngrams(new_tokens, 2))
    counted_3 = Counter(ngrams(new_tokens, 3))
    # creates 3 data frames and returns them
    word_freq = pd.DataFrame(counted.items(), columns=['word', 'frequency']).sort_values(by='frequency', ascending=False)
    word_pairs = pd.DataFrame(counted_2.items(), columns=['pairs', 'frequency']).sort_values(by='frequency', ascending=False)
    trigrams = pd.DataFrame(counted_3.items(), columns=['trigrams', 'frequency']).sort_values(by='frequency', ascending=False)
    return word_freq, word_pairs, trigrams
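To produce the frames used in the plotting code below, call the function on your text column; a small usage sketch (df and its 'text' column are placeholder names):

# word, pair and trigram frequency tables for a text column
data2, data3, data4 = word_frequency(df['text'])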
The next step would be to visualize these words so that you can see how they stack up in terms of frequency.
# create subplots of the different data frames
fig, axes = plt.subplots(3, 1, figsize=(8, 20))
sns.barplot(ax=axes[0], x='frequency', y='word', data=data2.head(30))
sns.barplot(ax=axes[1], x='frequency', y='pairs', data=data3.head(30))
sns.barplot(ax=axes[2], x='frequency', y='trigrams', data=data4.head(30))