You can use '|'.join on a list of words to build a regex pattern that matches any of the words (at least one). You can then pass that pattern to the pandas.Series.str.contains() method to create a boolean mask for the matches.
import pandas as pd

# create regex pattern out of the list of words
positive_kw = '|'.join(['rise', 'positive', 'high', 'surge'])
negative_kw = '|'.join(['sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses'])
neutral_kw = '|'.join(['flat', 'neutral'])

# creating some fake data for demonstration
words = [
    'rise high',
    'positive attitude',
    'something',
    'foo',
    'lowercase',
    'flat earth',
    'neutral opinion'
]

df = pd.DataFrame(data=words, columns=['words'])
df['positive'] = df['words'].str.contains(positive_kw).astype(int)
df['negative'] = df['words'].str.contains(negative_kw).astype(int)
df['neutral'] = df['words'].str.contains(neutral_kw).astype(int)

print(df)
Output:
               words  positive  negative  neutral
0          rise high         1         0        0
1  positive attitude         1         0        0
2          something         0         0        0
3                foo         0         0        0
4          lowercase         0         1        0
5         flat earth         0         0        1
6    neutral opinion         0         0        1
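Note that 'lowercase' is flagged as negative in the output above, because a plain alternation matches substrings ('lower' inside 'lowercase'). If whole-word matching is wanted, one possible variation is to wrap the alternation in word boundaries. This is only a sketch: the `build_pattern` helper and the sample rows are hypothetical and not part of the original answer.

```python
import re
import pandas as pd

def build_pattern(keywords):
    # Hypothetical helper: escape each keyword, then wrap the alternation
    # in \b word boundaries so only whole words match.
    # A non-capturing group (?:...) is used so str.contains does not warn
    # about match groups.
    return r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'

negative_kw = build_pattern(['sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses'])

# Made-up rows for demonstration
df = pd.DataFrame({'words': ['lowercase', 'prices fall', 'heavy losses', 'flat earth']})
df['negative'] = df['words'].str.contains(negative_kw).astype(int)

print(df)
```

With this pattern 'lowercase' is no longer counted as negative, while 'prices fall' and 'heavy losses' still are. Whether substring or whole-word matching is the right behaviour depends on the data.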
I have a dataframe containing many rows of strings: btb['Title']. I would like to identify whether each string contains positive, negative or neutral keywords. The following works but is considerably slow:
import numpy as np

positive_kw = ('rise', 'positive', 'high', 'surge')
negative_kw = ('sink', 'lower', 'fall', 'drop', 'slip', 'loss', 'losses')
neutral_kw = ('flat', 'neutral')

# create new columns, initialised to NaN
btb['Positive'] = np.nan
btb['Negative'] = np.nan
btb['Neutral'] = np.nan

# turn value to one if keyword exists in sentence
for index, row in btb.iterrows():
    if any(s in row.Title for s in positive_kw):
        btb.loc[index, 'Positive'] = 1
    if any(s in row.Title for s in negative_kw):
        btb.loc[index, 'Negative'] = 1
    if any(s in row.Title for s in neutral_kw):
        btb.loc[index, 'Neutral'] = 1
Option 1
In[691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                   for _, x in df2.iterrows()]).sum(axis=0)
Out[691]: array([0.75, 0., -0.3, 0.2, 0.45, 0.2])
Option 2
In[674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0    0.75
1    0.00
2   -0.30
3    0.20
4    0.45
5    0.20
Name: Texts, dtype: float64
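The inputs df1 and df2 for the two options are not shown, so here is a minimal self-contained sketch with made-up data, assuming df1 has a Texts column and df2 has SubString and Score columns as the snippets suggest. It passes regex=False to str.contains (an addition, so the substrings are treated literally) and checks that both options give the same totals.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the original df1 and df2 are not shown
df1 = pd.DataFrame({'Texts': ['apple pie', 'banana split', 'plain rice']})
df2 = pd.DataFrame({'SubString': ['apple', 'pie', 'banana'],
                    'Score': [0.5, 0.25, -0.3]})

# Option 1: one vectorised str.contains per substring, then sum the scores
option1 = np.array([
    np.where(df1.Texts.str.contains(x.SubString, regex=False), x.Score, 0)
    for _, x in df2.iterrows()
]).sum(axis=0)

# Option 2: for each text, sum the scores of the substrings it contains
option2 = df1.Texts.apply(
    lambda text: df2.Score[df2.SubString.apply(lambda sub: sub in text)].sum()
)

print(option1)
print(option2)
```

Option 1 loops over the (usually short) list of substrings and lets pandas scan the whole column each time; Option 2 loops over every text in Python, which tends to matter more as df1 grows.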
To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.

To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).

Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is used.

Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.
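A minimal sketch of these steps, assuming a small hypothetical `people` table in place of the Titanic data (so the index label found here is not 307):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table, with non-consecutive index labels
people = pd.DataFrame(
    {'Name': ['Braund, Mr. Owen Harris',
              'Allen, Mr. William Henry',
              'Behr, Mr. Karl Howell']},
    index=[0, 4, 889],
)

# Element-wise lowercase via the str accessor
lower_names = people['Name'].str.lower()

# Length of each name, again element-wise
name_lengths = people['Name'].str.len()

# idxmax works on the integer lengths and returns the index label of the maximum
longest_label = name_lengths.idxmax()

# Select the value by row label and column name with loc
longest_name = people.loc[longest_label, 'Name']

print(longest_label, longest_name)
```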
In[1]: import pandas as pd
In[2]: titanic = pd.read_csv("data/titanic.csv")
In[3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex  ...  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  ...      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  ...      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  ...      0            373450   8.0500    NaN         S
[5 rows x 12 columns]
In[4]: titanic["Name"].str.lower()
Out[4]:
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object
In[5]: titanic["Name"].str.split(",")
Out[5]:
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object
In[6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
In[7]: titanic["Surname"]
Out[7]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object
In[8]: titanic["Name"].str.contains("Countess")
Out[8]:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Name, Length: 891, dtype: bool
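A boolean Series like the one above can be used as a mask to select the matching rows. A minimal sketch, assuming a tiny hypothetical `passengers` table rather than the full Titanic data:

```python
import pandas as pd

# Hypothetical two-row table for illustration
passengers = pd.DataFrame({
    'Name': ['Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
             'Dooley, Mr. Patrick'],
})

# True where the name contains the substring "Countess"
is_countess = passengers['Name'].str.contains('Countess')

# Boolean indexing keeps only the rows where the mask is True
countesses = passengers[is_countess]

print(countesses)
```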
user_df['name'].str.split()