Use groupby + transform to sum a Boolean Series, which we then use to mask the original DataFrame.
m = df['c1'].eq(1).groupby([df['c2'], df['c3']]).transform('sum').ge(2)

# Alternatively, assign the column first:
# m = df.assign(to_sum=df.c1.eq(1)).groupby(['c2', 'c3']).to_sum.transform('sum').ge(2)

df.loc[m]
#    c1  c2  c3
# 3   1   1   1
# 4   1   1   1
# 5   0   1   1
With filter, count is not the correct logic. Use == (or .eq()) to check where 'c1' is equal to the specific value, then sum the Boolean Series and check that there are at least 2 such occurrences per group for your filter.
df.groupby(['c2', 'c3']).filter(lambda x: x['c1'].eq(1).sum() >= 2)
#    c1  c2  c3
# 3   1   1   1
# 4   1   1   1
# 5   0   1   1
While not noticeable for a small DataFrame, filter with a lambda is horribly slow as the number of groups grows. transform is fast:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({
    'c1': np.random.randint(1, 100, 1000),
    'c2': np.random.randint(1, 100, 1000),
    'c3': np.random.choice([1, 0], 1000)
})

%%timeit
m = df['c1'].eq(1).groupby([df.c2, df.c3]).transform('sum').ge(2)
df.loc[m]
# 5.21 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby(['c2', 'c3']).filter(lambda x: x['c1'].eq(1).sum() >= 2)
# 124 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Alternatively, using groupby + merge:
s = df.groupby(['c2', 'c3']).c1.sum().ge(2)
s[s].index.to_frame().reset_index(drop=True).merge(df, how='left')
   c2  c3  c1
0   1   1   1
1   1   1   1
2   1   1   0
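The same selection can be done without a merge by testing MultiIndex membership; a minimal sketch (the isin mask is an addition, not part of the original answer), which also preserves the original column order:

# reuse s from above, then mask rows whose (c2, c3) pair is a qualifying group
m = df.set_index(['c2', 'c3']).index.isin(s[s].index)
df[m]
#    c1  c2  c3
# 3   1   1   1
# 4   1   1   1
# 5   0   1   1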
Use pandas DataFrame.groupby() to group rows by a column and the count() method to get the count for each group, ignoring None and NaN values; it works with non-floating-point data as well. The example below groups on the Courses column and counts how many times each value is present. You can also use DataFrame.groupby().count() to group on columns and compute the count or size aggregate, which calculates a row count for each group combination. This article covers grouping by single and multiple columns and getting the row counts from a pandas DataFrame using the DataFrame.groupby(), size(), count() and DataFrame.transform() methods, with examples. If you are in a hurry, below are some quick examples of how to group by columns and get the count for each group.
# Below are quick examples

# Using groupby() and count()
df2 = df.groupby(['Courses'])['Courses'].count()

# Using groupby() & count() on multiple columns
df2 = df.groupby(['Courses', 'Duration'])['Fee'].count()

# Using groupby() & size() on multiple columns
df2 = df.groupby(['Courses', 'Duration'])['Fee'].size()

# Using size() and max()
df2 = df.groupby(['Courses', 'Duration']).size().groupby(level=0).max()

# Using size().reset_index()
df2 = df.groupby(['Courses', 'Duration']).size().reset_index(name='counts')

# Using agg('count') and reset_index()
df2 = df.groupby(['Courses', 'Duration'])['Fee'].agg('count').reset_index()

# Using DataFrame.transform()
df2 = df.groupby(['Courses', 'Duration']).Courses.transform('count')

# Using groupby() and size(), keeping the largest count per Duration
print(df.groupby(['Discount', 'Duration']).size()
        .sort_values(ascending=False)
        .reset_index(name='count')
        .drop_duplicates(subset='Duration'))
Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.
# Create a pandas DataFrame.
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000],
    'Duration': ['30days', '50days', '35days', '40days', '60days', '35days', '55days', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
}
df = pd.DataFrame(technologies, columns=['Courses', 'Fee', 'Duration', 'Discount'])
print(df)
Yields the output below.
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   35days      1000
3   Python  24000   40days      1200
4   Pandas  26000   60days      2500
5   Hadoop  25000   35days      1300
6    Spark  25000   55days      1400
7   Python  22000   50days      1600
You can also pass a list of columns to the groupby() method to group on multiple columns and calculate a count over each combination group. For example, df.groupby(['Courses','Duration'])['Fee'].count() groups on the Courses and Duration columns and then calculates the count.
# Using groupby() & count() on multiple columns
df2 = df.groupby(['Courses', 'Duration'])['Fee'].count()
print(df2)
Courses  Duration
Hadoop   35days      2
Pandas   60days      1
PySpark  50days      1
Python   40days      1
         50days      1
Spark    30days      1
         55days      1
Name: Fee, dtype: int64
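The DataFrame.transform() variant from the quick examples is worth a quick demonstration: instead of one row per group, it returns the count aligned to every original row, which is handy for building masks. A small sketch on the same data; the printed values follow from the table above (the two Hadoop/35days rows get 2):

# Using DataFrame.transform() to attach the group count to each row
df2 = df.groupby(['Courses', 'Duration']).Courses.transform('count')
print(df2)
# 0    1
# 1    1
# 2    2
# 3    1
# 4    1
# 5    2
# 6    1
# 7    1
# Name: Courses, dtype: int64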
DataFrameGroupBy.filter returns a copy of the DataFrame excluding filtered elements: elements from groups are filtered out if they do not satisfy the boolean criterion specified by func. Each subframe is endowed with the attribute 'name' in case you need to know which group you are working on. Note that functions that mutate the passed object can produce unexpected behavior or errors and are not supported; see Mutating with User Defined Function (UDF) methods for more details.
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
...                    'B': [1, 2, 3, 4, 5, 6],
...                    'C': [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0
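Because each subframe is endowed with its group label via the name attribute, the filtering function can also branch on the group itself; a minimal sketch in the same session (keeping only the 'bar' group is an illustrative condition, not from the docs):

>>> grouped.filter(lambda x: x.name == 'bar')
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0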
Pandas groupby is a great way to group the values of a dataframe on one or more column values. When performing such operations, you might need to know the number of rows in each group. In this tutorial, we will look at how to count the number of rows in each group of a pandas groupby object.
You can use the pandas groupby size() function to count the number of rows in each group of a groupby object. The following is the syntax:
df.groupby('Col1').size()
Let’s look at some examples of counting the number of rows in each group of a pandas groupby object. First, we will create a sample dataframe that we will be using throughout this tutorial for demonstrating the usage.
import pandas as pd

# create a dataframe
df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Points': [15, 18, 11, 17, 10, 13]
})
# display the dataframe
print(df)
Output:
Team Points
0 A 15
1 A 18
2 B 11
3 B 17
4 B 10
5 C 13
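Let’s group the dataframe on the “Team” column and get the number of rows in each group using the groupby size() function; the counts below follow directly from the data above:

# number of rows in each group
print(df.groupby('Team').size())

Output:

Team
A    2
B    3
C    1
dtype: int64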
You can also use the pandas groupby count() function, which gives the “count” of values in each column for each group. For example, let’s group the dataframe df on the “Team” column and apply the count() function.
# count in each group print(df.groupby('Team').count())
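Output (row counts per group and column, consistent with the sizes above):

      Points
Team
A          2
B          3
C          1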
We get a dataframe of counts of values for each group and each column. Note that the counts match the row sizes we got above; this is because there are no NaN values present in the dataframe. Alternatively, you can use the pandas value_counts() function if you’re grouping by a single column and want the counts.
# using value_counts()
print(df['Team'].value_counts())
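Output (the same counts, sorted in descending order):

B    3
A    2
C    1
Name: Team, dtype: int64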
In this post, we learn about groupby, count, and value_counts, three of the main methods in Pandas for understanding your data’s shape with counts. This is where the Pandas groupby method is useful: you can use groupby to chunk up your data into subsets for further analysis.
>>> import pandas as pd
>>> import numpy as np
>>> url = 'https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv'
>>> df = pd.read_csv(url)
>>> df
          date symbol     open     high      low    close    volume
0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877
1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358
2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522
3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001
4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017
5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167
6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203
7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419
8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384
9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374
10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316
11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047
12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174
13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289
14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559
>>> symbols = df.groupby('symbol')
>>> print(symbols.groups)
{'AAPL': Int64Index([5, 6, 7, 8, 9], dtype='int64'),
 'AMZN': Int64Index([0, 1, 2, 3, 4], dtype='int64'),
 'GOOG': Int64Index([10, 11, 12, 13, 14], dtype='int64')}
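Since this post is about counting, note that the grouped object also gives per-group row counts directly; a quick sketch in the same session (five trading days per symbol):

>>> symbols.size()
symbol
AAPL    5
AMZN    5
GOOG    5
dtype: int64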
>>> def increased(idx):
...     return df.loc[idx].close > df.loc[idx].open
...
>>> df.groupby(increased).groups
{False: Int64Index([2, 3, 4, 7, 8, 9, 13, 14], dtype='int64'),
 True: Int64Index([0, 1, 5, 6, 10, 11, 12], dtype='int64')}
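The function-based grouping can be counted the same way; given the group indices above, the sizes are:

>>> df.groupby(increased).size()
False    8
True     7
dtype: int64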
>>> symbols['volume'].agg(np.mean)
symbol
AAPL 23733309.4
AMZN 4755355.0
GOOG 1321077.0
Name: volume, dtype: float64
>>> for symbol, group in symbols:
...     print(symbol)
...     print(group)
...
AAPL
         date symbol    open    high     low   close    volume
5  2019-03-01   AAPL  174.28  175.15  172.89  174.97  25886167
6  2019-03-04   AAPL  175.69  177.75  173.97  175.85  27436203
7  2019-03-05   AAPL  175.94  176.00  174.54  175.53  19737419
8  2019-03-06   AAPL  174.67  175.49  173.94  174.52  20810384
9  2019-03-07   AAPL  173.87  174.44  172.02  172.50  24796374
AMZN
         date symbol     open     high      low    close   volume
0  2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73  4974877
1  2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17  6167358
2  2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43  3681522
3  2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95  3996001
4  2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95  4957017
GOOG
          date symbol     open     high      low    close   volume
10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99  1450316
11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80  1446047
12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03  1443174
13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86  1099289
14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30  1166559
import pandas as pd
import numpy as np
df = pd.DataFrame({
'continent': ['Asia', 'NorthAmerica', 'NorthAmerica', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'Asia'],
'country': ['China', 'USA', 'Canada', 'Poland', 'Romania', 'Italy', 'India', 'Germany', 'Russia'],
'GDP(trillion)': np.random.randint(1, 9, 9),
'Member_G20': np.random.choice(['Y', 'N'], 9)
})
df.groupby(['continent']).apply(lambda x: x[x['Member_G20'] == 'Y']['GDP(trillion)'].sum())
continent
Asia            19
Europe           5
NorthAmerica     5
dtype: int64
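For comparison, a minimal apply-free sketch of the same aggregation, in the spirit of the transform/mask approach earlier (an addition, not part of the original snippet): zero out the non-G20 rows, then take a plain groupby sum.

# mask non-members to 0, then sum GDP per continent
df['GDP(trillion)'].where(df['Member_G20'].eq('Y'), 0).groupby(df['continent']).sum()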
grp = df.groupby(['continent'])
selected_group = grp.get_group('Europe')
selected_group
selected_group[selected_group['Member_G20'] == 'Y']
In data analysis we often aggregate our data and then apply specific functions to it. Today we’ll learn how to count values on data that we have previously aggregated using the DataFrame.groupby() Pandas method.
Let’s first import the Python Pandas library and acquire data into our Python development environment:
import pandas as pd
hr = pd.read_csv('interview_data.csv')
hr.info()
We can easily aggregate our dataset and count the number of observations related to each programming language in our dataset.
hr.groupby('language').size()
Let’s now assume that we want to show up only programming languages for which we interviewed more than twice during the year. We will first aggregate the data and then define a new column displaying the values we counted.
# groupby
languages = hr.groupby('language').agg(number_of_months=('month', 'count'))

# define condition
filt = languages['number_of_months'] > 2

# filter the DataFrame
languages[filt]
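A transform-based equivalent that keeps the original interview rows instead of the aggregated table; a sketch assuming hr has the language and month columns used above:

# rows belonging to languages interviewed more than twice
hr[hr.groupby('language')['month'].transform('count') > 2]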
In this case, we will first aggregate the data and then count the number of unique distinct values, sorting the result in descending order. The result in this case is a series.
hr.groupby('language')['month'].nunique().sort_values(ascending=False)
Using groupby in pandas to filter a dataframe using count and column value
You can use your logic within a groupby:
import pandas as pd
df = pd.DataFrame({
"ID": ['xyz', 'pqr', 'xyz', 'rst'],
"event_type": ['a', 'b', 'b', 'a']
})
What you are asking for is this:
df.groupby("ID")\
.apply(lambda x: not(len(x) == 1 and not "a" in x["event_type"].values))
as you can check by printing it. Finally, to use this filter, you just run
df = df.groupby("ID")\
.filter(lambda x: not(len(x) == 1 and not "a" in x["event_type"].values))\
.reset_index(drop = True)
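For the small frame above, this keeps both 'xyz' rows (the group has more than one row) and the 'rst' row (its single event is 'a'), while dropping 'pqr'; after the reset_index, the result is:

#     ID event_type
# 0  xyz          a
# 1  xyz          b
# 2  rst          a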