Pandas comes with a whole host of SQL-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic. Here's a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.

You'll also see that your grouping column is now the dataframe's index; reset your index to make this easier to work with later on. Applying multiple aggregation functions to a single column will result in a MultiIndex. Working with multi-indexed columns is a pain, and I'd recommend flattening this after aggregating by renaming the new columns.

It's simple to extend this to work with multiple grouping variables. Say you want to summarise player age by team AND position: you can do this by passing a list of column names to groupby instead of a single string value.
import pandas as pd
data = {
"Team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
"Pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
"Age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]
}
df = pd.DataFrame(data)
print(df)
# group by Team, get mean, min, and max value of Age for each value of Team
grouped_single = df.groupby('Team').agg({'Age': ['mean', 'min', 'max']})
print(grouped_single)
# rename columns to flatten the MultiIndex
grouped_single.columns = ['age_mean', 'age_min', 'age_max']
# reset index to get grouped columns back
grouped_single = grouped_single.reset_index()
print(grouped_single)
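As an aside, if you're on pandas 0.25 or newer, named aggregation produces flat column names directly, so the renaming step above becomes unnecessary. A minimal sketch of the same summary:

# named aggregation (pandas >= 0.25): no MultiIndex, no manual renaming
grouped_single = (
    df.groupby('Team')
      .agg(age_mean=('Age', 'mean'),
           age_min=('Age', 'min'),
           age_max=('Age', 'max'))
      .reset_index()
)
print(grouped_single)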
grouped_multiple = df.groupby(['Team', 'Pos']).agg({
'Age': ['mean', 'min', 'max']
})
grouped_multiple.columns = ['age_mean', 'age_min', 'age_max']
grouped_multiple = grouped_multiple.reset_index()
print(grouped_multiple)
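The next few snippets come from a question about summarising item costs across several shops. The asker's dataframe isn't shown, so here is a hypothetical stand-in; the column names (Item, Category, shop1 to shop3) and the values are assumptions chosen to reproduce the outputs quoted below:

import pandas as pd

# hypothetical shop-cost data; the real frame from the question is not shown
df = pd.DataFrame({
    'Item': ['Book', 'Shirt', 'Phone', 'Laptop'],
    'Category': ['Books', 'Clothes', 'Technology', 'Technology'],
    'shop1': [17, 45, 200, 300],
    'shop2': [20, 50, 250, 350],
    'shop3': [21, 53, 300, 400],
})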
df.groupby('Category').agg({
'Item': 'size',
'shop1': ['sum', 'mean', 'std'],
'shop2': ['sum', 'mean', 'std'],
'shop3': ['sum', 'mean', 'std']
})
Or if you want it across all shops then:
df1 = df.set_index(['Item', 'Category']).stack().reset_index().rename(columns={
    'level_2': 'Shops',
    0: 'costs'
})
df1.groupby('Category').agg({
'Item': 'size',
'costs': ['sum', 'mean', 'std']
})
Here we set up a very similar dictionary, using its keys to specify the aggregation functions and the dictionary itself to rename the resulting columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category')\
    .agg(list(rnm_cols)).rename(columns=rnm_cols)
Size Sum Mean Std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
Option 1: use agg.
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
# unpack as keyword arguments: named aggregation replaced the dict-based
# renaming in agg, which was removed in pandas 1.0
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(**agg_funcs)
            Size   Sum        Mean        Std
Category
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
Option 2 (more for less): use describe.
# on a grouped Series, describe() already returns one row per group in
# modern pandas, so the trailing .unstack() older answers used is unnecessary
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe()
            count        mean        std    min    25%    50%    75%    max
Category
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0
If I understand correctly, you want to calculate aggregate metrics for all shops, not for each individually. To do that, you can first stack your dataframe and then group by Category:
stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({
'Price': ['count', 'sum', 'mean', 'std']
})
Which results in
           Price
           count   sum        mean       std
Category
Books          3    58   19.333333  2.081666
Clothes        3   148   49.333333  4.041452
Technology     6  1800  300.000000  70.710678
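If you prefer, melt does the same reshape as set_index/stack in one call. A sketch using the hypothetical shop frame from earlier:

# long format via melt, then aggregate across all shops at once
long_df = df.melt(id_vars=['Item', 'Category'], var_name='Shop', value_name='Price')
print(long_df.groupby('Category')['Price'].agg(['count', 'sum', 'mean', 'std']))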
How do you group by multiple columns in a pandas DataFrame and compute multiple aggregations? groupby() can take a list of columns, and the aggregate functions can apply one or several aggregations at the same time; you can also compute multiple aggregations at once by passing a list to aggregate(). Most of the time when working on a real project in pandas you are required to group by multiple columns, which you do by passing a list of column names to DataFrame.groupby(). Let's create a DataFrame to understand this with examples.
Following are examples of how to groupby on multiple columns & apply multiple aggregations.
# Quick Examples

# Groupby multiple columns
result = df.groupby(['Courses', 'Fee']).count()
print(result)

# Groupby multiple columns and aggregate on selected column
result = df.groupby(['Courses', 'Fee'])['Courses'].count()
print(result)

# Groupby multiple columns and aggregate()
result = df.groupby(['Courses', 'Fee'])['Duration'].aggregate('count')
print(result)

# Groupby multiple aggregations
result = df.groupby('Courses')['Fee'].aggregate(['min', 'max'])
print(result)

# Groupby & multiple aggregations on different columns
result = df.groupby('Courses').aggregate({'Duration': 'count', 'Fee': ['min', 'max']})
print(result)
import pandas as pd
technologies = {
'Courses': ["Spark", "PySpark", "Hadoop", "Python", "PySpark", "Spark", "Spark"],
'Fee': [20000, 25000, 26000, 22000, 25000, 20000, 35000],
'Duration': ['30day', '40days', '35days', '40days', '60days', '60days', '70days'],
'Discount': [1000, 2300, 1200, 2500, 2000, 2000, 3000]
}
df = pd.DataFrame(technologies)
print(df)
This yields the output below.
   Courses    Fee Duration  Discount
0    Spark  20000    30day      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1200
3   Python  22000   40days      2500
4  PySpark  25000   60days      2000
5    Spark  20000   60days      2000
6    Spark  35000   70days      3000
The first quick example, df.groupby(['Courses', 'Fee']).count(), counts every remaining column for each group:

               Duration  Discount
Courses Fee
Hadoop  26000         1         1
PySpark 25000         2         2
Python  22000         1         1
Spark   20000         2         2
        35000         1         1
So when you want a group-by count, just select a single column to count; you can even select one of your grouping columns.
# Groupby multiple columns and count a single column
result = df.groupby(['Courses', 'Fee'])['Courses'].count()
print(result)
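If you'd rather have a flat DataFrame than a MultiIndexed Series, reset_index with a name argument converts the result; a small sketch:

# turn the count Series into a DataFrame with a named count column
result = (
    df.groupby(['Courses', 'Fee'])['Courses']
      .count()
      .reset_index(name='Count')
)
print(result)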
In today's post we would like to show how to use the DataFrame groupby method in Pandas in order to aggregate data by one or multiple column values. Can you group your data set by multiple columns in Pandas? You bet!
Let's assume we have a very simple data set consisting of some HR-related information, which we'll use throughout this tutorial.
import pandas as pd
candidates_df = pd.read_csv('candidates')
Let’s take a look at our DataFrame:
print(candidates_df)
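The candidates file isn't bundled with the post, so if you want to follow along, a small hand-made frame works just as well; every column name and value below is invented to match the snippets that follow:

# hypothetical stand-in for the candidates CSV
candidates_df = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar'],
    'language': ['Python', 'R', 'Python', 'R', 'Python'],
    'num_candidates': [12, 7, 9, 4, 11],
    'salary': [85000, 78000, 90000, 82000, 88000],
})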
We'll first aggregate the number of candidates by month. In order to do so, we'll create a new DataFrame that contains the aggregated value, and assign the name num_cand_month to the newly created aggregating column.
candidates_by_month = candidates_df.groupby('month').agg(num_cand_month = ('num_candidates', 'sum'))
print(candidates_by_month)
Note: we could also pass a dictionary containing the column to aggregate and the functions to use. In this case we show multiple aggregations (min, mean and max) for the same column:
candidates_salary_by_month = candidates_df.groupby('month')\
.agg({
'salary': ['min', 'mean', 'max']
})
Here's an example of multiple aggregations per grouping, each with its own calculated function: a sum of the aggregating column and an average calculation.
# multiple grouping columns, with a named aggregation for each statistic
candidates_month_languages = candidates_df.groupby(['language', 'month'])\
    .agg(num_cand_month=('num_candidates', 'sum'),
         avg_sal=('salary', 'mean'))
print(candidates_month_languages)
One aspect that I've recently been exploring is the task of grouping large data frames by different variables, and applying summary functions on each group. This is accomplished in Pandas using the groupby() and agg() functions of Pandas' DataFrame objects. To apply multiple functions to a single column in your grouped data, pass a list of functions as the value in your aggregation dictionary.
The examples below use a phone-usage log, phone_data.csv; phone numbers were removed for privacy. The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil

# Load data from csv file (pd.DataFrame.from_csv was removed from pandas; use pd.read_csv)
data = pd.read_csv('phone_data.csv')

# Convert date from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
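If you don't have phone_data.csv to hand, a tiny stand-in with the same columns lets you run the snippets below; note that the outputs quoted in this section come from the full 830-row dataset, so your numbers will differ:

# hypothetical stand-in rows mimicking the phone_data.csv schema
data = pd.DataFrame({
    'date': pd.to_datetime(['2014-10-15 06:58', '2014-10-15 08:21', '2014-10-16 14:46']),
    'duration': [34.429, 13.0, 23.0],
    'item': ['data', 'call', 'sms'],
    'month': ['2014-11', '2014-11', '2014-11'],
    'network': ['data', 'Vodafone', 'Meteor'],
    'network_type': ['data', 'mobile', 'mobile'],
})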
Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object's .groups variable is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. For example:
data.groupby(['month']).groups.keys()
Out[59]: ['2014-12', '2014-11', '2015-02', '2015-03', '2015-01']
len(data.groupby(['month']).groups['2014-11'])
Out[61]: 230
You can also group by more than one variable, allowing more complex queries.
# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()
Out[76]:
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64
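If you want the month/item counts side by side rather than stacked, unstack pivots the inner index level into columns; a quick sketch:

# months as rows, items (call/data/sms) as columns
print(data.groupby(['month', 'item'])['date'].count().unstack())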
# How many calls, texts, and data are sent per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()
Out[82]:
month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
....
You can control whether the result is a Series or a DataFrame by selecting your operation column differently:
# produces a Pandas Series
data.groupby('month')['duration'].sum()

# produces a Pandas DataFrame
data.groupby('month')[['duration']].sum()
Basically, how can I do this SQL statement:
SELECT (max(columnA) + sum(columnB)) * 0.4 AS colResult, columnC FROM table GROUP BY columnD
The only way I found is
ddff = df.groupby('columnD').agg(colA=('columnA', 'max'), colB=('columnB', 'sum'))
and then
ddff["colResult"] = 0.4 * (ddff.colA + ddff.colB)
data.groupby(['month']).groups.keys() throws an AttributeError: 'generator' object has no attribute 'keys'
Finally, the pandas user guide covers grouping a DataFrame with index levels and columns in depth. In SQL terms, a grouped aggregation looks like this:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
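A pandas translation of that query might look like the sketch below, assuming SomeTable has been loaded into a DataFrame called df:

# GROUP BY Column1, Column2 with per-column aggregations
df.groupby(['Column1', 'Column2']).agg({'Column3': 'mean', 'Column4': 'sum'})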
# (the docs examples assume numpy is imported as np alongside pandas)
In[1]: df = pd.DataFrame(
  ...:     [
  ...:         ("bird", "Falconiformes", 389.0),
  ...:         ("bird", "Psittaciformes", 24.0),
  ...:         ("mammal", "Carnivora", 80.2),
  ...:         ("mammal", "Primates", np.nan),
  ...:         ("mammal", "Carnivora", 58),
  ...:     ],
  ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
  ...:     columns=("class", "order", "max_speed"),
  ...: )
In[2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0
# default is axis=0
In[3]: grouped = df.groupby("class")
# note: the axis argument to groupby is deprecated in recent pandas versions
In[4]: grouped = df.groupby("order", axis="columns")
In[5]: grouped = df.groupby(["class", "order"])
In[6]: df = pd.DataFrame(
  ...:     {
  ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
  ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
  ...:         "C": np.random.randn(8),
  ...:         "D": np.random.randn(8),
  ...:     }
  ...: )
In[7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
In[8]: grouped = df.groupby("A")
In[9]: grouped = df.groupby(["A", "B"])
In[10]: df2 = df.set_index(["A", "B"])
In[11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
In[12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
In[13]: def get_letter_type(letter):
  ....:     if letter.lower() in 'aeiou':
  ....:         return 'vowel'
  ....:     else:
  ....:         return 'consonant'
  ....:

In[14]: grouped = df.groupby(get_letter_type, axis=1)