First cast the strings to int with astype, then groupby and aggregate with sum, divide the result by the column totals (sum) using div, and finally multiply by 100:
df = df.astype(int)
a = df.groupby(level=0).sum()
print(a)
        pop1  pop2
female  3000  5000
male    7000  4000

b = df.sum()
print(b)
pop1    10000
pop2     9000
dtype: int64

print(a.div(b).mul(100))
        pop1       pop2
female  30.0  55.555556
male    70.0  44.444444
This is the same as:
df = df.astype(int)
print(df.groupby(level=0).sum().div(df.sum()).mul(100))
        pop1       pop2
female  30.0  55.555556
male    70.0  44.444444
Here is a one-liner:
(df.astype(int) / df.astype(int).sum()).groupby(level=0).sum() * 100
It is a little prettier if you are already dealing with integers:
df = df.astype(int)
(df / df.sum()).groupby(level=0).sum() * 100
The same computation on the underlying NumPy array:
v = df.values.astype(int)
pd.DataFrame(
    v / v.sum(0) * 100, df.index, df.columns
).groupby(level=0).sum()
Put into words, after you convert the data into integers, you then divide each number by the total size of the relevant population, sum up those weights for each gender, and then multiply by 100 so the result looks like a percentage.
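The answers above do not show the input frame, so here is a minimal sketch with made-up data. The individual counts are an assumption; they were only chosen so that the group sums match the printed totals. It also checks that aggregating first and dividing first give the same percentages.

```python
import pandas as pd

# Hypothetical input: counts stored as strings, with gender as the index.
df = pd.DataFrame(
    {
        "pop1": ["1000", "3000", "2000", "4000"],
        "pop2": ["2000", "1500", "3000", "2500"],
    },
    index=["female", "male", "female", "male"],
)

df = df.astype(int)                    # strings -> integers
totals = df.sum()                      # column totals per population
by_gender = df.groupby(level=0).sum()  # one row per index label

# Approach 1: aggregate first, then divide by the column totals.
pct_a = by_gender.div(totals).mul(100)

# Approach 2: turn every row into a share first, then add the shares up.
pct_b = (df / totals).groupby(level=0).sum() * 100

print(pct_a)
```

Each column of the result should add up to 100, which is a quick sanity check for this kind of calculation.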
For DataFrame objects, the grouping key can be a string naming either a column or an index level. A string passed to groupby may refer to either a column or an index level. If a string matches both a column name and an index level name, a ValueError will be raised.
A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.
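A small sketch of both points, using a made-up frame (the level names "first" and "second" and the columns "A" and "B" are assumptions for illustration). It groups by one index level plus one column, then tries a name that is both a column and an index level.

```python
import pandas as pd

# Hypothetical frame: a two-level index ("first", "second") and columns "A", "B".
index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "foo", "foo"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 20, 30, 40]}, index=index)

# Group by the index level "second" and the column "A" together.
mixed = df.groupby([pd.Grouper(level="second"), "A"])["B"].sum()
print(mixed)

# Rename the first index level to "A" so the name clashes with the column "A".
clash = df.copy()
clash.index = clash.index.set_names(["A", "second"])
try:
    clash.groupby("A").sum()
    ambiguous = False
except ValueError:
    ambiguous = True  # the name is both a column and an index level
print(ambiguous)
```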
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
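A rough pandas counterpart of that query might look like the following. The table contents are invented for illustration; only the column names come from the SQL above.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable.
some_table = pd.DataFrame(
    {
        "Column1": ["a", "a", "b", "b"],
        "Column2": ["x", "x", "y", "z"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# GROUP BY Column1, Column2 with a different aggregate per column.
result = (
    some_table.groupby(["Column1", "Column2"], as_index=False)
    .agg({"Column3": "mean", "Column4": "sum"})
)
print(result)
```

Passing as_index=False keeps the group keys as ordinary columns, which is closer to what a SQL result set looks like.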
In [1]: df = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )
   ...:

In [2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

# default is axis=0
In [3]: grouped = df.groupby("class")

In [4]: grouped = df.groupby("order", axis="columns")

In [5]: grouped = df.groupby(["class", "order"])
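The session above only creates the GroupBy objects. As a sketch of what comes next, the following rebuilds the same frame and aggregates max_speed with one key and with two keys. The choice of mean and max is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)

# One key: mean speed per class (missing values are skipped by mean()).
by_class = df.groupby("class")["max_speed"].mean()

# Two keys: the result is indexed by a (class, order) MultiIndex.
by_class_order = df.groupby(["class", "order"])["max_speed"].max()

print(by_class)
print(by_class_order)
```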
In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
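To make the index-level grouping easier to check, here is a sketch that compares three spellings of the same grouping: by level name, by level position, and by the original column. It uses a seeded generator (an assumption, so the numbers differ from the session above).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded, so values differ from the session above
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": rng.standard_normal(8),
        "D": rng.standard_normal(8),
    }
)
df2 = df.set_index(["A", "B"])

by_level_name = df2.groupby(level="A").sum()  # index level referred to by name
by_level_pos = df2.groupby(level=0).sum()     # same level, by position
by_column = df.groupby("A")[["C", "D"]].sum() # same key while "A" is a column

print(by_level_name)
```

All three results should hold the same sums, one row for "bar" and one for "foo".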
In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....:

In [14]: grouped = df.groupby(get_letter_type, axis=1)
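As far as I know, recent pandas releases deprecate the axis argument of groupby, so the call above may warn or fail depending on the installed version. A possible workaround, sketched here on a small made-up numeric frame, is to transpose so that the function is applied to what were the column labels.

```python
import numpy as np
import pandas as pd

def get_letter_type(letter):
    """Classify a single-letter label as 'vowel' or 'consonant'."""
    if letter.lower() in "aeiou":
        return "vowel"
    return "consonant"

# Hypothetical frame with four single-letter columns, all values 1.0.
df = pd.DataFrame(np.ones((2, 4)), columns=["A", "B", "C", "D"])

# Transposing turns the column labels into row labels, so the function
# receives "A", "B", "C", "D" without using the axis argument.
grouped = df.T.groupby(get_letter_type)
sums = grouped.sum().T  # transpose back: one column per letter type

print(sums)
```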
Similar to the SQL GROUP BY clause, the pandas DataFrame.groupby() function collects identical data into groups and performs aggregate functions on the grouped data. A group-by operation involves splitting the data, applying some functions, and finally aggregating the results.
The groupby() function returns a DataFrameGroupBy object after grouping the data in a pandas DataFrame. This object has several methods (sum(), mean(), etc.) that can be used to aggregate the grouped rows. You can also compute several aggregations at the same time by passing a list of aggregation functions to aggregate().
Below is the syntax of the groupby() function. It takes several parameters and returns a DataFrameGroupBy object that contains information about the groups.
# Syntax of DataFrame.groupby()
DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=<no_default>,
observed=False, dropna=True)
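A short sketch of how a few of these parameters change the result. The frame and its column names "key" and "val" are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing key.
df = pd.DataFrame({"key": ["b", "a", None, "b"], "val": [1, 2, 3, 4]})

default = df.groupby("key").sum()                 # keys sorted, missing key dropped
flat = df.groupby("key", as_index=False).sum()    # key stays a regular column
unsorted = df.groupby("key", sort=False).sum()    # keys in order of first appearance
with_nan = df.groupby("key", dropna=False).sum()  # missing key kept as its own group

print(default)
print(with_nan)
```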
In order to explain several examples of how to perform group by, first, let’s create a simple DataFrame with the combination of string and numeric columns.
import pandas as pd
technologies = ({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Hadoop", "Spark", "Python", "NA"],
    'Fee': [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 1500],
    'Duration': ['30 days', '50 days', '55 days', '40 days', '60 days', '35 days', '30 days', '50 days', '40 days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, None, 1400, 1600, 0]
})
df = pd.DataFrame(technologies)
print(df)
This yields the output below.
   Courses    Fee Duration  Discount
0    Spark  22000  30 days    1000.0
1  PySpark  25000  50 days    2300.0
2   Hadoop  23000  55 days    1000.0
3   Python  24000  40 days    1200.0
4   Pandas  26000  60 days    2500.0
5   Hadoop  25000  35 days       NaN
6    Spark  25000  30 days    1400.0
7   Python  22000  50 days    1600.0
8       NA   1500  40 days       0.0
Most of the time you will need to group by multiple columns of a DataFrame. You can do this by passing a list of the column labels you want to group on.
# Group by multiple columns
df2 = df.groupby(['Courses', 'Duration']).sum()
print(df2)
                    Fee  Discount
Courses Duration
Hadoop  35 days   25000       0.0
        55 days   23000    1000.0
NA      40 days    1500       0.0
Pandas  60 days   26000    2500.0
PySpark 50 days   25000    2300.0
Python  40 days   24000    1200.0
        50 days   22000    1600.0
Spark   30 days   47000    2400.0
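Note that the Hadoop / 35 days group shows a Discount of 0.0 even though its only value is missing in the input: sum() skips missing values and returns 0 for an empty group. If that is not what you want, one option is the min_count argument, sketched here on a reduced, hypothetical subset of the data.

```python
import pandas as pd

# Hypothetical subset of the course data; one Discount value is missing.
df = pd.DataFrame(
    {
        "Courses": ["Hadoop", "Hadoop", "Spark", "Spark"],
        "Duration": ["35 days", "55 days", "30 days", "30 days"],
        "Discount": [None, 1000, 1000, 1400],
    }
)

keys = ["Courses", "Duration"]

# Default: a group whose values are all missing sums to 0.0.
default_sum = df.groupby(keys)["Discount"].sum()

# min_count=1: require at least one real value, otherwise the result is NaN.
strict_sum = df.groupby(keys)["Discount"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```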
Create a simple DataFrame, as shown below, with details of employees of different departments.
# Create DataFrame
import pandas as pd

# Create the data of the DataFrame as a dictionary
data_df = {
    'Name': ['Asha', 'Harsh', 'Sourav', 'Riya', 'Hritik',
             'Shivansh', 'Rohan', 'Akash', 'Soumya', 'Kartik'],
    'Department': ['Administration', 'Marketing', 'Technical', 'Technical', 'Marketing',
                   'Administration', 'Technical', 'Marketing', 'Technical', 'Administration'],
    'Employment Type': ['Full-time Employee', 'Intern', 'Intern', 'Part-time Employee', 'Part-time Employee',
                        'Full-time Employee', 'Full-time Employee', 'Intern', 'Intern', 'Full-time Employee'],
    'Salary': [120000, 50000, 70000, 70000, 55000,
               120000, 125000, 60000, 50000, 120000],
    'Years of Experience': [5, 1, 2, 3, 4, 7, 6, 2, 1, 6]
}

# Create the DataFrame
df = pd.DataFrame(data_df)
df
Now, use the groupby function to group the data by ‘Department’ as shown below.
# Use pandas groupby to group rows by department and get only employees of the technical department
df_grouped = df.groupby('Department')
df_grouped.get_group('Technical')
Let us say you want to find the average salary of different departments, then take the ‘Salary’ column from the grouped df and take the mean.
# Group by department and find average salary of each group
df.groupby('Department')['Salary'].mean()
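If you need more than one statistic per department, a list of function names can be passed to agg. The sketch below rebuilds only the two columns it needs from the employee data above; the choice of mean, min and max is just an example.

```python
import pandas as pd

# Department and Salary columns from the employee data above.
df = pd.DataFrame(
    {
        "Department": ["Administration", "Marketing", "Technical", "Technical",
                       "Marketing", "Administration", "Technical", "Marketing",
                       "Technical", "Administration"],
        "Salary": [120000, 50000, 70000, 70000, 55000,
                   120000, 125000, 60000, 50000, 120000],
    }
)

# One row per department, one column per aggregation.
salary_stats = df.groupby("Department")["Salary"].agg(["mean", "min", "max"])
print(salary_stats)
```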
The groups attribute of the grouped object is a dictionary-like mapping: its keys are the group keys, and the value for each key holds the row index labels that share that group key.
# View the indices of the rows which are in the same group
print(df_grouped.groups)
# > { 'Administration': [0, 5, 9], 'Marketing': [1, 4, 7], 'Technical': [2, 3, 6, 8] }
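The values in that mapping are pandas Index objects rather than plain lists. A minimal sketch, on a shortened and hypothetical version of the employee data, of converting them to lists and of iterating over the groups directly:

```python
import pandas as pd

# Shortened, hypothetical version of the employee data.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Harsh", "Sourav", "Riya", "Hritik"],
        "Department": ["Administration", "Marketing", "Technical",
                       "Technical", "Marketing"],
    }
)
df_grouped = df.groupby("Department")

# .groups maps each key to an Index of row labels; convert to plain lists.
row_labels = {key: idx.tolist() for key, idx in df_grouped.groups.items()}

# Iterating yields (key, sub-DataFrame) pairs, one per group.
names = {key: part["Name"].tolist() for key, part in df_grouped}

print(row_labels)
print(names)
```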