A percentage is calculated by dividing a value by the sum of all the values and then multiplying the result by 100. The same formula applies to pandas DataFrames. Here, the built-in sum() method of a pandas Series computes the sum of all the values in a column.
Formula:
df['percent'] = (df['column_name'] / df['column_name'].sum()) * 100
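As a minimal sketch of the formula, assuming a small made-up DataFrame (the column names `item` and `amount` are hypothetical), each value is divided by the column total and scaled by 100:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "amount": [10, 20, 30, 40],
})

# Divide each value by the column total, then multiply by 100.
df["percent"] = (df["amount"] / df["amount"].sum()) * 100

print(df)
```

The values in the new column should add up to 100, apart from floating-point rounding.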
This answer by caner using transform looks much better than my original answer!
df['sales'] / df.groupby('state')['sales'].transform('sum')
Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way: just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:
# From Paul H
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': list(range(1, 7)) * 2,
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
Returns:
                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
You need to make a second groupby object that groups by the states, and then use the div method:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
state_office = df.groupby(['state', 'office_id']).agg({
'sales': 'sum'
})
state = df.groupby(['state']).agg({
'sales': 'sum'
})
state_office.div(state, level = 'state') * 100
                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
So using transform, the solution is a one-liner:
df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
And if you print:
print(df.sort_values(['state', 'office_id']).reset_index(drop = True))
state office_id sales %
0 AZ 2 195197 9.844309
1 AZ 4 877890 44.274352
2 AZ 6 909754 45.881339
3 CA 1 614752 50.415708
4 CA 3 395340 32.421767
5 CA 5 209274 17.162525
6 CO 1 549430 42.659629
7 CO 3 457514 35.522956
8 CO 5 280995 21.817415
9 WA 2 828238 35.696929
10 WA 4 719366 31.004563
11 WA 6 772590 33.298509
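One way to sanity-check the transform one-liner is to confirm that the percentages within each state add up to 100. A sketch with made-up sales figures (not the random data above), chosen so the shares are easy to verify by hand:

```python
import pandas as pd

# Hypothetical sales figures, for illustration only.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA", "WA"],
    "office_id": [1, 3, 2, 4, 6],
    "sales": [100, 300, 50, 50, 100],
})

# transform('sum') returns the state total repeated on every row of that
# state, so the division lines up row by row with the original frame.
df["%"] = 100 * df["sales"] / df.groupby("state")["sales"].transform("sum")

# Each state's percentages should add up to 100.
totals = df.groupby("state")["%"].sum()

print(df)
print(totals)
```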
For conciseness I'd use the SeriesGroupBy:
In[11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
In[12]: c
Out[12]:
state office_id
AZ 2 925105
4 592852
6 362198
CA 1 819164
3 743055
5 292885
CO 1 525994
3 338378
5 490335
WA 2 623380
4 441560
6 451428
Name: count, dtype: int64
In[13]: c / c.groupby(level = 0).sum()
Out[13]:
state office_id
AZ 2 0.492037
4 0.315321
6 0.192643
CA 1 0.441573
3 0.400546
5 0.157881
CO 1 0.388271
3 0.249779
5 0.361949
WA 2 0.411101
4 0.291196
6 0.297703
Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In[21]: c = df.groupby(["Group 1", "Group 2", "Final Group"])["Numbers I want as percents"].sum().rename("count")
In[22]: c / c.groupby(level = [0, 1]).transform("sum")
Out[22]:
Group 1 Group 2 Final Group
AAHQ BOSC OWON 0.331006
TLAM 0.668994
MQVF BWSI 0.288961
FXZM 0.711039
ODWV NFCH 0.262395
...
Name: count, dtype: float64
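The multi-level case can be sketched on a tiny frame. The column names `g1`, `g2`, `final` and `n` below are hypothetical stand-ins for the grouping and value columns. Because transform returns a result with the same index as `c`, the division is element-wise and needs no index alignment:

```python
import pandas as pd

# Hypothetical data: two outer grouping levels and one final level.
df = pd.DataFrame({
    "g1": ["A", "A", "A", "B", "B"],
    "g2": ["x", "x", "y", "x", "x"],
    "final": ["p", "q", "p", "p", "q"],
    "n": [1, 3, 5, 2, 2],
})

# Sum per (g1, g2, final) combination.
c = df.groupby(["g1", "g2", "final"])["n"].sum().rename("count")

# Total per (g1, g2) pair, broadcast back to the full three-level index.
outer_total = c.groupby(level=[0, 1]).transform("sum")

# Share of each final group within its (g1, g2) pair.
share = c / outer_total

print(share)
```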
I think this needs benchmarking. Using OP's original DataFrame,
df = pd.DataFrame({
'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
'office_id': list(range(1, 7)) * 2,
'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})
As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.
c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level = 0).sum()
2nd Paul H
state_office = df.groupby(['state', 'office_id']).agg({
'sales': 'sum'
})
state = df.groupby(['state']).agg({
'sales': 'sum'
})
state_office.div(state, level = 'state') * 100
Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,
import string
import numpy as np
import pandas as pd
np.random.seed(0)
groups = [
''.join(i) for i in zip(
np.random.choice(np.array([i
for i in string.ascii_lowercase
]), 30000),
np.random.choice(np.array([i
for i in string.ascii_lowercase
]), 30000),
np.random.choice(np.array([i
for i in string.ascii_lowercase
]), 30000),
)
]
df = pd.DataFrame({
'state': groups * 400,
'office_id': list(range(1, 601)) * 20000,
'sales': [np.random.randint(100000, 999999)
for _ in range(12)
] * 1000000
})
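The two approaches can be timed with the standard-library timeit module. The sketch below uses a much smaller, made-up frame (10,000 rows, 8 states) so that it runs quickly. The function names are hypothetical, and timings vary by machine and pandas version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # assumed size, far smaller than the 12,000,000-row frame

df = pd.DataFrame({
    "state": rng.choice(list("ABCDEFGH"), n),
    "office_id": rng.integers(1, 7, n),
    "sales": rng.integers(100_000, 999_999, n),
})


def series_groupby(frame):
    """SeriesGroupBy approach: returns fractions between 0 and 1."""
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    return c / c.groupby(level=0).sum()


def div_method(frame):
    """Second groupby plus div: returns percentages between 0 and 100."""
    state_office = frame.groupby(["state", "office_id"]).agg({"sales": "sum"})
    state = frame.groupby(["state"]).agg({"sales": "sum"})
    return state_office.div(state, level="state") * 100


# Time each approach over the same number of repetitions.
for func in (series_groupby, div_method):
    seconds = timeit.timeit(lambda: func(df), number=20)
    print(f"{func.__name__}: {seconds / 20:.6f} s per call")
```

Before comparing timings, it is worth confirming that both functions return the same numbers, up to the factor of 100.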
The following snippet fulfills these criteria:
df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x / x.sum())
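The lambda is called once per group in Python. Dividing the column by transform('sum') should give the same ratios with a single vectorised division, which is usually the faster choice when there are many groups. A sketch on made-up data that checks the two agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures, for illustration only.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA", "WA"],
    "sales": [100, 300, 50, 50, 100],
})

# Per-group lambda, as in the snippet above.
ratio_lambda = df.groupby(["state"])["sales"].transform(lambda x: x / x.sum())

# Same ratio using the built-in 'sum' aggregation and one division.
ratio_builtin = df["sales"] / df.groupby("state")["sales"].transform("sum")

print(np.allclose(ratio_lambda, ratio_builtin))
```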
In this article, you can find out how to calculate the percentage of a total in a pandas DataFrame. You can calculate the percentage by using the DataFrame.groupby() method, a process involving one or more of the steps shown below. Below are complete examples that calculate a percentage with groupby on a pandas DataFrame.
# Below are some quick examples.

# Using DataFrame.agg() method.
df2 = df.groupby(['Courses', 'Fee']).agg({'Fee': 'sum'})

# Percentage by lambda and DataFrame.apply() method.
df3 = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))

# Using DataFrame.div() method.
df2 = df.groupby(['Courses', 'Fee']).agg({'Fee': 'sum'})
Courses = df.groupby(['Courses']).agg({'Fee': 'sum'})
df2.div(Courses, level='Courses') * 100

# Using groupby with rename().
df2 = df.groupby(['Courses', 'Fee'])['Fee'].sum().rename("count")

# Using transform().
df['%'] = 100 * df['Fee'] / df.groupby('Courses')['Fee'].transform('sum')

# Alternative: transform() with a lambda function.
df['Courses_Fee'] = df.groupby(['Courses'])['Fee'].transform(lambda x: x / x.sum())

# Calculate with groupby, rename() and transform() with a lambda function.
df2 = df.groupby(['Courses', 'Fee'])['Fee'].sum().rename("Courses_fee").groupby(level=0).transform(lambda x: x / x.sum())
Now, let's create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results.
# Create a pandas DataFrame.
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    # Duration values match the output shown below.
    'Duration': ['30 days', '50 days', '30 days', '60 days', '35 days']
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
   Courses    Fee Duration
0    Spark  22000  30 days
1  PySpark  25000  50 days
2    Spark  23000  30 days
3   Python  24000  60 days
4  PySpark  26000  35 days
# Percentage by lambda and DataFrame.apply() method.
df3 = df2.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
print(df3)
Another method to calculate the percentage of a total with groupby is DataFrame.div(). Here the level argument tells div to align the two DataFrames on the values in the Courses level of the index.
# Using DataFrame.div() method.
df2 = df.groupby(['Courses', 'Fee']).agg({'Fee': 'sum'})
Courses = df.groupby(['Courses']).agg({'Fee': 'sum'})
df3 = df2.div(Courses, level='Courses') * 100
print(df3)
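A sketch of the same div() pattern that also checks the result. It groups by Duration as the second key instead of Fee (an assumption made here to keep the grouping key separate from the summed column), and the variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Duration": ["30 days", "50 days", "30 days", "60 days", "35 days"],
})

# Fee summed per (Courses, Duration) pair, and per course.
per_pair = df.groupby(["Courses", "Duration"]).agg({"Fee": "sum"})
per_course = df.groupby(["Courses"]).agg({"Fee": "sum"})

# level='Courses' matches each pair to the total for its course.
pct = per_pair.div(per_course, level="Courses") * 100

# The percentages within each course should add up to 100.
check = pct.groupby(level="Courses")["Fee"].sum()

print(pct)
print(check)
```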
The percentage of a column in pandas is computed with the sum() function. Let's see how to get the percentage of a column in a pandas DataFrame with an example.
First let’s create a dataframe.
import pandas as pd
import numpy as np
#Create a DataFrame
df1 = {
'Name': ['George', 'Andrea', 'micheal', 'maggie', 'Ravi', 'Xien', 'Jalpa'],
'Mathematics_score': [62, 47, 55, 74, 32, 77, 86]
}
df1 = pd.DataFrame(df1, columns = ['Name', 'Mathematics_score'])
print(df1)
The percentage of a column in a pandas DataFrame is computed using the sum() function and stored in a new column named percentage, as shown below.
df1['percentage'] = (df1['Mathematics_score'] / df1['Mathematics_score'].sum()) * 100
print(df1)
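For display it is often convenient to round the result. The sketch below reuses the scores above, scales the share to the 0 to 100 range, and rounds to two decimals; the choice of two decimals is an assumption, not something the example requires. After rounding, the column may no longer add up to exactly 100.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["George", "Andrea", "micheal", "maggie", "Ravi", "Xien", "Jalpa"],
    "Mathematics_score": [62, 47, 55, 74, 32, 77, 86],
})

# Column total used as the denominator.
total = df1["Mathematics_score"].sum()

# Share of the total, scaled to 0-100 and rounded for display.
df1["percentage"] = (df1["Mathematics_score"] / total * 100).round(2)

print(df1)
```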
pct_change() returns the percentage change between the current and a prior element. It computes the percentage change from the immediately previous row by default, which is useful for comparing the percentage of change in a time series of elements. The examples below show the percentage change in a Series, including one where NAs are filled by carrying the last valid observation forward to the next valid one. Another documented example, the percentage of change in GOOG and APPL stock volume, shows computing the percentage change between columns.
>>> s = pd.Series([90, 91, 85])
>>> s
0    90
1    91
2    85
dtype: int64
>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0         NaN
1         NaN
2   -0.055556
dtype: float64
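The numbers above can be reproduced by hand: each element is divided by the element `periods` rows earlier, and 1 is subtracted. A sketch that compares pct_change() with that manual calculation (the names `manual_1` and `manual_2` are hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([90, 91, 85])

# Manual equivalent: current value over the value `periods` rows back, minus 1.
manual_1 = s / s.shift(1) - 1
manual_2 = s / s.shift(2) - 1

# equal_nan=True treats the leading NaN entries as matching.
print(np.allclose(s.pct_change(), manual_1, equal_nan=True))
print(np.allclose(s.pct_change(periods=2), manual_2, equal_nan=True))
```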
>>> s = pd.Series([90, 91, None, 85])
>>> s
0    90.0
1    91.0
2     NaN
3    85.0
dtype: float64
>>> s.pct_change(fill_method='ffill')
0         NaN
1    0.011111
2    0.000000
3   -0.065934
dtype: float64
>>> df = pd.DataFrame({
...     'FR': [4.0405, 4.0963, 4.3149],
...     'GR': [1.7246, 1.7482, 1.8519],
...     'IT': [804.74, 810.01, 860.13]},
...     index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
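On a DataFrame, pct_change() works down the rows by default, and passing axis='columns' compares each column with the one to its left. A sketch using the frame above; the column-wise call is shown only to illustrate the mechanics, since comparing these three columns with each other has no particular meaning. No output is listed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "FR": [4.0405, 4.0963, 4.3149],
        "GR": [1.7246, 1.7482, 1.8519],
        "IT": [804.74, 810.01, 860.13],
    },
    index=["1980-01-01", "1980-02-01", "1980-03-01"],
)

# Change from the previous row, per column; the first row is NaN.
by_row = df.pct_change()

# Change from the previous column, per row; the first column is NaN.
by_col = df.pct_change(axis="columns")

print(by_row)
print(by_col)
```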