percentage of sum in dataframe pandas

A percentage is calculated by dividing a value by the sum of all the values and then multiplying the result by 100. The same applies to pandas DataFrames. Here, the built-in sum() method of a pandas Series is used to compute the sum of all the values in a column.

Formula: 

df['percent'] = (df['column_name'] / df['column_name'].sum()) * 100
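As a minimal sketch of the formula above, the following uses a small made-up DataFrame; the column names `product` and `sales` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: three products and their sales.
df = pd.DataFrame({"product": ["a", "b", "c"], "sales": [20, 30, 50]})

# Divide each value by the column total, then scale to a percentage.
df["percent"] = df["sales"] / df["sales"].sum() * 100

print(df)
```

Because every value is divided by the same total, the new column should add up to 100 (up to floating-point rounding).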

Suggestion : 2

This answer by caner using transform looks much better than my original answer!

df['sales'] / df.groupby('state')['sales'].transform('sum')

Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
   'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
   'office_id': list(range(1, 7)) * 2,
   'sales': [np.random.randint(100000, 999999)
      for _ in range(12)
   ]
})
state_office = df.groupby(['state', 'office_id']).agg({
   'sales': 'sum'
})
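# Change: group state_office by state and divide each group by its sum.
# group_keys = False keeps the original (state, office_id) index in recent pandas versions.
state_pcts = state_office.groupby(level = 0, group_keys = False).apply(lambda x:
   100 * x / x.sum())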

Returns:

                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
   'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
   'office_id': list(range(1, 7)) * 2,
   'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

state_office = df.groupby(['state', 'office_id']).agg({
   'sales': 'sum'
})
state = df.groupby(['state']).agg({
   'sales': 'sum'
})
state_office.div(state, level = 'state') * 100

                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

Using transform, the solution is a one-liner:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

print(df.sort_values(['state', 'office_id']).reset_index(drop = True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509
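To see why the transform approach works, here is a small sketch on made-up numbers (the five rows below are assumptions, not the random data above). transform('sum') returns one value per original row, namely the total of that row's group, so the division lines up row by row.

```python
import pandas as pd

# Hypothetical data, small enough to check by hand.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA", "WA"],
    "office_id": [1, 2, 1, 2, 3],
    "sales": [100, 300, 50, 50, 100],
})

# One total per row: 400 for each CA row, 200 for each WA row.
state_total = df.groupby("state")["sales"].transform("sum")
df["%"] = 100 * df["sales"] / state_total

# Each state's shares should add up to 100.
print(df)
print(df.groupby("state")["%"].sum())
```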

For conciseness I'd use the SeriesGroupBy:

In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level = 0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64
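The division above relies on pandas aligning the two Series on the shared state level. A sketch that spells this alignment out with Series.div and its level argument, on made-up numbers (the data and the names `state_totals` and `pct` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, small enough to check by hand.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [30, 70, 40, 160],
})

c = df.groupby(["state", "office_id"])["sales"].sum().rename("count")

# One total per state, indexed by state only.
state_totals = c.groupby(level="state").sum()

# level="state" tells pandas to broadcast the totals across office_id.
pct = c.div(state_totals, level="state") * 100
print(pct)
```

Multiplying by 100 here turns the fractions into percentages; the session above leaves them as fractions.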

For multiple groups you have to use transform (using Radical's df):

In [21]: c = df.groupby(["Group 1", "Group 2", "Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level = [0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
                                   ...
Name: count, dtype: float64
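Radical's DataFrame is not reproduced here, so the sketch below uses made-up rows with the same column names to show the idea. transform("sum") keeps the full three-level index, so the denominator has exactly the same shape as the numerator.

```python
import pandas as pd

# Hypothetical stand-in for Radical's df; only the column names are taken from the text.
df = pd.DataFrame({
    "Group 1": ["A", "A", "A", "A", "B", "B"],
    "Group 2": ["x", "x", "y", "y", "x", "x"],
    "Final Group": ["p", "q", "p", "q", "p", "q"],
    "Numbers I want as percents": [1, 3, 2, 2, 5, 15],
})

c = (
    df.groupby(["Group 1", "Group 2", "Final Group"])["Numbers I want as percents"]
    .sum()
    .rename("count")
)

# Total per (Group 1, Group 2) pair, repeated for every row of that pair.
pair_total = c.groupby(level=[0, 1]).transform("sum")
share = c / pair_total
print(share)
```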

I think this needs benchmarking. Using OP's original DataFrame,

df = pd.DataFrame({
   'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
   'office_id': list(range(1, 7)) * 2,
   'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level = 0).sum()

2nd Paul H

state_office = df.groupby(['state', 'office_id']).agg({
   'sales': 'sum'
})
state = df.groupby(['state']).agg({
   'sales': 'sum'
})
state_office.div(state, level = 'state') * 100

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

import string

import numpy as np
import pandas as pd
np.random.seed(0)

letters = list(string.ascii_lowercase)

# 30,000 random three-letter state codes
groups = [
   ''.join(i) for i in zip(
      np.random.choice(letters, 30000),
      np.random.choice(letters, 30000),
      np.random.choice(letters, 30000),
   )
]

df = pd.DataFrame({
   'state': groups * 400,
   'office_id': list(range(1, 601)) * 20000,
   'sales': [np.random.randint(100000, 999999) for _ in range(12)] * 1000000
})
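One way such a comparison could be timed is sketched below with the standard timeit module. The frame is deliberately much smaller than the 12,000,000 rows above so it runs quickly, and the function names `andy` and `paul`, the row count, and the number of repetitions are all assumptions for illustration; no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # far smaller than the frame above, to keep the run short
df = pd.DataFrame({
    "state": rng.integers(0, 500, n),
    "office_id": rng.integers(1, 601, n),
    "sales": rng.integers(100_000, 999_999, n),
})

def andy(df):
    # SeriesGroupBy approach: returns fractions
    c = df.groupby(["state", "office_id"])["sales"].sum().rename("count")
    return c / c.groupby(level=0).sum()

def paul(df):
    # Two groupby objects plus div: returns percentages
    state_office = df.groupby(["state", "office_id"]).agg({"sales": "sum"})
    state = df.groupby(["state"]).agg({"sales": "sum"})
    return state_office.div(state, level="state") * 100

for fn in (andy, paul):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```

Results will vary with the pandas version, the hardware, and the number of groups, so it is worth rerunning on data shaped like your own.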

The following snippet fulfills these criteria:

df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x / x.sum())

Suggestion : 3

Below are complete examples that calculate percentages with groupby on a pandas DataFrame. In this article, you can find out how to calculate the percentage of a total in a pandas DataFrame. You can calculate the percentage with the DataFrame.groupby() method, combined with agg(), apply(), div(), or transform().

# Below are some quick examples.
# Using DataFrame.agg() Method.
df2 = df.groupby(['Courses', 'Fee']).agg({
   'Fee': 'sum'
})

# Percentage by lambda and DataFrame.apply() method.
# group_keys = False keeps the original index in recent pandas versions.
df3 = df2.groupby(level = 0, group_keys = False).apply(lambda x: 100 * x / x.sum())

# Using DataFrame.div() method.
df2 = df.groupby(['Courses', 'Fee']).agg({
   'Fee': 'sum'
})
Courses = df.groupby(['Courses']).agg({
   'Fee': 'sum'
})
df2.div(Courses, level = 'Courses') * 100

# Using groupby with DataFrame.rename() Method.
df2 = df.groupby(['Courses', 'Fee'])['Fee'].sum().rename("count")

# Using DataFrame.transform() method.
df['%'] = 100 * df['Fee'] / df.groupby('Courses')['Fee'].transform('sum')

# Alternative method of DataFrame.transform() by lambda functions.
df['Courses_Fee'] = df.groupby(['Courses'])['Fee'].transform(lambda x: x / x.sum())

# Calculate with groupby, rename() and transform() with a lambda function.
df2 = df.groupby(['Courses', 'Fee'])['Fee'].sum().rename("Courses_fee").groupby(level = 0).transform(lambda x: x / x.sum())

Now, let's create a pandas DataFrame with a few rows and columns, run these examples, and check the results that calculate the percentage of the total.

# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
   'Fee': [22000, 25000, 23000, 24000, 26000],
   'Duration': ['30 days', '50 days', '30 days', '60 days', '35 days']
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.

   Courses    Fee Duration
0    Spark  22000  30 days
1  PySpark  25000  50 days
2    Spark  23000  30 days
3   Python  24000  60 days
4  PySpark  26000  35 days
# Percentage by lambda and DataFrame.apply() method.
# group_keys = False keeps the original index in recent pandas versions.
df3 = df2.groupby(level = 0, group_keys = False).apply(lambda x: 100 * x / x.sum())
print(df3)

Another way to calculate the percentage of a group total is the DataFrame.div() method. Here, level = 'Courses' tells pandas to align the two DataFrames on the values in the Courses level of the index.

# Using DataFrame.div() method.
df2 = df.groupby(['Courses', 'Fee']).agg({
   'Fee': 'sum'
})
Courses = df.groupby(['Courses']).agg({
   'Fee': 'sum'
})
df3 = df2.div(Courses, level = 'Courses') * 100
print(df3)
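The level-based alignment can be checked on a small variation of the data above. In this sketch the `Batch` column is a hypothetical addition (it is not in the original example) that serves as the second grouping key; the final line confirms that the shares within each course add up to 100.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Batch": ["b1", "b1", "b2", "b1", "b2"],  # hypothetical column, for illustration only
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Fee per (course, batch) and fee per course.
by_batch = df.groupby(["Courses", "Batch"]).agg({"Fee": "sum"})
by_course = df.groupby(["Courses"]).agg({"Fee": "sum"})

# Each (course, batch) row is divided by the total of its course.
pct = by_batch.div(by_course, level="Courses") * 100
print(pct)

# Sanity check: shares within a course sum to 100.
print(pct.groupby(level="Courses")["Fee"].sum())
```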

Suggestion : 4

The percentage of a column in pandas is computed with the sum() function. Let’s see how to get the percentage of a column in a pandas DataFrame with an example.

First let’s create a dataframe.

import pandas as pd
import numpy as np

#Create a DataFrame
df1 = {
   'Name': ['George', 'Andrea', 'micheal', 'maggie', 'Ravi', 'Xien', 'Jalpa'],
   'Mathematics_score': [62, 47, 55, 74, 32, 77, 86]
}

df1 = pd.DataFrame(df1, columns = ['Name', 'Mathematics_score'])
print(df1)

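Each value's share of the column total is computed using the sum() function and stored in a new column named percentage, as shown below. Note that the result is a fraction between 0 and 1; multiply it by 100 to express it as a percentage.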

df1['percentage'] = df1['Mathematics_score'] / df1['Mathematics_score'].sum()
print(df1)

Suggestion : 5

Percentage change between the current and a prior element. pct_change() computes the percentage change from the immediately previous row by default, which is useful for comparing the percentage of change in a time series of elements. The examples cover the percentage change in a Series, including one where NAs are filled forward with the last valid observation, and the percentage of change in GOOG and APPL stock volume, which shows computing the percentage change between columns.

>>> s = pd.Series([90, 91, 85])
>>> s
0    90
1    91
2    85
dtype: int64
>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64
>>> s.pct_change(periods = 2)
0         NaN
1         NaN
2   -0.055556
dtype: float64
>>> s = pd.Series([90, 91, None, 85])
>>> s
0    90.0
1    91.0
2     NaN
3    85.0
dtype: float64
>>> s.pct_change(fill_method = 'ffill')
0         NaN
1    0.011111
2    0.000000
3   -0.065934
dtype: float64
>>> df = pd.DataFrame({
...     'FR': [4.0405, 4.0963, 4.3149],
...     'GR': [1.7246, 1.7482, 1.8519],
...     'IT': [804.74, 810.01, 860.13]},
...     index = ['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
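Percentage change is a different quantity from percentage of a sum, and the two are easy to confuse. The sketch below reuses the first Series from the session above and compares pct_change() with a manual computation against the previous element; the names `manual` and `share` are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([90, 91, 85])

# pct_change() compares each element with the one before it:
# (current - previous) / previous. The first element has no predecessor, so it is NaN.
manual = (s - s.shift(1)) / s.shift(1)
print(pd.DataFrame({"pct_change": s.pct_change(), "manual": manual}))

# By contrast, the share of each value in the total of the Series, as a percentage.
share = s / s.sum() * 100
print(share)
```

Note that pct_change() returns fractions (0.011111 means about 1.1%), while `share` here is already scaled by 100.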