pandas groupby+transform on 50 million rows is taking 3 hours


Note for @Uri Goren. This is a constant-memory process and only has 1 group in memory at a time. This will scale linearly with the number of groups. Sorting is also unnecessary.
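The session below assumes the usual imports; the bare DataFrame name in In [24] comes from the last line:

import numpy as np
import pandas as pd
from pandas import DataFrame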

In [20]: np.random.seed(1234)

In [21]: ngroups = 1000

In [22]: nrows = 50000000

In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)

In [24]:  df = DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                 'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                 'values' : np.random.randn(nrows) })


In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account    int64
date       datetime64[ns]
values     float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB

In [26]: df.head()
Out[26]: 
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950

In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
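One small, version-dependent tweak worth trying: groupby sorts the group keys by default, and passing sort=False can shave some time when sorted output is not needed. A minimal sketch on the same df:

# sort=False skips sorting the group keys; groups come out in
# first-appearance order instead of sorted key order.
agg = df.groupby(['account', 'date'], sort=False)['values'].sum()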

If you want to transform the output, then do it like this:

In [37]: g = df.groupby(['account', 'date'])['values']

In [38]: result = 100 * df['values'] / g.transform('sum')

In [41]: result.head()
Out[41]:
0     4.688957
1    -2.340621
2   -80.042089
3   -13.813078
4   -70.857014
dtype: float64

In [43]: len(result)
Out[43]: 50000000

In [42]: %timeit 100 * df['values'] / g.transform('sum')
1 loops, best of 3: 30.9 s per loop
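If transform itself is the slow part on your pandas version, an equivalent result can be computed by aggregating once and broadcasting the sums back with a join on the group keys. This is a sketch, not a guaranteed speedup on every version; the group_sum name is just a label chosen here:

# Aggregate once: one sum per (account, date) group.
sums = df.groupby(['account', 'date'])['values'].sum().rename('group_sum')

# Broadcast the per-group sums back onto the original rows by joining
# the Series (indexed by the group keys) against those key columns.
joined = df.join(sums, on=['account', 'date'])
result = 100 * joined['values'] / joined['group_sum']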

I would use a different approach. First, sort:

MyDataFrame.sort_values(['account', 'month'], inplace=True)

Then iterate and sum

(account, month) = ('', '')  # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']
    else:
        if account != '':               # skip the initial sentinel
            res.append([account, month, salary])
        salary = row['salary']          # start the new group's running sum
        (account, month) = (row['account'], row['month'])
if account != '':
    res.append([account, month, salary])  # flush the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
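For reference, the whole loop above is equivalent to a single vectorized groupby, which avoids the per-row Python overhead of iterrows:

# One row per (account, month) group, same result as the loop above.
totals = MyDataFrame.groupby(['account', 'month'], as_index=False)['salary'].sum()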

Suggestion : 2

Problem is: the operation on 50 million such rows is taking 3 hours. I executed the groupby separately and it is fast, taking only 5 seconds; I think it is transform that is taking a long time here. Is there any way to improve performance?

As described in the book, transform is an operation used in conjunction with groupby (which is one of the most useful operations in pandas). I suspect most pandas users have likely used aggregate, filter or apply with groupby to summarize data. However, transform is a little more difficult to understand, especially coming from an Excel world. A minimal sketch follows the sample data below.


account   month  Salary
      1  201501   10000
      2  201506   20000
      2  201506   20000
      3  201508   30000
      3  201508   30000
      3  201506   10000
      3  201506   10000
      3  201506   10000
      3  201506   10000

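A minimal sketch on the sample data above (the pct_of_group name is just a label chosen here): aggregate returns one row per (account, month) group, while transform returns a column aligned to the original rows, which is what lets each row be divided by its own group total:

import pandas as pd

sample = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

g = sample.groupby(['account', 'month'])['Salary']

# aggregate: one row per (account, month) group.
print(g.sum())

# transform: one value per original row, aligned for element-wise division.
sample['pct_of_group'] = 100 * sample['Salary'] / g.transform('sum')
print(sample)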

Suggestion : 3

Related pandas cookbook recipes: selecting rows with data closest to a certain value using argsort; creating a list of DataFrames split by a delineation based on logic included in rows; de-duplicating a large store by chunks, essentially a recursive reduction operation (reading data from a CSV file and building a store chunk by chunk, with date parsing as well); using searchsorted to merge based on values inside a range. The if-then idioms below show vectorized alternatives to row-wise loops:

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "AAA": [4, 5, 6, 7],
   ...:         "BBB": [10, 20, 30, 40],
   ...:         "CCC": [100, 50, -30, -50],
   ...:     }
   ...: )

In [2]: df
Out[2]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [3]: df.loc[df.AAA >= 5, "BBB"] = -1

In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50

In [5]: df.loc[df.AAA >= 5, ["BBB", "CCC"]] = 555

In [6]: df
Out[6]:
   AAA  BBB  CCC
0    4   10  100
1    5  555  555
2    6  555  555
3    7  555  555

In [7]: df.loc[df.AAA < 5, ["BBB", "CCC"]] = 2000

In [8]: df
Out[8]:
   AAA   BBB   CCC
0    4  2000  2000
1    5   555   555
2    6   555   555
3    7   555   555
In [9]: df_mask = pd.DataFrame(
   ...:     {
   ...:         "AAA": [True] * 4,
   ...:         "BBB": [False] * 4,
   ...:         "CCC": [True, False] * 2,
   ...:     }
   ...: )

In [10]: df.where(df_mask, -1000)
Out[10]:
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000
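For completeness, DataFrame.mask is the complement of where: it replaces values where the condition is True rather than where it is False. A one-line sketch on the same df and df_mask:

# mask() replaces where the mask IS True; where() replaces where it is False.
df.mask(df_mask, -1000)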
In [11]: df = pd.DataFrame(
   ....:     {
   ....:         "AAA": [4, 5, 6, 7],
   ....:         "BBB": [10, 20, 30, 40],
   ....:         "CCC": [100, 50, -30, -50],
   ....:     }
   ....: )

In [12]: df
Out[12]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

In [13]: df["logic"] = np.where(df["AAA"] > 5, "high", "low")

In [14]: df
Out[14]:
   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   30  -30  high
3    7   40  -50  high
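All of these idioms make a single vectorized pass over the column, which is the same reason the groupby/transform answers above beat the iterrows loop. As an illustrative, hypothetical application to the 50-million-row frame built in the first answer:

import numpy as np

# Hypothetical: 'big' stands in for the 50M-row df from the first answer.
# One vectorized pass labels every row, instead of looping row by row.
big['sign'] = np.where(big['values'] >= 0, 'pos', 'neg')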