Note for @Uri Goren. This is a constant-memory process with only one group in memory at a time, so it scales linearly with the number of groups. Sorting is also unnecessary.
In [20]: np.random.seed(1234)
In [21]: ngroups = 1000
In [22]: nrows = 50000000
In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)
In [24]: df = pd.DataFrame({'account': np.random.randint(0, ngroups, size=nrows),
                            'date': dates.take(np.random.randint(0, ngroups, size=nrows)),
                            'values': np.random.randn(nrows)})
In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account int64
date datetime64[ns]
values float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB
In [26]: df.head()
Out[26]:
account date values
0 815 2048-02-01 -0.412587
1 723 2023-01-01 -0.098131
2 294 2020-11-01 -2.899752
3 53 2058-02-01 -0.469925
4 204 2080-11-01 1.389950
In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
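The point above about holding only one group in memory at a time can be made concrete by iterating the GroupBy object instead of calling .sum() on the whole thing. This is just an illustrative sketch of the idea (not the code being timed here), using the df built above:
# Iterate the groups one at a time; each iteration materializes only the
# current group's rows, so peak extra memory stays around one group's size.
totals = {}
for key, grp in df.groupby(['account', 'date']):
    totals[key] = grp['values'].sum()   # key is the (account, date) tuple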
If you want to transform the output, then do it like this:
In [37]: g = df.groupby(['account', 'date'])['values']
In [38]: result = 100 * df['values'] / g.transform('sum')
In [41]: result.head()
Out[41]:
0     4.688957
1    -2.340621
2   -80.042089
3   -13.813078
4   -70.857014
dtype: float64
In [43]: len(result)
Out[43]: 50000000
In [42]: %timeit 100 * df['values'] / g.transform('sum')
1 loops, best of 3: 30.9 s per loop
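If transform is unfamiliar: it returns a Series the same length as the original frame, with each group's aggregate broadcast back to that group's rows, which is why the division above works row-wise. A tiny toy sketch (mine, not part of the timed session):
import pandas as pd

toy = pd.DataFrame({'account': [1, 1, 2],
                    'values': [10.0, 30.0, 5.0]})
g = toy.groupby('account')['values']
# account 1 sums to 40.0 and account 2 to 5.0; those sums are aligned back
# to the original three rows before the division.
print(g.transform('sum').tolist())                           # [40.0, 40.0, 5.0]
print((100 * toy['values'] / g.transform('sum')).tolist())   # [25.0, 75.0, 100.0]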
I would use a different approach. First, sort:
MyDataFrame.sort_values(['account', 'month'], inplace=True)
Then iterate and sum:
(account, month) = (None, None)  # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']              # still inside the current group
    else:
        if account is not None:              # emit the group that just ended
            res.append([account, month, salary])
        salary = row['salary']               # start the new group with this row
        (account, month) = (row['account'], row['month'])
if account is not None:                      # emit the final group
    res.append([account, month, salary])
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
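If you do go the sorted, one-pass route, itertuples is usually much faster than iterrows because it avoids constructing a Series for every row. A hedged sketch of the same accumulation (assuming the account/month/salary columns used above):
import pandas as pd

def grouped_salary(MyDataFrame):
    # Same one-group-at-a-time accumulation as above, but with itertuples.
    MyDataFrame = MyDataFrame.sort_values(['account', 'month'])
    res = []
    key, salary = None, 0.0
    for row in MyDataFrame.itertuples(index=False):
        if (row.account, row.month) == key:
            salary += row.salary                  # still in the current group
        else:
            if key is not None:
                res.append([*key, salary])        # emit the finished group
            key, salary = (row.account, row.month), row.salary
    if key is not None:
        res.append([*key, salary])                # emit the final group
    return pd.DataFrame(res, columns=['account', 'month', 'salary'])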
For context, the original problem: a groupby + transform on 50 million such rows was taking 3 hours. Running the groupby by itself is fast, about 5 seconds; it is the transform that takes the time. Is there any way to improve performance?
As described in the book, transform is an operation used in conjunction with groupby (one of the most useful operations in pandas). Most pandas users have likely used aggregate, filter or apply with groupby to summarize data; transform is a little more difficult to understand, especially coming from an Excel world.
The sample data from the question:
account  month   Salary
1        201501  10000
2        201506  20000
2        201506  20000
3        201508  30000
3        201508  30000
3        201506  10000
3        201506  10000
3        201506  10000
3        201506  10000
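A small, hypothetical reconstruction of that sample (values read off the table above), wired up to the transform approach from the first answer, would look like this:
import pandas as pd

df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})
# Each row's share of its (account, month) total; e.g. the two rows of
# account 2 / month 201506 each come out to 50.0.
df['pct'] = 100 * df['salary'] / df.groupby(['account', 'month'])['salary'].transform('sum')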
In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "AAA": [4, 5, 6, 7],
   ...:         "BBB": [10, 20, 30, 40],
   ...:         "CCC": [100, 50, -30, -50],
   ...:     }
   ...: )
   ...:
In [2]: df
Out[2]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
In [3]: df.loc[df.AAA >= 5, "BBB"] = -1
In [4]: df
Out[4]:
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50
In [5]: df.loc[df.AAA >= 5, ["BBB", "CCC"]] = 555
In [6]: df
Out[6]:
   AAA  BBB  CCC
0    4   10  100
1    5  555  555
2    6  555  555
3    7  555  555
In [7]: df.loc[df.AAA < 5, ["BBB", "CCC"]] = 2000
In [8]: df
Out[8]:
   AAA   BBB   CCC
0    4  2000  2000
1    5   555   555
2    6   555   555
3    7   555   555
In [9]: df_mask = pd.DataFrame(
   ...:     {
   ...:         "AAA": [True] * 4,
   ...:         "BBB": [False] * 4,
   ...:         "CCC": [True, False] * 2,
   ...:     }
   ...: )
   ...:
In [10]: df.where(df_mask, -1000)
Out[10]:
   AAA   BBB   CCC
0    4 -1000  2000
1    5 -1000 -1000
2    6 -1000   555
3    7 -1000 -1000
In [11]: df = pd.DataFrame(
   ....:     {
   ....:         "AAA": [4, 5, 6, 7],
   ....:         "BBB": [10, 20, 30, 40],
   ....:         "CCC": [100, 50, -30, -50],
   ....:     }
   ....: )
   ....:
In [12]: df
Out[12]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
In [13]: df["logic"] = np.where(df["AAA"] > 5, "high", "low")
In [14]: df
Out[14]:
   AAA  BBB  CCC logic
0    4   10  100   low
1    5   20   50   low
2    6   30  -30  high
3    7   40  -50  high
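If you need more than two buckets, np.select is the usual generalization of the np.where call above; a brief sketch on the same frame (the extra threshold is just for illustration):
import numpy as np

conditions = [df["AAA"] > 6, df["AAA"] > 4]
choices = ["high", "medium"]
# Conditions are checked in order; rows matching none fall back to the default.
df["logic"] = np.select(conditions, choices, default="low")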