pandas fill forward performance issue

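According to jreback's answer, ffill() is not optimized when called on a groupby, but cumsum() is. Try this instead: forward-fill the whole frame at once, then blank out every value that has no valid value before it within its own group.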

df = df.sort_index()
df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(
    lambda x: None if x == 0 else 1
)

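Utility function (credit to @Phun):

from typing import List

import pandas as pd


def ffill_se(df: pd.DataFrame, group_cols: List[str]) -> pd.DataFrame:
    # Note: the GROUP column and the new index are applied to the caller's frame in place.
    df['GROUP'] = df.groupby(group_cols).ngroup()  # one integer id per group
    df.set_index(['GROUP'], inplace=True)
    # A stable sort keeps the original row order inside each group.
    df.sort_index(inplace=True, kind='mergesort')
    # Fill everything, then blank cells with no earlier valid value in their group.
    df = df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(
        lambda x: None if x == 0 else 1
    )
    df.reset_index(inplace=True, drop=True)
    return df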
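
A minimal sketch of the same idea on toy data, to show why the masking step is needed. The frame, the group keys and the column name `x` are made up for illustration, and the mask is applied with `DataFrame.where` rather than the multiply-by-`None` step above, which is a substitution and not the original answer's code. The result is compared with the plain `groupby(...).ffill()`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the index is the group key, already sorted.
df = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan, np.nan, 5.0, np.nan]},
    index=["a", "a", "a", "b", "b", "b"],
)

# Step 1: forward-fill the whole frame, ignoring group boundaries.
# The first row of group "b" wrongly receives 1.0 from group "a".
filled = df.ffill()

# Step 2: per group, count how many valid values have been seen so far.
# A count of 0 means no valid value precedes the cell inside its own group.
seen = df.notna().astype(int).groupby(level=0).cumsum()

# Step 3: keep only the cells that have a valid predecessor in their group.
fast = filled.where(seen > 0)

# Reference result from the straightforward grouped fill.
slow = df.groupby(level=0).ffill()

print(fast)
print(slow)
```

Whether this is still faster than `groupby(...).ffill()` depends on the pandas version, so it is worth timing both on your own data before adopting it.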

Suggestion : 2

All you are looking for is a join. Since the two DataFrames have no column in common, create a column with the same constant value in both, merge on it, and drop it afterwards.

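df['common'] = 1
df1['common'] = 1

# Every row matches every row on the constant key, giving the Cartesian product.
df2 = pd.merge(df, df1, on=['common'], how='outer')

# Drop the helper column from the merged result.
df2 = df2.drop('common', axis=1)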
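
A small sketch of the dummy-key approach on two tiny frames. The names `left`, `right`, `letter` and `number` are invented for illustration. pandas 1.2 and later also accept `how="cross"` in `merge`, which should give the same Cartesian product without a helper column, so the sketch shows both.

```python
import pandas as pd

# Hypothetical inputs with no column in common.
left = pd.DataFrame({"letter": ["a", "b"]})
right = pd.DataFrame({"number": [1, 2, 3]})

# Dummy-key approach: add the same constant column to both frames,
# join on it, then drop it. assign() leaves the input frames untouched.
via_key = pd.merge(
    left.assign(common=1),
    right.assign(common=1),
    on="common",
).drop(columns="common")

# Built-in cross join (assumes pandas >= 1.2).
via_cross = pd.merge(left, right, how="cross")

print(via_key)
print(via_cross)
```

Both results have `len(left) * len(right)` rows, so this kind of join grows quickly on large inputs.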

Suggestion : 3

Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward to fill gaps (for example with ffill() or bfill()). interpolate() fills NaN values in a forward direction by default; use its limit_direction parameter to fill backward or from both directions. Like other pandas fill methods, interpolate() accepts a limit keyword argument, which caps the number of consecutive NaN values filled since the last valid observation.
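
The fill options described here can be sketched on a short Series. The data is made up for illustration; the methods and keyword arguments are the standard pandas ones named above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of three missing values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()        # last valid value carried forward: 1, 1, 1, 1, 5
backward = s.bfill()       # next valid value carried backward: 1, 5, 5, 5, 5
capped = s.ffill(limit=1)  # at most one consecutive NaN filled: 1, 1, NaN, NaN, 5

# interpolate() is linear by default and fills forward from the last valid value.
interp_fwd = s.interpolate(limit=1)  # 1, 2, NaN, NaN, 5

# limit_direction="both" fills up to `limit` values from each side of the gap.
interp_both = s.interpolate(limit=1, limit_direction="both")  # 1, 2, NaN, 4, 5

print(interp_both)
```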

In [1]: df = pd.DataFrame(
   ...:     np.random.randn(5, 3),
   ...:     index=["a", "c", "e", "f", "h"],
   ...:     columns=["one", "two", "three"],
   ...: )
   ...:

In [2]: df["four"] = "bar"

In [3]: df["five"] = df["one"] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

In [5]: df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

In [7]: df2["one"]
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2["one"])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2["four"].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [10]: df2.isna()
Out[10]:
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g   True   True   True   True   True
h  False  False  False  False  False

In [11]: None == None  # noqa: E711
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

In [13]: df2["one"] == np.nan
Out[13]:
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64

In [15]: df2 = df.copy()

In [16]: df2["timestamp"] = pd.Timestamp("20120101")

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64