According to jreback's answer, ffill() is not optimized when run inside a groupby, but cumsum() is. Try this:

df = df.sort_index()
df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(
    lambda x: None if x == 0 else 1
)
Utility function (credit to @Phun):

from typing import List
import pandas as pd

def ffill_se(df: pd.DataFrame, group_cols: List[str]) -> pd.DataFrame:
    # Label each group with an integer so we can group on a sorted index
    df['GROUP'] = df.groupby(group_cols).ngroup()
    df.set_index(['GROUP'], inplace=True)
    df.sort_index(inplace=True)
    # Forward-fill globally, then blank out values that precede the
    # first valid observation within each group
    df = df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(
        lambda x: None if x == 0 else 1
    )
    df.reset_index(inplace=True, drop=True)
    return df
Well, all you are looking for is a cross join. But since there is no common column, you can create a column that is identical in both dataframes and then drop it afterwards:

df['common'] = 1
df1['common'] = 1
df2 = pd.merge(df, df1, on=['common'], how='outer')
df2 = df2.drop('common', axis=1)
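In pandas 1.2 and later, the dummy-column step is unnecessary: merge supports how="cross" directly. A minimal sketch with made-up frames:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df1 = pd.DataFrame({"b": ["x", "y"]})

# Cartesian product without a helper key column (pandas >= 1.2)
out = pd.merge(df, df1, how="cross")
print(len(out))  # 4 rows: every pairing of df and df1
```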
Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward to fill gaps. By default, NaN values are filled in a forward direction; use the limit_direction parameter to fill backward or from both directions. Like other pandas fill methods, interpolate() accepts a limit keyword argument. Use this argument to limit the number of consecutive NaN values filled since the last valid observation.
In [1]: df = pd.DataFrame(
   ...:     np.random.randn(5, 3),
   ...:     index=["a", "c", "e", "f", "h"],
   ...:     columns=["one", "two", "three"],
   ...: )

In [2]: df["four"] = "bar"

In [3]: df["five"] = df["one"] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

In [5]: df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

In [7]: df2["one"]
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64
In [8]: pd.isna(df2["one"])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2["four"].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [10]: df2.isna()
Out[10]:
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g   True   True   True   True   True
h  False  False  False  False  False

In [11]: None == None  # noqa: E711
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

In [13]: df2["one"] == np.nan
Out[13]:
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0       1
1       2
2    <NA>
3       4
dtype: Int64
In [15]: df2 = df.copy()

In [16]: df2["timestamp"] = pd.Timestamp("20120101")

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64