Pandas rolling returns NaN when infinity values are involved


np.inf is explicitly converted to np.NaN in pandas/core/window/rolling.py

# Convert inf to nan for C funcs
inf = np.isinf(values)
if inf.any():
    values = np.where(inf, np.nan, values)

You'd see exactly the same behavior if you had NaN instead of np.inf: windows containing those values fall short of the required number of valid observations (min_periods, which defaults to the window size), so their results are discarded. One clean "hack" is to replace inf with the biggest float you can represent, which should be rather safe when taking a 'min'.

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.inf, 5, 6])
s.replace(np.inf, np.finfo('float64').max).rolling(3).min()

# 0    NaN
# 1    NaN
# 2    1.0
# 3    2.0
# 4    3.0
# 5    5.0
# dtype: float64
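The same trick works in the other direction: for a rolling 'max', replace -inf with the most negative representable float. This is a sketch extrapolated from the answer above, not part of the original answer:

```python
import numpy as np
import pandas as pd

# -inf would normally poison the rolling max the same way +inf poisons min.
s = pd.Series([1, 2, 3, -np.inf, 5, 6])
result = s.replace(-np.inf, np.finfo('float64').min).rolling(3).max()
print(result)
# 0    NaN
# 1    NaN
# 2    3.0
# 3    3.0
# 4    5.0
# 5    6.0
# dtype: float64
```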

Suggestion : 2

When using rolling on a series that contains inf values, the result contains NaN even if the operation is well defined, like min or max. The first thing to notice is that, by default, rolling needs a full window of valid rows to aggregate, where the window size is n. If that condition is not met, it returns NaN for the window. That is what happens at the first rows; at later rows it is because one of the values in the window has become NaN. You can use df.replace() to replace the infinite values with np.nan and then pd.DataFrame.dropna(axis=0) to drop rows (or axis=1 to drop columns) from the resulting DataFrame: the first argument is the list of values you want to replace, and the second is the value to replace them with. Pandas also provides an option to treat infinity as NaN; setting it with pd.set_option() makes the whole pandas module consider infinite values as NaN.


import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.inf, 5, 6])
print(s.rolling(window=3).min())
# Actual output:
0    NaN
1    NaN
2    1.0
3    NaN
4    NaN
5    NaN
dtype: float64

# Expected output:
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    5.0
s.min() # 1.0
# Convert inf to nan for C funcs
inf = np.isinf(values)
if inf.any():
    values = np.where(inf, np.nan, values)
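Given the internal inf-to-NaN conversion quoted above, another way to keep windows from collapsing to NaN is to lower min_periods so a window needs fewer valid observations. This is a sketch, not part of the quoted answers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.inf, 5, 6])

# With min_periods=1, a window needs only one valid (non-NaN) observation,
# so the internally converted inf no longer blanks out whole windows.
result = s.rolling(window=3, min_periods=1).min()
print(result)
# 0    1.0
# 1    1.0
# 2    1.0
# 3    2.0
# 4    3.0
# 5    5.0
# dtype: float64
```

Note that this also changes the leading rows: the first two windows now produce a result instead of NaN.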

Suggestion : 3


By using loc on col, the actual DataFrame is modified in each iteration. The introduction of NaN into the column eventually means the window becomes all NaN. The easiest fix (without understanding more about how the skewness is to be applied) is to create a copy of col to work on:

import numpy as np
import pandas as pd

def _get_skewness(col, q=(0.05, 0.95)):
    copy_col = col.copy()  # Make a copy so as to not overwrite future values.
    if q[0] > 0:
        quantiles = copy_col.quantile(q)
        copy_col.loc[
            (copy_col < quantiles[q[0]]) | (copy_col > quantiles[q[1]])
        ] = np.nan
    skew = copy_col.skew(axis=0, skipna=True)
    return skew

df = pd.DataFrame(np.arange(40).reshape(-1, 2))
df_skew = df.rolling(20, 10).apply(_get_skewness)

df_skew:

      0    1
0   NaN  NaN
1   NaN  NaN
2   NaN  NaN
3   NaN  NaN
4   NaN  NaN
5   NaN  NaN
6   NaN  NaN
7   NaN  NaN
8   NaN  NaN
9   0.0  0.0
10  0.0  0.0
11  0.0  0.0
12  0.0  0.0
13  0.0  0.0
14  0.0  0.0
15  0.0  0.0
16  0.0  0.0
17  0.0  0.0
18  0.0  0.0
19  0.0  0.0

Suggestion : 4

Furthermore, previously, if you ranked inf or -inf values together with NaN values, the calculation would not distinguish NaN from infinity when using the 'top' or 'bottom' argument. This may subtly change the behavior of your code when you're using .assign() to update an existing column: previously, callables referring to other variables being updated would get the "old" values. Inserting missing values into indexes now works for all types of indexes and automatically inserts the correct type of missing value (NaN, NaT, etc.) regardless of the type passed in (GH18295). The previous default behavior of negative indices in Categorical.take is deprecated; in a future version it will change from meaning missing values to meaning positional indices from the right, consistent with Series.take() (GH20664).
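As a quick illustration of the ranking change described above, here is a small sketch (assuming pandas 0.23 or later, with made-up data): NaN and infinity now get distinct ranks.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.inf, np.nan, 1.0, -np.inf])

# With na_option='top', NaN is ranked first, while -inf and inf
# keep their proper numeric order among the remaining values.
ranked = s.rank(na_option='top')
print(ranked)
# 0    4.0
# 1    1.0
# 2    3.0
# 3    2.0
# dtype: float64
```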

In [1]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
   ...:                    'bar': ['a', 'b', 'c', 'd'],
   ...:                    'baz': pd.date_range('2018-01-01', freq='d', periods=4),
   ...:                    'qux': pd.Categorical(['a', 'b', 'c', 'c'])},
   ...:                   index=pd.Index(range(4), name='idx'))

In [2]: df
Out[2]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [3]: df.dtypes
Out[3]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object

In [4]: df.to_json('test.json', orient='table')

In [5]: new_df = pd.read_json('test.json', orient='table')

In [6]: new_df
Out[6]:
     foo bar        baz qux
idx
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

[4 rows x 4 columns]

In [7]: new_df.dtypes
Out[7]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object
In [8]: df.index.name = 'index'

In [9]: df.to_json('test.json', orient='table')

In [10]: new_df = pd.read_json('test.json', orient='table')

In [11]: new_df
Out[11]:
   foo bar        baz qux
0    1   a 2018-01-01   a
1    2   b 2018-01-02   b
2    3   c 2018-01-03   c
3    4   d 2018-01-04   c

[4 rows x 4 columns]

In [12]: new_df.dtypes
Out[12]:
foo             int64
bar            object
baz    datetime64[ns]
qux          category
Length: 4, dtype: object
In [13]: df = pd.DataFrame({'A': [1, 2, 3]})

In [14]: df
Out[14]:
   A
0  1
1  2
2  3

[3 rows x 1 columns]

In [15]: df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
Out[15]:
   A  B  C
0  1  1  2
1  2  2  4
2  3  3  6

[3 rows x 3 columns]
In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
Out[3]:
   A  C
0  2 -1
1  3 -2
2  4 -3

In [16]: df.assign(A=df.A + 1, C=lambda df: df.A * -1)
Out[16]:
   A  C
0  2 -2
1  3 -3
2  4 -4

[3 rows x 2 columns]
In [17]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')

In [18]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
    ...:                      'B': ['B0', 'B1', 'B2', 'B3'],
    ...:                      'key2': ['K0', 'K1', 'K0', 'K1']},
    ...:                     index=left_index)

In [19]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')

In [20]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
    ...:                       'D': ['D0', 'D1', 'D2', 'D3'],
    ...:                       'key2': ['K0', 'K0', 'K0', 'K1']},
    ...:                      index=right_index)

In [21]: left.merge(right, on=['key1', 'key2'])
Out[21]:
       A   B key2   C   D
key1
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3

[3 rows x 5 columns]

Suggestion : 5

You can use with pd.option_context('mode.use_inf_as_na', True): to consider all inf values as NaN within a block of code; in Python, with specifies the scope of the block. If instead you want to consider all inf as NaN throughout a program, use pd.set_option('use_inf_as_na', True), which makes the entire pandas module consider infinite values as NaN. In this article, I will explain how to drop/remove infinite values from a pandas DataFrame: either first replace infinite values with NaN and remove the NaN rows with pandas.DataFrame.dropna(), or use one of the options above to consider all infinite values as NaN and drop them the same way.

import pandas as pd
import numpy as np
technologies = {
   'Courses': ["Spark", "PySpark", "Hadoop", "Python", "pandas", np.inf, "Python", -np.inf],
   'Fee': [22000, 25000, 23000, np.inf, 26000, 25000, -np.inf, 24000],
   'Duration': ['30day', '50days', '55days', '40days', '60days', -np.inf, '55days', np.inf],
   'Discount': [1000, 2300, 1200, np.inf, 2500, -np.inf, 2000, 1500]
}
df = pd.DataFrame(technologies)
print(df)

This yields the output below.

   Courses      Fee Duration  Discount
0    Spark  22000.0    30day    1000.0
1  PySpark  25000.0   50days    2300.0
2   Hadoop  23000.0   55days    1200.0
3   Python      inf   40days       inf
4   pandas  26000.0   60days    2500.0
5      inf  25000.0     -inf      -inf
6   Python     -inf   55days    2000.0
7     -inf  24000.0      inf    1500.0

By using df.replace(), replace the infinite values with NaN, and then use the pandas.DataFrame.dropna() method to remove the rows with NaN/None values. This effectively drops the infinite values from the pandas DataFrame. inplace=True updates the existing DataFrame in place.

# Replace infinite values with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)
# Drop rows with NaN
df.dropna(inplace=True)
print(df)

Note: For older versions, replace use_inf_as_na with use_inf_as_null.

# Change the option context to treat infinite as NaN,
# then drop the rows with NaN or infinite values
with pd.option_context('mode.use_inf_as_na', True):
    df.dropna(inplace=True)
print(df)

Use df.replace() to replace the infinite values with np.nan and pd.DataFrame.dropna(axis=0) to drop the affected rows. This drops all rows containing infinite values from the pandas DataFrame in one chained expression.

# Replace infinite values, then drop the rows that contained them
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)
print(df)

Suggestion : 6

This data analysis with Python and Pandas tutorial covers two topics. First, within the context of machine learning, we need a way to create "labels" for our data; creating labels is essential for the supervised machine learning process, as they are used to "teach" or train the machine the correct answers associated with features. Second, we cover mapping functions and the rolling-apply capability with Pandas. Here, we replace the infinity values with NaN values first. Next, we create a new column containing the future HPI, using a new method: .shift(). This method shifts the column in question; shifting by -1 means shifting down, so the value for the next point is moved back. This is a quick way of having the current value and the next period's value on the same row for easy comparison.

To start, we will have some code like:

import Quandl
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
from statistics import mean

style.use('fivethirtyeight')

housing_data = pd.read_pickle('HPI.pickle')
housing_data = housing_data.pct_change()

Next:

housing_data.replace([np.inf, -np.inf], np.nan, inplace=True)
housing_data['US_HPI_future'] = housing_data['United States'].shift(-1)
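To see what shift(-1) does on its own, here is a tiny standalone illustration with made-up numbers:

```python
import pandas as pd

# shift(-1) pairs each row with the *next* row's value;
# the last row becomes NaN because it has no successor.
s = pd.Series([10.0, 12.0, 11.0])
shifted = s.shift(-1)
print(shifted)
# 0    12.0
# 1    11.0
# 2     NaN
# dtype: float64
```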

Next up, we will have some NaN data from both the percent change application and the shift, so we need to do:

housing_data.dropna(inplace = True)

Here, we're passing the current HPI and the future HPI columns. If the future HPI is higher than the current one, prices went up, and we return a 1; this is going to be our label. If the future HPI is not greater than the current, we return a simple 0. To map this function, we can do something like:

housing_data['label'] = list(map(create_labels, housing_data['United States'], housing_data['US_HPI_future']))
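The create_labels function used above isn't defined in this excerpt; based on the description, it would look roughly like this (a sketch, not the tutorial's verbatim code):

```python
def create_labels(cur_hpi, fut_hpi):
    # Label 1 if the future HPI is higher than the current one, else 0.
    if fut_hpi > cur_hpi:
        return 1
    else:
        return 0
```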

This might look like a confusing one-liner, but it doesn't need to be. It breaks down to:

new_column = list(map(function_to_map, parameter1, parameter2, ...))
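The same pattern, demonstrated with a hypothetical two-argument function and made-up price lists:

```python
prices_now = [100, 105, 103]
prices_next = [105, 103, 110]

def went_up(cur, fut):
    # 1 if the next-period price is higher, else 0
    return 1 if fut > cur else 0

# map() walks both lists in parallel, passing one element from each
# as the two arguments of went_up.
labels = list(map(went_up, prices_now, prices_next))
print(labels)  # [1, 0, 1]
```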