You won't be able to get down to `rolling_max` speed, but you can often shave off an order of magnitude or so by dropping down to NumPy via `.values`:
def meanmax_np(ii, df):
    ii = ii.astype(int)
    n = df["A"].values[ii].max() + df["B"].values[ii].max()
    return n / 2.0
This gives me:

>>> %timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 701 ms per loop
>>> %timeit res_np = pd.rolling_apply(df.ii, 26, lambda x: meanmax_np(x, df))
10 loops, best of 3: 31.2 ms per loop
>>> %timeit res2 = (pd.rolling_max(df['A'], 26) + pd.rolling_max(df['B'], 26)) / 2
1000 loops, best of 3: 247 µs per loop
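Note that the top-level `pd.rolling_apply` and `pd.rolling_max` functions used above were deprecated long ago and are gone from modern pandas; the fastest variant translates to the `.rolling()` method. A minimal sketch, using made-up random data (the column names `A`/`B` and window size 26 mirror the benchmark above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(1000), "B": rng.random(1000)})

# Modern equivalent of pd.rolling_max(df['A'], 26):
# the first 25 rows are NaN because the window is not yet full.
res2 = (df["A"].rolling(26).max() + df["B"].rolling(26).max()) / 2
```

The vectorized `.rolling(...).max()` path stays in compiled code for the whole series, which is why it beats any per-window Python callback.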
This data analysis with Python and Pandas tutorial is going to cover two topics. First, within the context of machine learning, we need a way to create "labels" for our data. Second, we're going to cover mapping functions and the rolling-apply capability with Pandas.

Creating labels is essential for the supervised machine learning process, as it is used to "teach" or train the machine the correct answers that are associated with features.

Here, we will first replace the infinity values with NaN values. Next, we create a new column which contains the future HPI. We can do this with a new method: .shift(). This method shifts the column in question. Shifting by -1 means we're shifting down, so the value for the next point is moved back. This is our quick way of having the current value and the next period's value on the same row for easy comparison.
To start, we will have some code like:
import Quandl
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
from statistics import mean
style.use('fivethirtyeight')
housing_data = pd.read_pickle('HPI.pickle')
housing_data = housing_data.pct_change()
Next:
housing_data.replace([np.inf, -np.inf], np.nan, inplace=True)
housing_data['US_HPI_future'] = housing_data['United States'].shift(-1)
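To make the effect of `.shift(-1)` concrete, here is a tiny sketch on a made-up three-value series (the values are illustrative only):

```python
import pandas as pd

s = pd.Series([100, 105, 103])

# shift(-1) pulls each row's *next* value back onto the current row;
# the last row has no next value, so it becomes NaN.
print(s.shift(-1).tolist())  # [105.0, 103.0, nan]
```

So after the shift, each row of `housing_data` holds both the current HPI and the next period's HPI side by side.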
Next up, we will have some NaN data from both the percent change application and the shift, so we need to do:
housing_data.dropna(inplace=True)
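The mapping below calls a `create_labels` function whose listing is not shown here; based on the description that follows, it would look something like this sketch:

```python
def create_labels(cur_hpi, fut_hpi):
    # Label is 1 when the future HPI is higher than the current HPI,
    # otherwise 0.
    if fut_hpi > cur_hpi:
        return 1
    else:
        return 0
```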
Here, we're obviously passing the current HPI and the future HPI columns. If the future HPI is higher than the current, this means prices went up, and we are going to return a 1. This is going to be our label. If the future HPI is not greater than the current, then we return a simple 0. To map this function, we can do something like:
housing_data['label'] = list(map(create_labels, housing_data['United States'], housing_data['US_HPI_future']))
This might look like a confusing one-liner, but it doesn't need to be. It breaks down to:
new_column = list(map(function_to_map, parameter1, parameter2, ...))
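A toy illustration of that pattern, using a hypothetical two-argument `spread` function and made-up columns: `map` walks the two columns in lockstep, passing one element from each as the function's arguments.

```python
import pandas as pd

def spread(low, high):
    return high - low

df = pd.DataFrame({"low": [1, 4], "high": [3, 9]})

# list(map(function_to_map, parameter1, parameter2)) in action:
df["spread"] = list(map(spread, df["low"], df["high"]))
print(df["spread"].tolist())  # [2, 5]
```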
You don't even need a loop or to split the data into chunks to apply a rolling function. It's a lot simpler than that.
def add_last_n_days_avg_with_days_at_index(df, label_col='label', count_of_days=7, round_to=0):
    new_label_col_name = label_col + '_' + str(count_of_days) + 'D'
    # create a new column, apply mean and round
    df[new_label_col_name] = df[label_col].rolling(count_of_days).mean().round(round_to)

# I removed the match_on_col parameter
add_last_n_days_avg_with_days_at_index(df=dfn, label_col='label', count_of_days=7, round_to=0)
Result:
            A  B  C        date dim_vector  label  label_7D
date
2018-12-14  1  r  a  2018-12-14      1_r_a      1       NaN
2018-12-15  1  r  a  2018-12-15      1_r_a      7       NaN
2018-12-16  1  r  a  2018-12-16      1_r_a      8       NaN
2018-12-17  1  r  a  2018-12-17      1_r_a      7       NaN
2018-12-18  1  r  a  2018-12-18      1_r_a      5       NaN
2018-12-19  1  r  a  2018-12-19      1_r_a      7       NaN
2018-12-20  1  r  a  2018-12-20      1_r_a      1       5.0
2018-12-21  1  r  a  2018-12-21      1_r_a      6       6.0
2018-12-22  1  r  a  2018-12-22      1_r_a      9       6.0
2018-12-23  1  r  a  2018-12-23      1_r_a      1       5.0
2018-12-24  1  r  a  2018-12-24      1_r_a      1       4.0
2018-12-25  1  r  a  2018-12-25      1_r_a      0       4.0
2018-12-26  1  r  a  2018-12-26      1_r_a      3       3.0
2018-12-27  1  r  a  2018-12-27      1_r_a      0       3.0
2018-12-28  1  r  a  2018-12-28      1_r_a      0       2.0
2018-12-29  1  r  a  2018-12-29      1_r_a      9       2.0
2018-12-30  1  r  a  2018-12-30      1_r_a      1       2.0
2018-12-31  1  r  a  2018-12-31      1_r_a      2       2.0
2019-01-01  1  r  a  2019-01-01      1_r_a      1       2.0
2019-01-02  1  r  a  2019-01-02      1_r_a      9       3.0