Showing data and model predictions in one plot using seaborn and statsmodels


The first approach is to define a function that does the fit and the plotting, and pass it to

import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")

def plot_good_tip(day, total_bill, **kws):

    expected_tip = (total_bill.groupby(day)
                              .mean()
                              .apply(lambda x: x * .2)
                              .reset_index(name="tip"))

    sns.pointplot(, expected_tip.tip,
                  linestyles=["--"], markers=["D"])

g = sns.FacetGrid(tips, col="sex", size=5), "day", "tip"), "day", "total_bill")
g.set_axis_labels("day", "tip")

The second approach is to compute the predicted values and then merge them into your DataFrame, with an additional variable that identifies what is data and what is model:

tip_predict = (tips.groupby(["day", "sex"])
               .total_bill
               .mean()
               .apply(lambda x: x * .2)
               .reset_index(name="tip"))
tip_all = pd.concat(dict(data = tips[["day", "sex", "tip"]], model = tip_predict),
   names = ["kind"]).reset_index()

sns.factorplot("day", "tip", "kind", data = tip_all, col = "sex",
   kind = "point", linestyles = ["-", "--"], markers = ["o", "D"])
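Note that factorplot() was renamed to catplot() in seaborn 0.9, so on current releases the second approach looks like the sketch below. It uses a small hand-made stand-in for the tips dataset (the values are illustrative, chosen only so the snippet runs without downloading anything):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Tiny stand-in for the tips dataset (avoids the network fetch in load_dataset)
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"] * 2,
    "sex": ["Male"] * 6 + ["Female"] * 6,
    "total_bill": [10.0, 14.0, 20.0, 22.0, 30.0, 28.0,
                   12.0, 16.0, 18.0, 24.0, 26.0, 32.0],
    "tip": [2.0, 2.5, 3.5, 4.0, 5.0, 4.5,
            2.2, 2.8, 3.0, 4.2, 4.4, 5.5],
})

# "model" tip: 20% of the mean bill per (day, sex) cell
tip_predict = (tips.groupby(["day", "sex"])["total_bill"]
               .mean()
               .mul(.2)
               .reset_index(name="tip"))

# stack data and model, labeling the source in a "kind" column
tip_all = pd.concat(dict(data=tips[["day", "sex", "tip"]], model=tip_predict),
                    names=["kind"]).reset_index()

# catplot() is the modern name for factorplot()
g = sns.catplot(x="day", y="tip", hue="kind", data=tip_all, col="sex",
                kind="point", linestyles=["-", "--"], markers=["o", "D"])
```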

Suggestion : 2

Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, regplot() and lmplot(), are closely related and share much of their core functionality. It is important to understand the ways they differ, however, so that you can quickly choose the correct tool for a particular job.

In the presence of these kinds of higher-order relationships, lmplot() and regplot() can fit a polynomial regression model to explore simple kinds of nonlinear trends in the dataset.

A few other seaborn functions use regplot() in the context of a larger, more complex plot. The first is the jointplot() function that we introduced in the distributions tutorial. In addition to the plot styles previously discussed, jointplot() can use regplot() to show the linear regression fit on the joint axes by passing kind="reg".

The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(color_codes = True)
tips = sns.load_dataset("tips")
sns.regplot(x = "total_bill", y = "tip", data = tips);
sns.lmplot(x = "total_bill", y = "tip", data = tips);
sns.lmplot(x = "size", y = "tip", data = tips);

Suggestion : 3

Last Updated : 15 Jan, 2022

For a Python environment:

pip install seaborn

For a conda environment:

conda install seaborn

Suggestion : 4

The plot_fit function plots the fitted values versus a chosen independent variable. It includes prediction confidence intervals and optionally plots the true dependent variable.

The plot_regress_exog function is a convenience function that gives a 2x2 plot containing the dependent variable and fitted values with confidence intervals vs. the independent variable chosen, the residuals of the model vs. the chosen independent variable, a partial regression plot, and a CCPR plot. This function can be used for quickly checking modeling assumptions with respect to a single regressor.

Note that the data here is not the same as in that example; you could run that example by uncommenting the necessary cells below.

As you can see, the partial regression plot confirms the influence of the conductor, minister, and RR.engineer observations on the partial relationship between income and prestige. These cases greatly decrease the effect of income on prestige; dropping them confirms this.

%matplotlib inline
from statsmodels.compat import lzip
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data

Suggestion : 5

A fit plot shows predicted values of the response variable versus actual values of Y. If the linear regression model is perfect, the predicted values will exactly equal the observed values and all the data points in a predicted-versus-actual scatterplot will fall on the 45° diagonal.

Recall the general format of the linear regression equation: \(Y = \beta_0 + \beta_1 X_1 + ... + \beta_n X_n\), where \(Y\) is the value of the response variable and \(X_i\) is the value of the explanatory variable(s).

As in R, creating a better fit plot is a bit more work. The central issue is that the observed and predicted axes must be identical for the reference line to be 45°. To achieve this, create a plot showing the observed versus predicted values of Y, save it to an object (in my case ax), and then force both axes to the same limits.

import pandas as pd
con = pd.read_csv('Data/ConcreteStrength.csv')
con.rename(columns = {
   'Fly ash': 'FlyAsh',
   'Coarse Aggr.': "CoarseAgg",
   'Fine Aggr.': 'FineAgg',
   'Air Entrainment': 'AirEntrain',
   'Compressive Strength (28-day)(Mpa)': 'Strength'
}, inplace = True)
con['AirEntrain'] = con['AirEntrain'].astype('category')
import statsmodels.api as sm
Y = con['Strength']
X = con['FlyAsh']
X.head()
# 0    105.0
# 1    191.0
# 2    191.0
# 3    190.0
# 4    144.0
# Name: FlyAsh, dtype: float64
X = sm.add_constant(X)
model = sm.OLS(Y, X, missing='drop')
model_result =
import seaborn as sns

Suggestion : 6

This post will walk you through building linear regression models to predict housing prices resulting from economic activity.

Simple linear regression uses a single predictor variable to explain a dependent variable. A simple linear regression equation is as follows: \(Y = \beta_0 + \beta_1 X_1 + \epsilon\).

By the end, we will have walked through setting up basic simple linear and multiple linear regression models to predict housing prices resulting from macroeconomic forces, and how to assess the quality of a linear regression model on a basic level.

from IPython.display import HTML, display

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pandas as pd
import numpy as np
root = ''

housing_price_index = pd.read_csv(root + '/monthly-hpi.csv')
unemployment = pd.read_csv(root + '/unemployment-macro.csv')
federal_funds_rate = pd.read_csv(root + '/fed_funds.csv')
shiller = pd.read_csv(root + '/shiller.csv')
gross_domestic_product = pd.read_csv(root + '/gdp.csv')
# merge dataframes into single dataframe by date
df = (shiller.merge(housing_price_index, on = 'date')
   .merge(unemployment, on = 'date')
   .merge(federal_funds_rate, on = 'date')
   .merge(gross_domestic_product, on = 'date'))
# fit our model with .fit() and show results
# we use statsmodels' formula API to invoke the syntax below,
# where we write out the formula using ~
housing_model = ols("housing_price_index ~ total_unemployed", data=df).fit()

# summarize our model
housing_model_summary = housing_model.summary()

# convert our table to HTML and add colors to headers for explanatory purposes
HTML(housing_model_summary
     .as_html()
     .replace('<th> Adj. R-squared: </th>',
              '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
     .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
     .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
     .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
     .replace('<th>[0.025</th> <th>0.975]</th>',
              '<th style="background-color:#ff9896;">[0.025</th> <th style="background-color:#ff9896;">0.975]</th>'))
# This produces our four regression plots for total_unemployed

fig = plt.figure(figsize=(15, 8))

# pass in the model as the first parameter, then specify the
# predictor variable we want to analyze
fig = sm.graphics.plot_regress_exog(housing_model, "total_unemployed", fig=fig)