Intro to Linear Regression: Homoscedastic and normally distributed residuals



In this post, I will demonstrate the bare necessities for building and fitting a linear regression model. When you choose linear regression to fit your model, you must be aware that it makes a few assumptions about the data you're trying to fit:
1 - Homoscedasticity of residuals.
2 - Normal distribution of residuals.
3 - No multicollinearity among the dataset features. 
4 - Dataset observations are independent.

What is homoscedasticity? And what are residuals?

When you fit a model to a dataset, there is a difference between the value predicted by the model and the real value. This difference is called a residual.

Simply put, homoscedasticity of residuals means that the spread (variance) of the error/noise stays roughly the same across all values of the features used to build the model, rather than growing or shrinking along with them.

A quick example to paint a picture: imagine we had a dataset on athletes and their ability to lift weights. The assumption is that there is a strong correlation between an athlete's strength and the maximum weight they can lift. However, after looking at the plot of results, we find that the residuals for weaker athletes are very small, while there is a much wider variation in the residuals for stronger athletes. We naively thought that as any athlete's strength increased, their maximum lift increased with it, when in reality that may not be the case. There may be other factors besides strength that contribute to predicting the maximum lift, such as weight, age, or body fat percentage. The sketch below simulates this kind of pattern.
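Here is a minimal sketch, using synthetic data and made-up variable names (strength, max_lift), of what that pattern looks like: the noise is generated so that its spread grows with strength, so the residual plot fans out as strength increases.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical predictor: athlete strength on an arbitrary scale
strength = np.linspace(1, 10, 200)

# Max lift grows with strength, but the noise also grows with strength,
# so the residuals will be heteroscedastic
max_lift = 10 * strength + rng.normal(0, strength)

# Fit a straight line with least squares and compute the residuals
slope, intercept = np.polyfit(strength, max_lift, 1)
residuals = max_lift - (slope * strength + intercept)

# The residual "funnel" widens as strength increases
plt.scatter(strength, residuals)
plt.xlabel('Strength')
plt.ylabel('Residual')
plt.show()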

The first two assumptions are based on residuals. Ensuring homoscedastic, normally distributed residuals reinforces our confidence in using linear regression to build the model. 

(Wikipedia) The least-squares method finds the optimal parameter values by minimizing the sum, S, of squared residuals:

S = Σᵢ rᵢ²

where rᵢ is the i-th residual, i.e. the difference between the i-th observed value and the value the model predicts for it.

Because linear regression uses least squares to minimize the residuals, high variance among the residuals produces a potentially much larger S value. Check the wiki link here for more.
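As a quick illustration with made-up numbers, here is the arithmetic behind S for a handful of observations and predictions:

import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

residuals = y_true - y_pred     # r_i = observed - predicted
S = np.sum(residuals ** 2)      # sum of squared residuals
print(S)                        # 2.5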

I hope I haven't jammed in too much theory. Let's continue by looking at a quick example.


Model Building and EDA

We'll be using a dataset from the statsmodels library. Go ahead and import the following:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

from statsmodels.stats.outliers_influence import variance_inflation_factor

The ccard dataset represents customers of a bank. The value to be predicted (the dependent variable, or endog) is AVGEXP, the average expenses made by each customer. The independent variables (exog) are AGE, INCOME, INCOMESQ (engineered from income), and OWNRENT, a categorical variable representing whether the customer owns or rents their place of residence.



Now, let's load the dataset, fit a linear regression model, and get a quick summary.

# Load the dataset as pandas objects and fit an ordinary least squares model
credit_card = sm.datasets.ccard.load_pandas()
credit_model = sm.OLS(endog=credit_card.endog,
                      exog=credit_card.exog).fit()

credit_model.summary()





A few things to note here:

The income variable has a substantially higher coefficient value relative to the other variables, meaning that the model depends heavily on it. Now, while that might be true, and we could just close the case and go home, there are a few other things to consider here:

1 - The R-Squared value is low.
2 - The F-statistic is low.
3 - The p-values of the variables.
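If you'd rather pull these numbers out programmatically than read them off the summary table, statsmodels exposes them as attributes of the fitted results object:

# Headline figures from the summary, taken directly from the fitted model
print(credit_model.rsquared)   # R-squared
print(credit_model.fvalue)     # F-statistic
print(credit_model.pvalues)    # p-value of each coefficient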


Is it normally distributed?

In cases like these, we want to look individually at the relationship between each independent variable and the output variable. One way to do this is with a scatter matrix.
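A sketch of how to draw one, using pandas' built-in scatter_matrix helper:

# Scatter matrix of every column in the dataset (AVGEXP is the endog column)
pd.plotting.scatter_matrix(credit_card.data, figsize=(10, 10))
plt.show()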


In all three plots that show the relationship between the endog variable and the independent variables (AGE, INCOME and INCOMESQ), you'll notice a right-skewed distribution.

A right-skewed distribution has most of its values concentrated at the lower end, with a long tail of larger values. Looking at each of the three plots individually, the values really do cluster on the low side with a very wide spread toward higher values, and possibly some outliers.

For a clearer picture, let's look at the box plot of each variable.


credit_card.data.boxplot(column='AGE',return_type='dict');
plt.figure()
credit_card.data.boxplot(column='INCOME',return_type='dict');
plt.figure()
credit_card.data.boxplot(column='AVGEXP',return_type='dict');







All three box plots reinforce the case we stated before: most of the values sit on the lower side, but there are also some extremely high values, giving a high variance.

Let's look at the distribution of the endog (value to be predicted).

credit_card.endog.hist()
plt.show()




This type of distribution is called a log-normal distribution: a probability distribution whose logarithm is normally distributed. In other words, a variable (or a population, for that matter) is log-normally distributed if taking the log of its values produces a normal distribution. Let's use numpy to check.


np.log(credit_card.endog).hist()
plt.show()





It does look like a normal distribution. Therefore, we can conclude that the endog variable is not normally distributed; it's log-normally distributed.

What about the lack of homoscedasticity?

Before anything, the opposite of homoscedasticity is (yes, you guessed it) heteroscedasticity. The most common practice for identifying heteroscedasticity is to visually inspect the residual plot.


# Studentized residuals plotted against the fitted values
student_resid = credit_model.outlier_test()['student_resid']
plt.scatter(credit_model.fittedvalues, student_resid)
plt.xlabel('Fitted values of AVGEXP')
plt.ylabel('Studentized Residuals')
plt.show()





Now, let's take the log of the endog variable and fit the model to it.

credit_card = sm.datasets.ccard.load_pandas()
credit_model = sm.OLS(endog=np.log(credit_card.endog),
                      exog=credit_card.exog).fit()



Now, run the same residual plot code again:

student_resid = credit_model.outlier_test()['student_resid']
plt.scatter(credit_model.fittedvalues, student_resid)
plt.xlabel('Fitted values of AVGEXP')
plt.ylabel('Studentized Residuals')
plt.show()



As you can see, the residuals are more evenly spread, that is, more homoscedastic. Let's take a look at the model summary after fitting to the log of the endog variable.
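The summary is produced the same way as before:

credit_model.summary()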





As you can see, the R-squared value is much better, as is the F-statistic. The p-values of the coefficients are also more reliable, and the coefficients themselves look more dependable now.

Our work isn't done here; continue on to this article to see how we get rid of the outliers.