What is multicollinearity?
Multicollinearity occurs when the independent variables in a regression model (specifically, a multiple regression model) are correlated with one another. A regression model has dependent and independent variables, and correlation between the independent variables is a problem because independent variables should be exactly that: independent.
If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
When independent variables are correlated, changes in one variable are associated with shifts in another. The stronger the correlation, the more difficult it is to change one variable without changing the other, and the harder it becomes for the model to estimate the relationship between each independent variable and the dependent variable, because the independent variables tend to change together.
In other words, one independent (predictor) variable can be used to predict another independent variable. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are:
- a person’s height, weight, and age
- years of education and annual income
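As a quick illustration of the first pair above, here is a small sketch with entirely made-up numbers: weight is simulated partly as a function of height, so the two predictors end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: height (cm) and weight (kg) for 200 people.
# Weight is generated partly from height, so the two predictors correlate.
height = rng.normal(170, 10, size=200)
weight = 0.9 * (height - 100) + rng.normal(0, 5, size=200)

# Pearson correlation coefficient between the two predictors.
r = np.corrcoef(height, weight)[0, 1]
print(f"correlation between height and weight: {r:.2f}")
```

If both variables were used as predictors in the same regression, they would carry largely redundant information.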
A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.
How do you detect and remove multicollinearity?
An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables (see also the variance inflation factor). A correlation coefficient of exactly +1 or -1 between two predictors is called perfect multicollinearity. If a coefficient is close to +1 or -1, one of the two variables should be removed from the model if at all possible.

Is it necessary to remove multicollinearity?

Yes and no, depending on the application of your regression model. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce even severe multicollinearity.
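The detection step described above can be sketched in a few lines. This is a minimal example on synthetic data (all names and values are made up for illustration): it computes the pairwise correlation matrix of the predictors and then the variance inflation factors, using the fact that each VIF equals the corresponding diagonal element of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Three predictors: x1 and x2 are strongly correlated, x3 is independent.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Pairwise correlation matrix of the predictors.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# The variance inflation factor (VIF) of each predictor is the
# corresponding diagonal element of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))
for name, v in zip(["x1", "x2", "x3"], vif):
    print(f"VIF({name}) = {v:.1f}")
```

A common rule of thumb is that a VIF above 5 (some use 10) signals problematic multicollinearity; here x1 and x2 would flag each other, while x3 stays near 1.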
However, coefficient estimates can swing wildly, becoming very sensitive to small changes in the model. Multicollinearity reduces the precision of the coefficient estimates, which weakens your regression model, and you may not be able to trust the p-values to identify which independent variables are statistically significant. In general, multicollinearity leads to wider confidence intervals and less reliable estimates of the effect of each independent variable. That is, the statistical inferences from a model with multicollinearity may not be dependable.
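To make the "stable predictions, unstable coefficients" point concrete, here is a small synthetic sketch (all data are made up): two nearly identical predictors are fit on two halves of a dataset. The individual coefficients differ between the two fits, but their sum stays close to the true effect, so the predictions barely change.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# True model: y depends only on x1, but x2 is almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear predictor
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Fit ordinary least squares on each half of the data separately.
coef_a, *_ = np.linalg.lstsq(X[:100], y[:100], rcond=None)
coef_b, *_ = np.linalg.lstsq(X[100:], y[100:], rcond=None)
print("first half coefficients: ", np.round(coef_a, 2))
print("second half coefficients:", np.round(coef_b, 2))

# The individual slopes can disagree between the two fits, but only
# their sum is pinned down by the data, and it stays near 2.
print("sum of slopes (first half): ", round(coef_a[1] + coef_a[2], 2))
print("sum of slopes (second half):", round(coef_b[1] + coef_b[2], 2))
```

This is why a model like this can predict well even though the individual coefficient estimates, and any inferences drawn from them, are unreliable.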
If your goal in building a model is to gain a deeper understanding of the variables involved, then the multicollinearity must go.


