Dimension Reduction Using Python
Dimension Reduction
In machine learning, each feature or column is called a dimension. Reducing the number of columns or features is called dimension reduction. It is important because removing redundant or uninformative features gives the model less noise to learn from, which often helps it predict more accurately.
We will perform dimension reduction on the Kaggle dataset Churn_Modelling.csv.
Methods of dimension reduction
- Correlation.
- Backward Elimination.
- PCA.
- Feature Importance.
1. Correlation
- Correlation is one of the methods used for dimension reduction.
- It is used to find out how strongly two columns are related to each other.
- If two input variables are highly correlated, we drop one of them, which in turn reduces the number of input variables to our model.
- The Python code to compute and plot the correlations is shown below.
import seaborn as sns

# correlation matrix of the input features
corr = X.corr()
ax = sns.heatmap(corr, annot=True)
# workaround for matplotlib versions that clip the top and bottom heatmap rows
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Correlation heatmap of the input features.
- In the above heatmap you can see the correlation coefficient for every pair of columns. You can set a threshold and drop one column from any pair whose correlation exceeds it, whether the correlation is positive or negative; a small sketch of this is shown below.
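As a rough sketch of this idea (the threshold value 0.85 here is just an example, not something fixed by the original post), you could drop one column from every highly correlated pair like this:

import numpy as np

threshold = 0.85  # example cutoff; tune for your data

# absolute correlations, keeping only the upper triangle so each pair appears once
corr_abs = X.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# drop any column that is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)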
2. Backward Elimination
- In backward elimination there are a few steps that need to be followed.
- First, you have to select a significance level (here p = 0.05).
- If the p-value of a particular column is above 0.05, we will drop that column.
- Fit the model with all the input features, then check the p-values of all the features. If any feature has p > 0.05, find the feature with the highest p-value, drop it, and fit the model again with the remaining features.
- Repeat the above step until p < 0.05 for all remaining features.
- The code for each iteration is shown below.
1st Iteration
import statsmodels.regression.linear_model as sm

# fit an OLS model on all candidate features and inspect the p-values
x_formated = X.loc[:,['CreditScore', 'Age', 'Tenure', 'Balance',
                      'NumOfProducts', 'HasCrCard', 'IsActiveMember',
                      'EstimatedSalary', 'Geography_Germany',
                      'Geography_Spain', 'Gender_Male']]
ols = sm.OLS(endog = y, exog = x_formated).fit()
ols.summary()
OLS summary: Geography_Spain has the highest p-value, so it is dropped.
2nd Iteration
import statsmodels.regression.linear_model as sm
x_formated = X.loc[:,['CreditScore', 'Age', 'Tenure',
'Balance', 'NumOfProducts', 'HasCrCard',
'IsActiveMember', 'EstimatedSalary',
'Geography_Germany','Gender_Male']]
ols = sm.OLS(endog = y, exog = x_formated).fit()
ols.summary()
OLS summary: EstimatedSalary has the highest p-value, so it is dropped.
3rd Iteration
import statsmodels.regression.linear_model as sm
x_formated = X.loc[:,['CreditScore', 'Age', 'Tenure',
                      'Balance', 'NumOfProducts', 'HasCrCard',
                      'IsActiveMember', 'Geography_Germany',
                      'Gender_Male']]
ols = sm.OLS(endog = y, exog = x_formated).fit()
ols.summary()
OLS summary: HasCrCard has the highest p-value, so it is dropped.
4th Iteration
import statsmodels.regression.linear_model as sm
x_formated = X.loc[:,['CreditScore', 'Age', 'Tenure',
'Balance', 'NumOfProducts','IsActiveMember',
'Geography_Germany','Gender_Male']]
ols = sm.OLS(endog = y, exog = x_formated).fit()
ols.summary()
OLS summary: no remaining column has p > 0.05, so backward elimination stops here.
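The manual iterations above can also be wrapped in a loop. The helper below is only a sketch of that idea, not code from the original notebook; it assumes X (already one-hot encoded) and y are defined as before.

import statsmodels.regression.linear_model as sm

def backward_elimination(X, y, significance=0.05):
    # start with every feature and drop the worst one per iteration
    features = list(X.columns)
    while features:
        ols = sm.OLS(endog=y, exog=X[features]).fit()
        worst = ols.pvalues.idxmax()
        if ols.pvalues[worst] <= significance:
            break  # all remaining features are significant
        features.remove(worst)
    return features

selected_features = backward_elimination(X, y)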
3. Principal Component Analysis
- PCA is widely used for dimension reduction.
- Instead of selecting a subset of the original columns, it projects the data onto a smaller set of new components, each a linear combination of the original features.
- It preserves as much of the variance in the whole dataset as possible in those fewer dimensions.
- The code for PCA is shown below.
from sklearn.decomposition import PCA

# n_components is the number of components to keep after the projection
pca = PCA(n_components=2)
pca.fit(X)

print(pca.explained_variance_ratio_)  # share of the variance kept by each component
print(pca.singular_values_)

X_pca = pca.transform(X)
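PCA is sensitive to feature scale, so in practice the data is usually standardized first, and n_components can be chosen from the cumulative explained variance. The snippet below is a sketch of that common workflow; the 0.95 variance target is an example value, not taken from the original post.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# put all features on the same scale so no single column dominates the components
X_scaled = StandardScaler().fit_transform(X)

# keep enough components to explain 95% of the variance (example target)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_, "components kept")
print(pca.explained_variance_ratio_.cumsum())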
4. Feature Importance
- This method tells us how much each feature contributes to predicting the output.
- It is computed after the model is trained; in scikit-learn the attribute below is exposed by tree-based models such as random forests.
model.feature_importances_
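A minimal sketch of how feature_importances_ is typically used, assuming a RandomForestClassifier since the original post does not say which model was trained:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# assumed model choice; any scikit-learn estimator exposing feature_importances_ works
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))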