Principal Component Regression (PCR) is a regression technique that first reduces the dimensionality of the predictors by projecting them onto a lower-dimensional subspace and then fits a linear regression in that subspace. The projection is defined by the principal components: mutually orthogonal linear combinations of the original variables, with mutually uncorrelated scores, ordered by how much of the variance in the predictors they capture. The leading principal components are used as predictors in the regression model, instead of the original variables.
PCR is often used as an alternative to multiple linear regression, especially when the number of variables is large or when the variables are correlated. By using PCR, we can reduce the number of variables in the model and improve the interpretability and stability of the regression results.
Features of the Principal Component Regression (PCR)
Here are some key features of Principal Component Regression (PCR):
- PCR reduces the dimensionality of a dataset by projecting it onto a lower-dimensional subspace, using a set of orthogonal linear combinations of the original variables called principal components.
- PCR is often used as an alternative to multiple linear regression, especially when the number of variables is large or when the variables are correlated.
- By using PCR, we can reduce the number of variables in the model and improve the interpretability and stability of the regression results.
- To perform PCR, we first standardize the original variables and then compute the principal components, either from a singular value decomposition (SVD) of the standardized data matrix or from an eigendecomposition of its covariance matrix.
- The principal components are then used as predictors in a linear regression model, whose coefficients can be estimated using least squares regression or maximum likelihood estimation.
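The steps listed above can be sketched directly with NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the helper names `pcr_fit` and `pcr_predict`, the loading matrix, and the noise levels are all assumptions made for the example.

```python
import numpy as np


def pcr_fit(X, y, k):
    """Fit PCR with the first k principal components (hypothetical helper)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                       # step 1: standardize the predictors
    # Step 2: SVD of the standardized data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T                             # shape (p, k)
    T = Z @ V_k                                # component scores, shape (n, k)
    # Step 3: least squares of the centered response on the scores.
    y_mean = y.mean()
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return {"mean": mean, "std": std, "V_k": V_k, "gamma": gamma, "y_mean": y_mean}


def pcr_predict(model, X_new):
    """Apply the stored standardization and projection, then the linear model."""
    Z = (X_new - model["mean"]) / model["std"]
    return model["y_mean"] + (Z @ model["V_k"]) @ model["gamma"]


# Synthetic data (assumed for illustration): six correlated predictors
# driven by two latent factors, plus a little noise.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.8, 1.2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 0.9, 1.1]])
X = latent @ loadings + 0.05 * rng.normal(size=(n, 6))
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

model = pcr_fit(X, y, k=2)
y_hat = pcr_predict(model, X)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 using 2 of 6 components: {r2:.3f}")
```

Because the six predictors are built from only two underlying factors, two components are enough here; on real data the number of components has to be chosen empirically.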
Breaking down the Math behind Principal Component Regression (PCR)
Here is a brief overview of the mathematical concepts underlying Principal Component Regression (PCR):
- Dimensionality reduction: PCR reduces the dimensionality of a dataset by projecting it onto a lower-dimensional subspace, using a set of orthogonal linear combinations of the original variables called principal components. This is a way of summarizing the data by capturing the most important patterns and relationships in the data while ignoring noise and irrelevant information.
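- Principal components: The principal components of a dataset are the orthogonal linear combinations of the original variables that capture the most variance in the data. They are obtained from a singular value decomposition (SVD) of the standardized data matrix or, equivalently, from an eigendecomposition of its covariance matrix. A dataset with p variables has at most p principal components; PCR keeps only the first k of them, with k usually chosen by cross-validation or from the cumulative explained variance. If all p components are kept, PCR gives the same fit as ordinary least squares on the original variables, so the benefit comes from dropping the low-variance components, which matters most when the variables are strongly collinear.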
- Linear regression: PCR uses the principal components as predictors in a linear regression model, whose coefficients can be estimated using least squares regression or maximum likelihood estimation. The fitted model can then be used to make predictions on new data.
Overall, PCR uses mathematical concepts from linear algebra and statistics to reduce the dimensionality of a dataset and improve the interpretability and stability of regression results.
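The three ideas above can be written compactly. The notation below is an assumption introduced for this sketch and does not come from the original text: $X$ is the $n \times p$ standardized data matrix, $y$ the centered response, and $k \le p$ the number of retained components.

```latex
\begin{aligned}
X &= U D V^{\top}
  && \text{(SVD; the columns of } V \text{ are the principal directions)} \\
Z_k &= X V_k = U_k D_k
  && \text{(scores on the first } k \text{ components)} \\
\hat{\gamma}_k &= (Z_k^{\top} Z_k)^{-1} Z_k^{\top} y = D_k^{-1} U_k^{\top} y
  && \text{(least squares on the scores)} \\
\hat{\beta}_{\mathrm{PCR}} &= V_k \hat{\gamma}_k
  && \text{(coefficients on the original variables)} \\
\hat{y} &= Z_k \hat{\gamma}_k = X \hat{\beta}_{\mathrm{PCR}}
  && \text{(fitted values)}
\end{aligned}
```

Here $V_k$, $U_k$ and $D_k$ denote the first $k$ columns of $V$ and $U$ and the leading $k \times k$ block of $D$. When $k = p$ and $X$ has full column rank, $\hat{\beta}_{\mathrm{PCR}}$ coincides with the ordinary least squares estimate.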
Limitations of Principal Component Regression (PCR)
While Principal Component Regression (PCR) has many advantages, it also has some limitations that should be considered when deciding whether to use it for a particular regression analysis:
- PCR only works well with linear relationships: PCR assumes that the relationship between the predictors and the response variable is linear. If the relationship is non-linear, PCR may not be able to accurately capture it, leading to biased or inaccurate predictions. In such cases, non-linear regression methods may be more appropriate.
- PCR does not handle outliers well: PCR is sensitive to outliers in the data, which can have a disproportionate impact on the principal components and the fitted regression model. Therefore, it is important to identify and handle outliers in the data before applying PCR.
- PCR may not be interpretable: PCR involves a complex mathematical procedure that generates a set of orthogonal linear combinations of the original variables. These linear combinations may not be easily interpretable, especially if the number of variables is large. In contrast, multiple linear regression is more interpretable, since it uses the original variables directly as predictors.
- PCR may not be efficient: PCR adds a decomposition step (SVD or eigendecomposition) before the regression, and the cost of that step grows quickly with the number of variables. When both the number of variables and the number of observations are large, this extra step can dominate the running time, and faster regression methods may be more appropriate.
Overall, while PCR has many advantages, it is important to carefully consider its limitations and potential drawbacks before using it for regression analysis.
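The interpretability limitation can be softened somewhat, because a fitted PCR model can be expressed as coefficients on the original variables. The sketch below assumes a scikit-learn pipeline with the step names `pca` and `reg`, fitted on the diabetes dataset; the variable names `beta` and `intercept` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_diabetes()
X, y = data.data, data.target

pipeline = Pipeline(steps=[("pca", PCA(n_components=5)),
                           ("reg", LinearRegression())])
pipeline.fit(X, y)

pca = pipeline.named_steps["pca"]
reg = pipeline.named_steps["reg"]

# pca.components_ has shape (k, p) and reg.coef_ has shape (k,), so the
# product below gives one coefficient per original variable.
beta = pca.components_.T @ reg.coef_
# PCA centers the data with pca.mean_ before projecting; fold that into the intercept.
intercept = reg.intercept_ - pca.mean_ @ beta

for name, b in zip(data.feature_names, beta):
    print(f"{name:>4}: {b:10.2f}")

# The back-transformed coefficients reproduce the pipeline's predictions.
manual_pred = X @ beta + intercept
print(np.allclose(manual_pred, pipeline.predict(X)))
```

These coefficients are constrained to lie in the span of the retained components, so they should be read as a summary of the PCR fit rather than as independent effects of each variable.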
How does Principal Component Regression (PCR) compare to other techniques?
Principal Component Regression (PCR) is often compared to related techniques, such as multiple linear regression, principal component analysis (PCA), and partial least squares regression (PLSR). Here are some key differences between PCR and these other techniques:
- PCR vs. multiple linear regression: PCR is similar to multiple linear regression, in that both techniques use linear regression to model the relationship between a set of predictors and a response variable. However, PCR differs from multiple linear regression in that it reduces the dimensionality of the data by projecting it onto a lower-dimensional subspace using the principal components. This can improve the interpretability and stability of the regression results, especially when the number of variables is large or when the variables are correlated.
- PCR vs. PCA: PCR is similar to PCA, in that both techniques use principal components to reduce the dimensionality of the data. However, PCR differs from PCA in that it uses the principal components as predictors in a linear regression model, whereas PCA is an unsupervised technique that only analyzes the structure of the data itself, without using a response variable.
- PCR vs. PLSR: PCR is similar to PLSR, in that both techniques replace the original variables with a small number of derived components and fit a linear model on them. However, they construct the components differently: PCR chooses them to capture the most variance in the predictors alone, without looking at the response, whereas PLSR chooses them to have high covariance with the response. As a result, PLSR often needs fewer components than PCR to reach a similar fit, while PCR can discard a low-variance direction that is in fact predictive. Both methods are linear, so neither captures non-linear relationships on its own.
Overall, PCR is a useful technique for regression analysis that can be compared to multiple linear regression, PCA, and PLSR, depending on the specific characteristics of the data and the goals of the analysis.
Principal Component Regression (PCR) in Python:
Here is the implementation of Principal Component Regression (PCR) in Python, using the scikit-learn library:
```python
# Import the required modules
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline

# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)
print(X.shape)
```
Output:
(442, 10)
Now let's reduce the dimensionality of the original dataset by half, that is, from 10-dimensional data to 5-dimensional data. To do this, we create a pipeline that consists of two steps: PCA and linear regression. The PCA step is initialized with the n_components parameter set to 5, which means that only the first five principal components will be kept. The linear regression step is initialized with the default parameters.
```python
# Create a pipeline with PCA and linear regression
pca = PCA(n_components=5)  # Keep only the first five principal components
reg = LinearRegression()
pipeline = Pipeline(steps=[('pca', pca), ('reg', reg)])

# Fit the pipeline to the data
pipeline.fit(X, y)

# Predict the target values for the data
y_pred = pipeline.predict(X)
```
Now let's evaluate the performance of the model by using metrics like mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the R^2 score.
```python
# Compute the evaluation metrics
mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = pipeline.score(X, y)

# Print the number of features before and after PCR
print(f'Number of features before PCR: {X.shape[1]}')
print(f'Number of features after PCR: {pca.n_components_}')

# Print the evaluation metrics
print(f'MAE: {mae:.2f}')
print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R^2: {r2:.2f}')
```
Output:
Number of features before PCR: 10
Number of features after PCR: 5
MAE: 44.30
MSE: 2962.70
RMSE: 54.43
R^2: 0.50
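The metrics above are computed on the same data the model was fitted to, and the number of components was fixed in advance. A possible refinement, sketched below under assumptions that are not part of the original example (a 75/25 train/test split, `random_state=0`, a `StandardScaler` step, and 5-fold cross-validation), is to choose the number of components with `GridSearchCV` and report R^2 on held-out data.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pcr = Pipeline(steps=[
    ("scale", StandardScaler()),   # standardize using training-fold statistics only
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Try every possible number of components, from 1 up to the number of variables.
param_grid = {"pca__n_components": list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pcr, param_grid=param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_k = search.best_params_["pca__n_components"]
test_r2 = search.score(X_test, y_test)   # R^2 of the refitted best model
print(f"Selected number of components: {best_k}")
print(f"Cross-validated R^2 (train): {search.best_score_:.2f}")
print(f"Held-out R^2 (test): {test_r2:.2f}")
```

Putting the scaler and PCA inside the pipeline means they are re-fitted within each cross-validation fold, which avoids leaking information from the validation data into the components.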