Tuesday, October 29, 2019

Principal Component Analysis

Principal Component Analysis (PCA) is a linear-algebra-based technique for reducing the dimensionality of a set of features while capturing the maximum information from them. In other words, say you have 10 features/variables describing a dataset. PCA helps you reduce the 10 features to a smaller number while capturing the effect of all the variables, without dropping any of them. This is important because, in techniques like linear regression, we would normally compromise by dropping variables that have high multicollinearity. PCA essentially compacts highly multicollinear variables into one component, thereby reducing all the features to a smaller set of "principal components" that affect the outcome. A key tenet here is capturing the variance among the variables in decreasing order of magnitude. That means there is an order to the principal components, with the first component explaining the largest share of the variance, the second the next largest, and so on.
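
As a concrete illustration, here is a minimal sketch using scikit-learn. The dataset (10 multicollinear features built from 3 underlying signals) and the 95% variance threshold are illustrative assumptions, not anything prescribed by PCA itself:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical dataset: 500 rows, 10 multicollinear features from 3 signals
rng = np.random.default_rng(0)
signals = rng.normal(size=(500, 3))
X = signals @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

# Standardize first: PCA is sensitive to the scale of each feature
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # roughly (500, 3): 10 features compacted
print(pca.explained_variance_ratio_)  # sorted in decreasing order

Note how explained_variance_ratio_ comes out sorted in decreasing order, mirroring the ordering of the principal components described above.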

Practical Considerations
  1. PCA is good with linear relationships, since it reduces the variables to components that are expressed as linear combinations of the original variables (and hence it pairs well with techniques such as logistic regression; even otherwise, PCA can be used for more efficient computation). However, in situations where you need non-linear treatment, t-SNE (t-distributed Stochastic Neighbour Embedding) is an alternative, although it is computationally intensive to the point of being prohibitive. Use it only when the data is reasonably small (see the t-SNE sketch after this list).
  2. PCA assumes orthogonality when capturing the variances, and data may not always be like that. In such situations, ICA (Independent Component Analysis) is a better option, though it is again computationally intensive (a minimal ICA sketch also follows the list).
  3. PCA de-emphasizes variables with low variance. This may not work well in situations where low-variance signals are also important, such as fraud detection, where the cases of interest are few to begin with.
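
On point 1, here is a minimal t-SNE sketch, again with scikit-learn; the data is a small hypothetical sample, and the perplexity value is just a common default, not a recommendation:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # hypothetical small dataset

# t-SNE embeds into 2-D/3-D, mainly for visualization; it is slow on
# large data, so keep the sample modest (thousands of rows at most)
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_embedded.shape)  # (500, 2)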
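
And on point 2, a minimal ICA sketch using scikit-learn's FastICA; the mixing setup (3 non-Gaussian sources mixed into 5 observed features) is a made-up example:

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Hypothetical non-Gaussian sources mixed into 5 observed, correlated features
sources = rng.laplace(size=(500, 3))
X = sources @ rng.normal(size=(3, 5))

# FastICA seeks statistically independent components, rather than the
# orthogonal directions of maximum variance that PCA finds
ica = FastICA(n_components=3, random_state=0)
S_estimated = ica.fit_transform(X)
print(S_estimated.shape)  # (500, 3)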
