2023-05-19 02:41 · Data science/Machine Learning
Principal Components Analysis (PCA) is a popular technique for analyzing high-dimensional data. PCA is a mathematical procedure that transforms a set of correlated variables into a new set of uncorrelated variables, called "principal components", while retaining most of the variability of the original data.
To perform PCA on a dataset of N p-dimensional data points x_j ∈ ℝᵖ, we follow these steps.
1. Standardize the data: To put all variables on a similar scale, subtract the mean and divide by the standard deviation of each variable.
2. Compute the covariance matrix: The covariance matrix summarizes the relationships between the variables in the data. It can be computed as cov(X) = 1/(n−1) · XᵀX, where X is the standardized data matrix.
3. Compute the eigenvectors and eigenvalues (eigensystem): These represent the directions and magnitudes of maximum variability in the data (eigenvalues 𝛌_i and eigenvectors v_i of the covariance matrix).
4. Sort the eigenvectors in descending order of their corresponding eigenvalues: The eigenvectors with the highest eigenvalues represent the directions of maximum variability in the data. These directions are called "Principal components".
5. Choose the number of principal components: Use a scree plot to determine how many components to keep. (Scree plot: a plot of the eigenvalues against the component number; keep the components before the point where the curve levels off.)
6. Compute the principal components: Multiply the standardized data matrix by the eigenvectors of the covariance matrix.
7. Calculate the percentage of the total sample variance explained by the i-th principal component: ( 𝛌_i / ∑_j 𝛌_j ) × 100 %. The j-th element of the i-th eigenvector v_i is called the "loading" of the j-th variable onto the i-th principal component.
(*Loading: the relationship between the original variables and a principal component; PC_i = v_i1 · x_1 + ... + v_ip · x_p )
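The steps above can be sketched in NumPy as follows; the toy dataset and variable names here are purely illustrative, not part of the original post:

```python
import numpy as np

# Hypothetical toy data: 100 observations of 3 correlated variables.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
X_raw[:, 2] = X_raw[:, 0] + 0.1 * rng.normal(size=100)  # induce correlation

# Step 1: standardize each variable (zero mean, unit variance).
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Step 2: covariance matrix, cov(X) = X^T X / (n - 1).
n = X.shape[0]
C = X.T @ X / (n - 1)

# Step 3: eigenvalues and eigenvectors (eigh, since C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort in descending order of eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 6: principal component scores = standardized data times eigenvectors.
scores = X @ eigvecs

# Step 7: percentage of total variance explained by each component.
explained_pct = 100 * eigvals / eigvals.sum()
```

The columns of `eigvecs` hold the loadings, and `explained_pct` is the quantity a scree plot would display (step 5).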
The result of PCA is a set of principal components, which are linear combinations of the original variables. These PCs are mutually uncorrelated.
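The uncorrelatedness of the components can be checked numerically; this is a small sketch with made-up data, where the covariance matrix of the PC scores comes out diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 4))
A[:, 1] = 0.8 * A[:, 0] + 0.2 * rng.normal(size=200)  # correlated inputs

# Standardize, eigendecompose the covariance, project onto eigenvectors.
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(Z.T @ Z / (Z.shape[0] - 1))
pcs = Z @ vecs  # principal component scores

# Covariance of the scores: diagonal entries are the eigenvalues,
# off-diagonal entries are (numerically) zero.
S = np.cov(pcs, rowvar=False)
off_diag = S - np.diag(np.diag(S))
```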
PCA can be used to reduce the dimensionality of data, making it easier to visualize and analyze. It can also be used to identify patterns and relationships (correlations) in the data, and for data compression.
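For dimensionality reduction, only the leading components are kept; one minimal sketch (toy data and the choice k = 2 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=50)  # redundant variables
X[:, 4] = X[:, 1] + 0.05 * rng.normal(size=50)

# Standardize and eigendecompose the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(Z.T @ Z / (Z.shape[0] - 1))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Keep the two leading components: a 50 x 2 representation of 50 x 5 data,
# suitable for a 2-D scatter plot or as compressed input to another model.
k = 2
Z_reduced = Z @ vecs[:, :k]
```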