PCA_biplot {rYWAASB}R Documentation

The PCA biplot with loadings

Description

[Stable]

Usage

PCA_biplot(datap, lowt = FALSE)

Arguments

datap

The data set

lowt

A parameter indicating whether lower rates of the trait is preferred or not. For grain yield e.g. Upper values is preferred. For plant height lower values e.g. is preferred.

Details

PCA is a machine learning method and dimension reduction technique. It is utilized to simplify large data sets by extracting a smaller set that preserves significant patterns and trends(1). According to Johnson and Wichern (2007), a PCA explains the var-covar structure of a set of variables \(X_1, X_2, ..., X_p\) with a less linear combinations of such variables. Moreover the common objective of PCA is 1) data reduction and 2) interpretation.

Biplot and PCA: The biplot is a method used to visually represent both the rows and columns of a data table. It involves approximating the table using a two-dimensional matrix product, with the aim of creating a plane that represents the rows and columns. The techniques used in a biplot typically involve an eigen decomposition, similar to the one used in PCA. It is common for the biplot to be conducted using mean-centered and scaled data(2). For scaling variables, the data can be transformed as follow: \[z = \frac{x-\bar{x}}{s(x)}\] where 's(x)' denotes the sample standard deviation of 'x' parameter, calculated as: \[s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\] Algebra of PCA: As Johnson and Wichern (2007) stated(3), if the random vector \(\mathbf{X'} = {X_1, X_2,...,X_p }\) have the covariance matrix \(\sum\) with eigenvalues \(\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p \ge 0\).

Regarding the linear combinations: \[Y_1 = a'_1X = a_{11}X_1 + a_{12}X_2 + ... + a_{1P}X_p \] \[Y_2 = a'_2X = a_{21}X_1 + a_{22}X_2 + ... + a_{2p}X_p\] \[...\] \[Y_p = a'_pX = a_{p1}X_1 + a_{p2}X_2 + ... + a_{pp}X_p\]

where \(Var(Y_i) = \mathbf{a'_i\sum{a_i}}\) , i = 1, 2, ..., p \(Cov(Y_i, Y_k) = \mathbf{a'_i\sum{a_k}}\) , i, k = 1, 2, ..., p

The principal components refer to the uncorrelated linear combinations \(Y_1, Y_2, ..., Y_p\) which aim to have the largest possible variances.

For the random vector \(\mathbf{X'}=\left [ X_1, X_2, ..., X_p \right ]\), if \(\mathbf{\sum}\) be the associated covariance matrix, then \(\mathbf{\sum}\) have the eigenvalue-eigenvector pairs \((\lambda_1, e_1), (\lambda_2, e_2), ..., (\lambda_p, e_p)\), and as said \(\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p \ge 0\).

Then the \(\it{i}\)th principal component is as follows: \[Y_i = \mathbf{e'_iX} = e_{i1}X_1 + e_{i2}X_2 + ... + e_{ip}X_p, i = 1, 2, ..., p\], where \(Var(Y_i) =\mathbf(e'_i\sum{e_i}) = \lambda_i, i = 1, 2, ..., p\) \(Cov(Y_i, Y_k) = \mathbf{e'_i\sum e_i = 0, i \not\equiv k}\), and: \(\sigma_{11} + \sigma_{22} + ... + \sigma_{pp} = \sum_{i=1}^p{Var(X_i)} = \lambda_1 + \lambda_2 + ... + \lambda_p = \sum_{i=1}^p{Var(Y_i)}\).

Interestingly, Total population variance = \(\sigma_{11} + \sigma_{22} + ... + \sigma_{pp} = \lambda_1 + \lambda_2 + ... + \lambda_{p}\).

Another issues that are significant in PCA analysis are:

  1. The proportion of total variance due to (explained by) the \(\mathit{k}\)th principal component: \[\frac{\lambda_k}{(\lambda_1 + \lambda_2 + ... + \lambda_p)}, k=1, 2, ..., p\]

  2. The correlation coefficients between the components \(Y_i\) and the variables \(X_k\) is as follows: \(\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}\), i,k = 1, 2, ..., p

Please note that PCA can be performed on Covariance or ⁠correlation matrices⁠. And before PCA the data should be centered, generally.

Value

Returns a a list of dataframes

Author(s)

Ali Arminian abeyran@gmail.com

References

(1) https://builtin.com

(2) https://pca4ds.github.io/biplot-and-pca.html.

(3) Johnson, R.A. and Wichern, D.W. 2007. Applied Multivariate Statistical Analysis. Pearson Prentice Hall. 773 p.

Examples

# Case 1: for maize dataset, grain yield

data(maize)
PCA_biplot(maize) # or: PCA_biplot(maize, lowt = FALSE)

# Case 2: for days to maturity (dm) trait of chickpea

data(dm)
PCA_biplot(dm, lowt = TRUE)


[Package rYWAASB version 0.3 Index]