PCA cross-validation aims to determine the "best" number of PCs to use in PCA.
Sense of "best": the number of PCs that gives the minimum mean reconstruction error.
What is reconstruction error?
- PCA scores are calculated using the loadings matrix: Y = X*L, where
  - X is the testing dataset (spectra as rows)
  - Y is the PCA scores dataset
  - L is the loadings matrix (loadings as columns) calculated from the training dataset
- Spectra can be reconstructed (with error) by X_hat = Y*L' = X*L*L'
- The reconstruction error is calculated as error = mean_all_i(norm^2(X_hat_i - X_i)), where
  - mean_all_i(.) is the mean over all spectra in the testing dataset
  - X_i is the i-th spectrum (row) of the testing dataset
  - X_hat_i is the i-th reconstructed spectrum (row) of the testing dataset
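The steps above can be sketched in NumPy (an illustration only, not the toolbox's MATLAB code; the random data and the choice to center with the training mean are assumptions made here):

```python
import numpy as np

# Hypothetical data standing in for spectra: rows are spectra.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))   # 100 training spectra, 50 wavelengths
X_test = rng.normal(size=(20, 50))     # 20 testing spectra

k = 5  # number of PCs to keep

# Loadings from the training set: top-k right singular vectors, as columns.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
L = Vt[:k].T                           # loadings matrix (loadings as columns)

# Scores and reconstruction of the (centered) testing dataset.
Xt = X_test - mu
Y = Xt @ L                             # Y = X*L
X_hat = Y @ L.T                        # X_hat = Y*L' = X*L*L'

# Mean squared reconstruction error over all testing spectra.
error = np.mean(np.sum((X_hat - Xt) ** 2, axis=1))
print(error)
```

Because the columns of L are orthonormal, the reconstruction is a projection, so the error is never larger than the energy of the test spectra themselves.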
Why cross-validation?
If you measure reconstruction error using the same dataset for training and testing, the error will keep decreasing as you add more PCs to Y.
However, if we split the dataset into training and testing datasets, we try to reconstruct samples that were left out of training. It may happen that adding PCs degrades the generalization of the model (loadings).
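A small NumPy demonstration of the first point (synthetic data, assumed here only for illustration): when training and testing on the same spectra, the reconstruction error can only go down as PCs are added, reaching zero at full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))          # 60 spectra, 30 wavelengths
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Reconstruction error on the SAME data used to fit the loadings,
# for an increasing number of PCs.
errs = []
for k in range(1, 31):
    L = Vt[:k].T
    X_hat = Xc @ L @ L.T
    errs.append(np.mean(np.sum((X_hat - Xc) ** 2, axis=1)))

# errs is non-increasing, and essentially zero once all 30 PCs are kept.
print(errs[0], errs[-1])
```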
The meaning of k-fold:
First, suppose you have a dataset with, say, 500 spectra.
- 10-fold means that the cross-validation will split the 500 spectra 10 times into 450 training spectra and 50 testing spectra (note that splitting is not sequential: spectra are assigned to folds randomly).
- 20-fold means 20 different training and testing datasets of 475 and 25 spectra respectively.
- 500-fold means 500 different training and testing datasets of 499 and 1 spectrum respectively, i.e., 500-fold, in this case, is equivalent to leave-one-out.
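Putting the pieces together, the whole procedure can be sketched as follows. This is a NumPy illustration, not the toolbox's MATLAB implementation; the function name `pca_crossval_error` and the choice to center each fold with its training mean are assumptions made here.

```python
import numpy as np

def pca_crossval_error(X, n_pcs, n_folds, seed=0):
    """Mean reconstruction error for 1..n_pcs PCs, estimated by k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # spectra are assigned randomly
    folds = np.array_split(idx, n_folds)
    errors = np.zeros(n_pcs)
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        mu = X[train_idx].mean(axis=0)
        # Loadings come from the training folds only.
        _, _, Vt = np.linalg.svd(X[train_idx] - mu, full_matrices=False)
        Xt = X[test_idx] - mu
        for k in range(1, n_pcs + 1):
            L = Vt[:k].T
            X_hat = (Xt @ L) @ L.T           # reconstruct the left-out spectra
            errors[k - 1] += np.sum((Xt - X_hat) ** 2)
    # Each spectrum is left out exactly once, so dividing by the total
    # number of spectra gives mean_all_i(.) over the whole dataset.
    return errors / len(X)

# Example: 500 spectra, 10-fold => ten 450/50 train/test splits.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 40))
errs = pca_crossval_error(X, n_pcs=10, n_folds=10)
best_k = int(np.argmin(errs)) + 1            # "best" number of PCs
```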
References
Pirouette (Infometrix Inc.) Help Documentation, PCA cross-validation Section.
See also
- fcon_pca
Definition in file interactive_pcacrossval.m.