%>@brief PCA cross-validation aims to determine the "best" number of PCs to use in PCA
%>
%> Sense of "best": the number of PCs that gives the minimum mean reconstruction error
%> <h3>What is reconstruction error?</h3>
%> <li> PCA scores are calculated using the loadings matrix: <code>Y = X*L</code>, where
%> <li> X is the testing dataset (spectra horizontally)</li>
%> <li> Y is the PCA scores dataset</li>
%> <li> L is the loadings matrix (loadings vertically) calculated from the training dataset</li>
%> <li> Spectra can be reconstructed [with error] by <code>X_hat = Y*L' = X*L*L'</code></li>
%> <li> The reconstruction error is calculated as <code>error = mean_all_i(norm(X_hat_i - X_i)^2)</code>, where
%> <li> mean_all_i(.) denotes the mean over all spectra in the testing dataset</li>
%> <li> X_i is the i-th spectrum (row) in the testing dataset</li>
%> <li> X_hat_i is the i-th reconstructed spectrum (row) of the testing dataset</li>
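%>
%> For illustration, a minimal standalone sketch of this computation (hypothetical
%> variable names; assumes the Statistics Toolbox <code>pca()</code> function, and that
%> <code>Xtrain</code> and <code>Xtest</code> have been centered with the training mean):
%> <pre>
%> L_full = pca(Xtrain);             % loadings calculated from the training dataset
%> L = L_full(:, 1:no_pcs);          % keep the first no_pcs loadings
%> X_hat = Xtest*L*L';               % reconstructed testing spectra
%> err = mean(sum((X_hat - Xtest).^2, 2)); % mean squared reconstruction error
%> </pre>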
%> <h3>Why cross-validation?</h3>
%> If you measure reconstruction error using the same dataset for training and testing,
%> the error will decrease monotonically as you add more PCs to Y.
%> However, if we split the dataset into training and testing datasets, we try to reconstruct
%> samples that were left out of training. It may happen that adding PCs degrades the
%> generalization of the model (loadings).
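%>
%> This can be checked directly with the sketch above (same hypothetical names,
%> plus a maximum number of PCs <code>pc_max</code>): the in-sample curve keeps falling,
%> whereas the out-of-sample curve typically reaches a minimum and rises again.
%> <pre>
%> err_in = zeros(1, pc_max);
%> err_out = zeros(1, pc_max);
%> for n = 1:pc_max
%>     L = L_full(:, 1:n);
%>     err_in(n) = mean(sum((Xtrain*L*L' - Xtrain).^2, 2)); % training data: always decreases
%>     err_out(n) = mean(sum((Xtest*L*L' - Xtest).^2, 2));  % held-out data: may rise
%> end
%> </pre>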
%> <h3>The meaning of k-fold</h3>
%> First, suppose you have a dataset with, say, 500 spectra.
%> <li> 10-fold means that the cross-validation will split the 500 spectra 10 times into 450 training spectra and 50 testing spectra (splitting is not sequential; spectra are assigned to the folds randomly).</li>
%> <li> 20-fold means 20 different training and testing datasets of 475 and 25 spectra respectively.</li>
%> <li> 500-fold means 500 different training and testing datasets of 499 and 1 spectrum respectively; i.e., 500-fold is, in this case, equivalent to leave-one-out.</li>
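%>
%> The script below uses <code>crossvalind()</code> (Bioinformatics Toolbox) to produce this
%> random split; for the 500-spectra, 10-fold example:
%> <pre>
%> idxsmap = crossvalind('Kfold', 500, 10); % random fold label (1..10) per spectrum
%> sum(idxsmap == 1) % size of the first testing fold: about 50 spectra
%> </pre>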
%> <h3>References</h3>
%> Pirouette (Infometrix Inc.) Help Documentation, "PCA Cross-validation" section.
varname = input('Enter dataset variable name: ', 's');
dataset = eval(varname); % fetch the dataset from the workspace by name
[no, nf] = size(dataset.X); % number of spectra and number of features
flag_kfold = input('0 - leave-one-out; 1 - k-fold cross-validation: ');
if flag_kfold
    k = input('Enter k for k-fold cross-validation: ');
else
    k = no; % leave-one-out is k-fold with one spectrum per fold
end
pc_max = input('Enter maximum number of PCs: ');

idxs = 1:no;
idxsmap = crossvalind('Kfold', no, k); % random fold label (1..k) per spectrum

rsss = zeros(1, pc_max); % mean reconstruction error per number of PCs
for i = 1:pc_max
    fprintf(')))))))))))\n');
    fprintf('))))))))))) Number of PCs: %d of %d\n', i, pc_max);
    rss = 0; % residual sum of squares, accumulated over the folds
    for j = 1:k
        % Training and testing row indexes for the j-th fold
        idxs_test = idxs(idxsmap == j);
        idxs_train = idxs;
        idxs_train(idxs_test) = [];
        ds_train = dataset.map_rows(idxs_train);
        ds_test = dataset.map_rows(idxs_test);

        blk_pca = fcon_pca(); % PCA block (IRootLab's fcon_pca assumed; adjust if yours differs)
        blk_pca.no_factors = i;
        blk_pca = blk_pca.train(ds_train); % loadings come from the training dataset only
        ds_train_pca = blk_pca.use(ds_train); % training scores (not needed below)
        L = blk_pca.L; % loadings matrix

        % Center the testing spectra with the training mean, project onto the
        % loadings, reconstruct, and accumulate the mean squared error
        mm = repmat(mean(ds_train.X), length(idxs_test), 1);
        rss = rss+mean(sum((ds_test.X-(mm+(ds_test.X-mm)*L*L')).^2, 2));
        fprintf('))))))))))))) Crossval: %d/%d.\n', j, k);
    end
    rsss(i) = rss/k; % average over the k folds
end

plot(1:pc_max, rsss, 'k', 'LineWidth', 2);
xlabel('number of PCs');
ylabel('mean reconstruction error');
title('PCA cross-validation result');
fprintf(['\n\nThe figure shows the mean reconstruction error (mean sum-of-squared-errors)\n', ...
    'as a function of the number of PCs.\n\nThe number of PCs to use should be\n', ...
    'the one that gives the minimum y-value.\n\n']);
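
% A convenient follow-up (not part of the original script): report the number
% of PCs that minimizes the curve directly.
[err_min, no_pcs_best] = min(rsss);
fprintf('Minimum mean reconstruction error: %g, reached with %d PCs\n', err_min, no_pcs_best);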