IRootLab
An Open-Source MATLAB toolbox for vibrational biospectroscopy
interactive_pcacrossval.m
%>@ingroup interactive
%>@file
%>@brief PCA cross-validation aims to determine the "best" number of PCs to use in PCA
%>
%> Sense of "best": the number of PCs that gives the minimum mean reconstruction error
%>
%> <h3>What is reconstruction error?</h3>
%> <ul>
%> <li> PCA scores are calculated using the loadings matrix: <code>Y = X*L</code>, where
%> <ul>
%> <li> X is the testing dataset (spectra horizontally)</li>
%> <li> Y is the PCA scores dataset</li>
%> <li> L is the loadings matrix (loadings vertically) calculated from the training dataset</li>
%> </ul></li>
%> <li> Spectra can be reconstructed [with error] as <code>X_hat = Y*L' = X*L*L'</code></li>
%> <li> The reconstruction error is calculated as <code>error = mean_all_i(norm^2(X_hat_i-X_i))</code>, where
%> <ul>
%> <li> mean_all_i(.) is the mean over all spectra in the testing dataset</li>
%> <li> X_i is the i-th spectrum (row) of the testing dataset</li>
%> <li> X_hat_i is the i-th reconstructed spectrum (row) of the testing dataset</li>
%> </ul>
%> </li>
%> </ul>
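As a language-neutral illustration of the quantities above, here is a NumPy sketch (not part of the toolbox; for simplicity the loadings come from an SVD of the same data rather than from a separate training set):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))      # dataset: 100 spectra (rows), 20 features
Xc = X - X.mean(axis=0)                 # mean-center, as PCA assumes

# Loadings via SVD; keep the first n_pc principal components
n_pc = 5
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
L = Vt[:n_pc].T                         # loadings matrix (loadings vertically)

Y = Xc @ L                              # scores:          Y = X*L
X_hat = Y @ L.T                         # reconstruction:  X_hat = Y*L' = X*L*L'

# error = mean over spectra of the squared norm of the residual
error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
```

With all 20 components kept, L would be a full orthogonal matrix and the reconstruction would be exact; truncating to n_pc columns is what introduces the error being measured.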
%>
%> <h3>Why cross-validation?</h3>
%> If you measure reconstruction error using the same dataset for training and testing, the error will almost always decrease
%> as you add more PCs to Y.
%>
%> However, if we split the dataset into training and testing datasets, we try to reconstruct samples that have been
%> left out of training. It may happen that adding PCs degrades the generalization of the model (loadings).
%>
%> <h3>The meaning of k-fold:</h3>
%>
%> First, suppose you have a dataset with, say, 500 spectra.
%>
%> <ul>
%> <li> 10-fold means that the cross-validation will split the 500 spectra 10 times into 450 training spectra and 50 testing spectra (note that splitting is not sequential; spectra are assigned to folds randomly).</li>
%> <li> 20-fold means 20 different training and testing datasets of 475 and 25 spectra respectively.</li>
%> <li> 500-fold means 500 different training and testing datasets of 499 spectra and 1 spectrum respectively, i.e., 500-fold, in this case, is equivalent to leave-one-out.</li>
%> </ul>
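The random fold assignment described above can be sketched as follows (a NumPy illustration; in the script itself, crossvalind produces a comparable vector of fold labels):

```python
import numpy as np

rng = np.random.default_rng(0)
no, k = 500, 10                          # 500 spectra, 10-fold

# Near-balanced random fold labels in 1..k, one per spectrum
labels = np.repeat(np.arange(1, k + 1), -(-no // k))[:no]
idxsmap = rng.permutation(labels)

test_mask = idxsmap == 1                 # 50 testing spectra in fold 1
train_mask = ~test_mask                  # 450 training spectra
```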
%>
%> <h3>References</h3>
%> Pirouette (Infometrix Inc.) Help Documentation, PCA cross-validation section.
%>
%> @sa fcon_pca
%>

varname = input('Enter dataset variable name: ', 's');
dataset = eval([varname ';']);
dataset = data_eliminate_var0(dataset, 1e-10); % removes features with near-zero variance
[no, nf] = size(dataset.X);
flag_kfold = input('0 - leave-one-out; 1 - k-fold cross-validation: ');
if flag_kfold
    k = input('Enter k for k-fold cross-validation: ');
else
    k = no; % leave-one-out is no-fold cross-validation
end
pc_max = input('Enter maximum number of PCs: ');


idxsmap = crossvalind('Kfold', no, k); % random fold label (1..k) per spectrum
idxs = 1:no;
rsss = zeros(1, pc_max);

% PCA block
blk_pca = fcon_pca();

for i = 1:pc_max
    fprintf(')))))))))))\n');
    fprintf('))))))))))) Number of PCs: %d of %d\n', i, pc_max);

    jj = 0;
    rss = 0; % residual sum of squares
    for j = 1:k
        idxs_test = idxs(idxsmap == j); % spectra of the j-th fold are held out
        idxs_train = idxs;
        idxs_train(idxs_test) = [];

        ds_train = dataset.map_rows(idxs_train);
        ds_test = dataset.map_rows(idxs_test);

        blk_pca.no_factors = i;
        blk_pca.boot();
        blk_pca = blk_pca.train(ds_train); % loadings come from the training fold only
        L = blk_pca.L;

        % Center the testing spectra on the training mean, project onto the
        % loadings, reconstruct, and accumulate the mean squared residual norm
        mm = repmat(mean(ds_train.X), length(idxs_test), 1);
        rss = rss+mean(sum((ds_test.X-(mm+(ds_test.X-mm)*L*L')).^2, 2));

        jj = jj+1;
        if jj == 10
            fprintf('))))))))))))) Crossval: %d/%d.\n', j, k);
            jj = 0;
        end
    end

    rsss(i) = rss/k; % mean reconstruction error across the k folds
end
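The loop above condenses into a short NumPy sketch (an illustration under the same centering convention as the script, with a plain SVD standing in for the fcon_pca block):

```python
import numpy as np

def pca_cv_curve(X, k, pc_max, seed=0):
    """Mean reconstruction error for 1..pc_max PCs under k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    no = X.shape[0]
    folds = rng.permutation(np.arange(no) % k)        # random fold label per spectrum
    rsss = np.zeros(pc_max)
    for i in range(1, pc_max + 1):                    # number of PCs kept
        rss = 0.0
        for j in range(k):
            train, test = X[folds != j], X[folds == j]
            mu = train.mean(axis=0)
            _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
            L = Vt[:i].T                              # loadings from training fold only
            resid = (test - mu) - (test - mu) @ L @ L.T
            rss += np.mean(np.sum(resid ** 2, axis=1))
        rsss[i - 1] = rss / k
    return rsss

rsss = pca_cv_curve(np.random.default_rng(1).standard_normal((60, 8)), k=5, pc_max=8)
best_n_pc = int(np.argmin(rsss)) + 1                  # the suggested number of PCs
```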


%%

figure;
plot(1:pc_max, rsss, 'k', 'LineWidth', 2);
xlabel('number of PCs');
ylabel('mean reconstruction error');
title('PCA cross-validation result');

fprintf('\n\nThe figure shows the\n [number of PCs] X [mean reconstruction error (mean sum-of-squared-error)]\n ([x] X [y])\nused.\n\nThe number of PCs to use should be\nthe one that gives the minimum y-value.\n\n');