1 %> @brief Dataset class
3 %> <h3>The @c X property</h3>
4 %> The @c X property has dimensions [@ref no]x[@ref nf]. Each row represents one physical spectrum. Each column represents a
7 %> <h3>Classes and Class Levels</h3>
8 %> In IRootLab, dataset @c classes are 0-based, so valid classes will range from @c 0 to @c (\ref nc-1). @c Classes correspond to elements in the
9 %> @c classlabels property.
13 %> The class labels may define a <bold>multi-level labelling system</bold>, with different
14 %> levels separated by a vertical bar ("|"). See the example:
16 %> dataset1.classlabels = { ...
25 %> In the example above, the first level represents the cancer grade, whereas the second level represents the country.
26 %> If a spectrum was taken from an individual who is from Ireland and has Low-grade cancer, its class will be 4 (remember
32 %> number of "observations" (e.g. spectra)
34 %> number of features (i.e., variables)
43 %> [no]x[nf] matrix. Data matrix
45 %> [no]x[1] vector. Classes. Zero-based (first class is class zero).
47 %> Classes may be negative, with special meanings for negative values (see @ref get_negative_meaning.m)
49 %> Cell of strings. Class labels
52 %> (optional) [no]x[1] Cell of strings. Group codes (e.g. patient names)
54 %> (optional) [no]x[1] Cell of strings. Observation names (e.g. file names of the individual spectra)
63 %> (optional) Cell of strings. Name of each feature
66 %> x-axis name, defaults to 'Wavenumber (cm^{-1})'
68 %> x-axis unit, defaults to 'cm^{-1}'
70 %> y-axis name, defaults to 'Absorbance'
72 %> y-axis unit, defaults to 'a.u.'
77 %> Height of image. Spectra start counting from the bottom left upwards.
79 %> Width of image. Width is actually calculated as @c no/height . If result is not integer, an error will occur.
81 %> ='ver'. States how the pixels are organized.
82 %> 'ver': bottom-up, left-right
83 %> 'hor': left-right, bottom-up
86 %> Output (instead of classes). For regression instead of classification
89 %> For easier access than groupcodes
96 properties(SetAccess=protected)
97 %> fields to be split or merged when dataset is split or merged
98 rowfieldnames = {'groupcodes', 'groupnumbers', 'obsnames', 'obsids', 'classes', 'X', 'Y', 'splitidxs'};
99 flags_cell = [1, 0, 1, 0, 0, 0, 0, 0];
102 methods(Access=protected)
104 function s = do_get_html(o)
107 s = cat(2, s, '<h1>Data classes</h1><center>', 10);
108 nl = o.get_no_levels();
111 % List of class labels with number of spectra and number of groups per class
116 cc(1, 1:4) = {'Index', 'Class label', 'Number of rows', 'Number of groups'};
118 cc(j+1, 1:4) = {j, pie(j).classlabels{1}, pie(j).no, pie(j).no_groups};
120 cc(end, 1:4) = {'', 'Total', da.no, da.no_groups};
124 % PER-LEVEL list of class labels with number of spectra and number of groups per class
127 s = cat(2, s, '<h2>Level ', int2str(i), '</h2>', 10);
133 cc(1, 1:3) = {'Class label', 'Number of rows', 'Number of groups'};
135 cc(j+1, 1:3) = {pie(j).classlabels{1}, pie(j).no, pie(j).no_groups};
137 cc(end, 1:3) = {'Total', da.no, da.no_groups};
143 % Negative classes (outliers etc)
144 if any(o.classes < 0)
145 s = cat(2, s, '<h2>Negative classes</h2>', 10);
146 neg = unique(o.classes(o.classes < 0));
149 cc = cell(nneg+1, 3);
150 cc(1, 1:3) = {'Class', 'Meaning', 'Number of rows'};
158 s = cat(2, s, '<hr />', 10);
160 % Class means (figure)
165 s = cat(2, s, irreport.save_n_close());
169 if ~isempty(o.groupcodes)
170 s = cat(2, s, '<h2>Group list</h2>', 10);
172 gg = unique(o.groupcodes);
173 gg = gg(:); % To make sure that it is a column vector
176 gg{i, 2} = sum(strcmp(o.groupcodes, gg{i, 1}));
178 gg = [{'Group code', 'Number of rows'}; gg]; % Adds title
182 s = cat(2, s, '</center>', 10, do_get_html@irobj(o));
190 data.classtitle = 'Dataset';
191 data.color = [5, 171, 191]/255;
197 function z = get.no(data)
202 function z = get.nf(data)
207 function z = get.nonf(data)
212 function z = get.nc(data)
213 z = length(data.classlabels);
217 function z = get.no_groups(data)
218 z = length(unique(data.groupcodes));
221 function z = get.width(data)
226 z = data.no/data.height;
228 irerror('Number of rows is not divisible by the dataset height!');
232 %> Converts group codes to group indexes
233 %> Indexes will point to the "unique(data.groupcodes)" vector
234 function idxs = get_groupidxs_from_groupcodes(data, codes)
237 ref = unique(data.groupcodes);
239 ii = find(strcmp(codes{i}, ref));
241 irerror(sprintf('Group code %s not present in dataset!', codes{i}));
247 %> Converts group indexes to observation indexes
248 %> CAUTION: be sure that groupidxs contains indexes that point to
249 %> the "unique(data.groupcodes)" vector
250 function obsidxs = get_obsidxs_from_groupidxs(data, groupidxs)
251 group_code_list = unique(data.groupcodes);
252 v = 1:data.no; %> index vector
254 %> Counts to pre-allocate
256 for i = 1:length(groupidxs)
257 cnt = cnt+sum(strcmp(group_code_list{groupidxs(i)}, data.groupcodes));
260 %> Loops again to fill
261 obsidxs = zeros(1, cnt);
263 for i = 1:length(groupidxs)
264 vtemp = v(strcmp(group_code_list{groupidxs(i)}, data.groupcodes));
265 vlen = length(vtemp);
266 obsidxs(ptr:ptr+vlen-1) = vtemp;
271 %> Returns the number of levels in @c classlabels
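%> Example (an illustrative sketch on a plain cell array; the labels are hypothetical). The number of
%> levels is taken from the label with the most "|" separators:
%> @code
%> labels = {'Low|Ireland', 'High|Ireland', 'Normal'};
%> nl = 0;
%> for i = 1:numel(labels)
%>     nl = max(nl, sum(labels{i} == '|')+1);
%> end
%> % nl is 2, because the two-part labels contain one separator each
%> @endcode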
272 function nl = get_no_levels(data)
275 nl = max(nl, sum(data.classlabels{i} == '|')+1);
280 %> Checks if internal variables are synchronized with some troubleshooting.
281 function data = check(data)
282 if isempty(data.fea_x) || ~isempty(data.X)
283 data.fea_x = 1:data.nf;
289 %> Copies structure fields to object fields
290 %> Contains a dictionary with many old property names for backward
292 %> Also works when the input is an object.
293 function data = import_from_struct(data, DATA)
294 temp = setxor(properties(data)', {'nf', 'nonf', 'no_groups', 'nc', 'width'})';
295 propmap = repmat(temp, 1, 2);
296 propmap = [propmap; {'x', 'fea_x'; 'class_labels', 'classlabels'; 'idspectrum_s', 'obsids'; 'file_names', 'obsnames'; ...
297 'colony_codes', 'groupcodes'; 'group_codes', 'groupcodes'}]; % Names that changed over time
300 if ~isa(DATA, 'irdata') && ~isa(DATA, 'struct')
301 irerror(['DATA argument is of class "', class(DATA), '" but should be "irdata"']);
305 for i = 1:size(propmap, 1)
306 sold = propmap{i, 1};
307 snew = propmap{i, 2};
308 if ismember(sold, ff)
309 data.(snew) = DATA.(sold);
321 %> retains only labels corresponding to classes that exist in the
322 %> dataset, and classes are renumbered accordingly
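%> Example (a minimal sketch on plain variables rather than a dataset object; the values are hypothetical):
%> @code
%> classes = [0; 2; 2; -1];
%> classlabels = {'A', 'B', 'C'};
%> uncl = unique(classes(classes >= 0)); % [0; 2], label 'B' is unused
%> classlabels = classlabels(uncl+1);    % {'A', 'C'}
%> for j = 1:numel(uncl)
%>     classes(classes == uncl(j)) = j-1;
%> end
%> % classes is now [0; 1; 1; -1]; the negative class is left untouched
%> @endcode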
323 function data = eliminate_unused_classlabels(data)
324 uncl = unique(data.classes(data.classes >= 0)); % gets used classes
325 data.classlabels = data.classlabels(uncl+1);
327 for j = 1:numel(uncl)
328 data.classes(data.classes == uncl(j)) = j-1;
332 %> @brief Populates from a time series
334 %> This function makes X and Y. X will be a Toeplitz matrix.
338 %> @param signal vector s(n)
339 %> @param no_inputs dimensionality of the input data space (aka number of features or nf)
340 %> @param future "prediction task", which will be to predict s(n+future)
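%> Example (a sketch of the layout of a single row only, following the indexing used in the loop body;
%> the signal values are hypothetical):
%> @code
%> signal = [10 20 30 40 50 60];
%> no_inputs = 3;
%> future = 1;
%> i = 1;
%> x_row = signal(i+no_inputs-1:-1:i);   % [30 20 10], i.e. [s(n) s(n-1) s(n-2)]
%> y_val = signal(i+no_inputs-1+future); % 40, i.e. s(n+future)
%> @endcode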
341 function data = mount_from_signal(signal, no_inputs, future)
343 len_signal = length(signal);
346 %>This is the maximum number of rows of the dataset before something blows
347 no_rows = len_signal-no_inputs-future;
351 X = zeros(no_rows, no_inputs);
352 Y = zeros(no_rows, 1);
355 %> each data row will stand for [s(n) s(n-1) s(n-2) ...]. This way the dot product between the row and the
356 %> coefficients of a linear filter is a causal convolution.
357 X(i, :) = signal(i+no_inputs-1:-1:i);
358 Y(i, 1) = signal(i+no_inputs-1+future);
367 %> Gets a list with all properties except the ones that will be
369 function pp = get_props_to_copy(data)
370 pp = setxor(properties(data), data.rowfieldnames);
371 pp = setxor(pp, {'flag_params', 'rowfieldnames', 'flags_cell'});
374 %> Makes copy with empty fields whose names are in .rowfieldnames
375 %> Additionally, resets .height
376 function dnew = copy_emptyrows(data)
379 rr = data.rowfieldnames;
382 if data.flags_cell(i)
390 % pp = data.get_props_to_copy();
391 % TIMEFSG = TIMEFSG+toc(tt);
392 % dnew = feval(class(data));
393 % for j = 1:length(pp)
394 % dnew.(pp{j}) = data.(pp{j});
406 %> Splits dataset into one or more datasets using row maps
408 %> @param map 1D or 2D cell array of row indexes
409 %> @param feamap (optional
410 %> @retval out Matrix of datasets.
411 function out = split_map(data, map, feamap, fext)
417 flag_fext = nargin > 3 && ~isempty(fext);
419 if ~fext.flag_trainable
420 irerror('fext parameter must be trainable, otherwise it does not make sense!');
425 flag_feamap = nargin >= 3 && ~isempty(feamap);
427 rr = data.rowfieldnames;
429 flags = arrayfun(@(k) ~isempty(data.(rr{k})), 1:nr);
431 [nrow, ncol] = size(map);
433 dnew0 = data.copy_emptyrows(); % Model copy
435 if nrow == 0 || ncol == 0
439 for i = nrow:-1:1 % Goes backwards to pre-allocate although MATLAB doesn't know.
441 %> prepares a clone, except for the fields in rowfieldnames
444 %> maps the rowfieldnames fields
447 if flags(k) %> maps only the fields that are not empty. This allows fields to
448 %> be used or not as necessary and no error will occur.
449 dnew.(rr{k}) = data.(rr{k})(idxs, :);
453 dnew = dnew.eliminate_unused_classlabels();
458 fextnow = fextnow.train(dnew);
460 dnew = fextnow.use(dnew);
467 out(i, j, :) = dnew.select_features(feamap);
471 %s = cat(2, s, sprintf('+split%d,%d+', i, j));
474 %s = cat(2, s, sprintf('>-\n'));
482 %> Splits dataset into one or more datasets using its own splitidxs property
484 %> @param map 1D or 2D cell array of row indexes
485 %> @retval out Matrix of datasets.
486 function out = split_splitidxs(data)
491 %> Maps rows. Single-output version of split_map()
493 %> Returns new object
494 function out = map_rows(data, idxnew)
495 out = data.split_map({idxnew});
499 %> Manual feature selection.
502 %> idxs: list of column indexes to select, or cell thereof
503 function out = select_features(data, idxs)
507 out.X = data.X(:, idxs);
508 if ~isempty(data.fea_x)
509 out.fea_x = out.fea_x(idxs);
511 if ~isempty(out.fea_names)
512 out.fea_names = out.fea_names(idxs);
518 out(nell) = data; % Pre-allocation
523 out(i).X = data.X(:, idxs{i});
524 if ~isempty(data.fea_x)
525 out(i).fea_x = out(i).fea_x(idxs{i});
527 if ~isempty(data.fea_names)
528 out(i).fea_names = out(i).fea_names(idxs{i});
532 %> irverbose(sprintf('INFO (data_select_features()): # features before: %d; # features after: %d.\n', nfold, data.nf));
536 %> @brief Transforms dataset using loadings matrix L
538 %> data.X = data.X*L;
539 %> data.xname = 'Factor';
540 %> data.yname = 'Score';
542 %> @param L[nf][any] Loadings matrix
543 %> @param L_fea_prefix=[] Prefix to make new feature names.
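%> Example (an illustrative sketch on a plain matrix rather than a dataset object; the values and the
%> 'PC' prefix are hypothetical):
%> @code
%> X = [1 2; 3 4];               % [no]x[nf] = 2x2
%> L = [1; 1];                   % [nf]x[1] loadings matrix
%> X = X*L;                      % [3; 7], a single "factor" column
%> fea_x = 1:size(X, 2);         % 1
%> fea_name = ['PC' int2str(1)]; % 'PC1'
%> @endcode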
544 function data = transform_linear(data, L, L_fea_prefix)
546 data.fea_x = 1:data.nf;
547 data.xname = 'Factor';
549 data.yname = 'Score';
552 % Makes feature names
553 if exist('L_fea_prefix', 'var') && ~isempty(L_fea_prefix)
554 data.fea_names = cell(1, data.nf);
556 data.fea_names{i} = [L_fea_prefix int2str(i)];
564 %> @brief Returns the names of the features.
566 %> This function checks the \c fea_names property and if it is empty, it makes feature names on-the-fly using
567 %> the \c fea_x property.
569 %> @param idxs Optional list of indexes to be returned
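%> Example (a sketch of the on-the-fly naming used when \c fea_names is empty; the fea_x values are hypothetical):
%> @code
%> fea_x = [1800.04 1796.18];
%> names = cell(1, 2);
%> for i = 1:2
%>     names{i} = sprintf('Feature %g', round(fea_x(i)*10)/10);
%> end
%> % names is {'Feature 1800', 'Feature 1796.2'}: values are rounded to one decimal place
%> @endcode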
570 function names = get_fea_names(data, idxs)
571 if ~exist('idxs', 'var')
575 if ~isempty(data.fea_names)
576 names = data.fea_names(idxs);
578 names = cell(1, length(idxs));
579 for i = 1:length(idxs)
580 names{i} = sprintf('Feature %g', round(data.fea_x(idxs(i))*10)/10);
585 %> @brief Fills in the @ref groupnumbers property based on the @ref groupcodes property.
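%> Example (a minimal sketch on plain variables rather than a dataset object; the group codes are hypothetical):
%> @code
%> groupcodes = {'P2'; 'P1'; 'P2'};
%> codes = unique(groupcodes);   % {'P1'; 'P2'}, sorted
%> groupnumbers = zeros(3, 1);
%> for i = 1:numel(codes)
%>     groupnumbers(strcmp(codes{i}, groupcodes)) = i;
%> end
%> % groupnumbers is [2; 1; 2]
%> @endcode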
586 function data = make_groupnumbers(data)
587 if isempty(data.groupcodes)
588 irverbose('INFO: Dataset groupcodes is empty!', 1);
592 % Determines the groups
593 codes = unique(data.groupcodes);
595 data.groupnumbers = zeros(data.no, 1);
598 data.groupnumbers(strcmp(codes{i}, data.groupcodes)) = i;
603 %> @brief Makes the dataset properties consistent with each other
605 %> This is both an assertion routine and a "fixing" routine. The two parts are implemented sequentially, so it will be easy to split
606 %> this in the future.
608 %> The assertion part will do a number of checks and throw an error if there is no hope of making it a consistent dataset. Fatal
610 %> @arg non-empty row fields (listed in the @ref irdata::rowfieldnames read-only property) of different sizes
613 %> The subsequent fix part may perform a number of operations on the dataset:
614 %> @arg autofill the "classes" vector if it is empty (and create a default class label)
615 %> @arg autogenerate the "fea_x" vector if it is empty
616 %> @arg add elements to @ref irdata::classlabels if class numbers surpass the number of labels
618 function data = assert_fix(data)
621 %%% Consistent number of rows
623 for i = 1:numel(data.rowfieldnames)
624 ni = size(data.(data.rowfieldnames{i}), 1);
633 irerror(sprintf('Fields "%s" and "%s" have different numbers of rows!', data.rowfieldnames{iref}, data.rowfieldnames{i}));
639 %%% Consistent fea_x and nf
640 if ~isempty(data.fea_x) && size(data.fea_x, 2) ~= data.nf
641 irerror(sprintf('dataset has %d features, but x-axis vector has %d elements!', data.nf, size(data.fea_x, 2)));
649 if data.no > 0 && isempty(data.classes)
650 data.classes = zeros(data.no, 1);
651 data.classlabels = {'Class 0'};
655 if data.nf > 0 && isempty(data.fea_x)
656 data.fea_x = 1:data.nf;
660 if ~isempty(data.classlabels) && ~iscell(data.classlabels)
661 irerror('"classlabels" must be a cell!');
663 nceff = max(data.classes)+1;
664 ncthought = numel(data.classlabels);
666 nl = data.get_no_levels();
667 ss = char('|'*ones(1, nl-1));
668 for i = nceff:-1:ncthought+1
669 data.classlabels{i} = sprintf('Class %d%s', i-1, ss);
674 %> Gets weights for each class
676 %> Weights are inversely proportional to the number of observations in each class.
678 %> Weights are normalized, so that their sum equals one
679 %> @param exponent =1. Exponent to power all weights before they are normalized to sum=1
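%> Example (a sketch on plain variables rather than a dataset object; it assumes the exponent is applied
%> element-wise before normalization, as the description above states; the values are hypothetical):
%> @code
%> classes = [0; 0; 0; 1];
%> nc = 2;
%> exponent = 1;
%> ww = zeros(1, nc);
%> for i = 1:nc
%>     ww(i) = 1/sum(classes == (i-1)); % [1/3, 1]
%> end
%> ww = ww.^exponent;
%> ww = ww/sum(ww);                     % [0.25, 0.75], the rarer class weighs more
%> @endcode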
680 function ww = get_weights(data, exponent)
681 ww = zeros(1, data.nc);
683 ww(i) = 1/sum(data.classes == (i-1));
686 if nargin > 1 && ~isempty(exponent)
694 %> Changes direction and swaps width and height
696 %> This is called "transpose2" because MATLAB objects have a built-in
697 %> "transpose" already
698 function data = transpose2(data)
701 irerror('Height not provided and dataset does not have height.');
705 irerror(sprintf('Height %d not divisible by %d!', hei, data.no));
708 if strcmp(data.direction, 'hor')
709 data.direction = 'ver';
711 data.direction = 'hor';
716 %> Asserts that there is no NaN in data.X
717 function data = assert_not_nan(data)
718 if any(isnan(data.X(:))) % (:) makes the test scalar; any() on a matrix returns one value per column
719 irerror('Dataset X property has NaNs!!!');
721 if any(isnan(data.classes))
722 irerror('Dataset classes property has NaNs!!!');