## Contents

```
function [GenData, A,B] = CreateGenData (LeukData)
```

% (c) Karl Kuschner, College of William and Mary, Dept. of Physics, 2009. % This function creates a generated data set with parameters modeled from % the 2004 Leukemia data set. % % INPUTS % LeukData: Data repository structure with fields: % Intensities: Array of intensity values of size % #cases x #variables % Class: Vector of length "#cases", with discrete values % identifying class of each case (may be integer) % ID: Patient ID array of length #cases, with one or more cols % MZ: Vector of length "#variables" holding labels for variables % % OUTPUTS % GenData : Data structure exactly like "LeukData" % A,B : The subgroups (class 1,2) of the Intensities matrix

## Find baseline values for all peaks

GenData = LeukData; int=GenData.Intensities; intN=int(GenData.Class==1,:); clear int; intN(intN<1)=1; intN(:,200)=intN(:,148); %add an extra variable, copying a peak from Leuk GenData.MZ(200)=GenData.MZ(199)+100; % Fill in an artificial m/z mu=mean(intN); sig=std(intN); cv=sig./mu; cv(cv>.2)=.2; %from data sig=cv.*mu; clear cv;

## Build randomized peaks from baseline values

A set of spectra, with the number of cases approximating the number of unique patient identification numbers in the Leukemia data, is generated via a draw from a ~N(mu,sigma) (normal) distribution for each variable individually. mu is the value of the average spectrum at that peak position, sigma is estimated from the Leukemia data set population. At this point there should be no real distinction between any variables.

for i=1:150 GenInt(i,:)=random('normal',mu, sig); end clear i mu sig;

## Designate disease class, separate subgroups

One half of the population is designated to be in the disease class. A class vector representing this choice is created and attached to the data.

class(1:75)=1; class(76:150)=2;

## Build diagnostic features

One feature (labeled 200) is chosen as "highly diagnostic" and the mean values of the two subpopulations (normal and disease) are separated by at two times the population's average standard deviation. Specifically, the disease cases are redrawn from ~N(mu+3sig,sig).

mu=mean(GenInt); sig=.2*mu(200); v200(1:75)=normrnd(mu(200),sig, 75,1); v200(76:150)=normrnd(mu(200)+2*sig,sig, 75,1); % Note mean is mu+2sigma v200=v200';figure();bar(v200); GenInt(:,200)=v200; clear mu sig % A random fraction (about a tenth) of the total value of this % feature is placed into each of four adjacent features (labeled 195-199). % In this manner, five diagnostic features are created, correlated to the % parent feature. This procedure mimics the measurement of % adducts or modifications in the real data set, wherein slightly modified % molecules show up as a peak separate from the original. v196to9fact=(rand(150,4)*.04)+.08; % 8 to 12% for i=1:4 v196to9add(:,i)=v200.*v196to9fact(:,1); %#ok<*AGROW> end GenInt(:,196:199)=GenInt(:,196:199)+v196to9add; % Another small fraction of the value of the key peak is moved into % a feature some distance away in the list (labeled 100), representing a % multiply-charged ion (m/2z). This is repeated to a different feature % (labeld 99) for one of the adducts. v99to100fac=(rand(150,2)*.1)+.1; v99to100add=GenInt(:,199:200).*v99to100fac; GenInt(:,99:100)=GenInt(:,99:100)+v99to100add; clear v99to100add v99to100fac % Another diagnostic feature (1.5 sigma separation) is created but % not added to the feature list. Instead, a random amount of the total % value of that feature (itself chosen from a normal distribution) is % placed in two non-adjacent features (labeled 50 and 150). This % represents the breaking apart of a biomarker protein, whose mass is too % great to be detected, into several fragment molecules that are in the % range of measurement. mu=mean(GenInt(:,50)); sig=.2*mu; v201(1:75)=normrnd(mu,sig, 75,1); v201(76:150)=normrnd(mu+1.5*sig,sig, 75,1); v201=v201'; fragfac1=(rand(150,1))*.4; fragfac2=1-fragfac1; v50add=v201.*fragfac1; v150add=v201.*fragfac2; GenInt(:,50)=GenInt(:,50)+v50add; GenInt(:,150)=GenInt(:,50)+v150add; clear fragfac1 fragfac2 mu sig v150add v50add v201 % Two more features (labeled 1 and 2) are selected as "moderately % diagnostic" and the values chosen from two normal distributions whose % means are separated by about two standard deviations of either group. % Specifically, the disease cases are redrawn from ~N(mu+1.5sig,sig). One % of these two features has a portion of the other feature's value added to % it to represent an unsuccessful deconvolution of two peaks that are so % close together the peak value of one is "riding up" on the tail of % another. Two mildly diagnostic features (3 and 4) are created without % this effect. for i=1:2 mu=mean(GenInt(:,i)); sig=.2*mu; GenInt(1:75,i)=normrnd(mu,sig, 75,1); GenInt(76:150,i)=normrnd(mu+1.5*sig,sig, 75,1); end shoulder=rand(150,1)*.1+.1; % 10-20% GenInt(:,1)=GenInt(:,1)+shoulder.*GenInt(:,2); for i=3:4 mu=mean(GenInt(:,i)); sig=.2*mu; GenInt(1:75,i)=normrnd(mu,sig, 75,1); GenInt(76:150,i)=normrnd(mu+1*sig,sig, 75,1); end clear i mu sig shoulder

## Replicate and de-normalize cases

The cases are replicated three times (the original of each case is discarded) by multiplying each value by normalization factor. For a single data vector X a factor f is first selected from ~U(0.5, 2.0) to replicate the range of total ion current normalization factors found in the Leukemia data.

for i=1:150; for j=1:3 casenum=(i-1)*3+j; % thiscase=GenInt(i,:); basefactor = rand*1.5+.5; % factorvec=rand(1,200)*.1+basefactor; FinInt(casenum, :)=GenInt(i,:)*basefactor; FinCl(casenum)=class(i); FinID(casenum,1)=i; %patient ID FinID(casenum,2)=j; %replicate number end end % de-normalization figure();imagesc(log(GenInt)); GenData.Intensities=FinInt; GenData.ID=FinID; GenData.Class=FinCl'; Int=GenData.Intensities; A=Int(1:225,:); B=Int(226:450,:);clear Int;

end %of CreateGenData function