GSS Model
Description
This computational model predicts breast cancer prognosis from
microarray gene expression data. It uses prior knowledge in the
form of pre-specified gene sets from the Molecular Signatures Database
(MSigDB).
We compare features derived from the gene sets with
features based on individual genes, with respect to the following
criteria:
- discrimination: the ability to predict metastasis within 5
years, in terms of both average performance and its variance;
- stability of the ranks of individual features within datasets;
- concordance between the weights and ranks of features from different datasets;
- the underlying biological processes pointed to by the features.
The purpose of the set statistic is to reduce the set's expression
matrix to a single vector, which is then used as a feature for
classification. The intention is for the set statistic to be
representative of the expression levels of the set, in a useful way.
The set statistics used in this work are all unsupervised, in the
sense that they do not take into account the metastatic class. They
are:
- Set centroid
- Set median
- Set medoid
- Set t-statistic
- U-statistic p-value
- 1st principal component of the set (set PC)
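
As an illustration of how a set statistic reduces a set's expression matrix to a single vector, here is a minimal Python sketch of three of the statistics above. It assumes the common textbook definitions (centroid as the per-sample mean over the set's genes, median as the per-sample median, medoid as the gene profile with the smallest total Euclidean distance to the other genes in the set); the definitions used in the GeneSets R package may differ in detail. The function names and the toy values are hypothetical.

```python
from math import dist
from statistics import mean, median

# Rows are genes in the set, columns are samples.

def set_centroid(rows):
    """Per-sample mean over the genes in the set."""
    return [mean(col) for col in zip(*rows)]

def set_median(rows):
    """Per-sample median over the genes in the set."""
    return [median(col) for col in zip(*rows)]

def set_medoid(rows):
    """Gene profile with the smallest total Euclidean distance to the others."""
    return min(rows, key=lambda r: sum(dist(r, other) for other in rows))

# Toy gene set: 3 genes x 4 samples (hypothetical values).
gene_set = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 5.0, 1.0, 6.0],
]

centroid = set_centroid(gene_set)  # one value per sample
med = set_median(gene_set)         # one value per sample
medoid = set_medoid(gene_set)      # an actual row of the matrix
```

Each function returns one value per sample, so the result can be used as a single feature in place of the individual genes of the set.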
To measure the concordance between datasets, we perform internal and
external validation. For internal validation, we use repeated random
subsampling within each dataset to estimate the classifier's
generalisation performance, as measured by AUC (area under the
receiver operating characteristic curve); the subsampling is also used
to form a bagged classifier for each dataset. For external validation,
the bagged classifier from each dataset is used to predict the
metastatic class of patients from another dataset.
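
The internal validation loop can be sketched as follows, in Python. The AUC is computed as the proportion of (positive, negative) pairs in which the positive sample receives the higher score, with ties counting as one half. The number of repeats, the training fraction and the seed are assumptions made for the example; this page does not state the values used in the study, and the function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def random_splits(n_samples, n_repeats, train_fraction, seed=0):
    """Yield (train_indices, test_indices) for repeated random subsampling."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield sorted(idx[:n_train]), sorted(idx[n_train:])

# Toy check of the AUC: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])

# Five random train/test splits of ten samples (assumed settings).
splits = list(random_splits(n_samples=10, n_repeats=5, train_fraction=2 / 3))
```

In each split, a classifier would be trained on the training indices and scored by AUC on the held-out indices; averaging the per-split models is one way to obtain a bagged classifier.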
Process
There are three main steps in the GSS analysis:
- preprocessing the input data into a usable form
- creating the gene set datasets from the expression data based on MSigDB lists
- running the analysis on the gene set data
Preprocessing
- Affymetrix quality-control probesets are removed.
- Samples are matched with annotation.
- Data are log2 transformed if they are on the original scale.
- Samples with non-informative censoring (censored before 5 years) are removed.
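
The last two preprocessing steps can be sketched in Python as below. The record fields (`id`, `time_years`, `had_event`, `expr`) and the sample values are hypothetical, and the sketch assumes the caller already knows that the data are on the original scale and strictly positive. A sample censored before the 5-year horizon is dropped because its 5-year outcome is unknown.

```python
from math import log2

def is_uninformative(time_years, had_event, horizon=5.0):
    """True if follow-up ended before the horizon without an event."""
    return (not had_event) and time_years < horizon

def five_year_label(time_years, had_event, horizon=5.0):
    """1 if metastasis occurred within the horizon, otherwise 0."""
    return 1 if (had_event and time_years <= horizon) else 0

def log2_transform(values):
    """log2 of strictly positive expression values on the original scale."""
    return [log2(v) for v in values]

# Hypothetical samples: follow-up time, event indicator, raw expression.
samples = [
    {"id": "s1", "time_years": 2.0, "had_event": True, "expr": [8.0, 16.0]},
    {"id": "s2", "time_years": 3.0, "had_event": False, "expr": [4.0, 32.0]},
    {"id": "s3", "time_years": 7.0, "had_event": True, "expr": [2.0, 64.0]},
    {"id": "s4", "time_years": 6.0, "had_event": False, "expr": [1.0, 1024.0]},
]

kept = [s for s in samples
        if not is_uninformative(s["time_years"], s["had_event"])]
labels = {s["id"]: five_year_label(s["time_years"], s["had_event"]) for s in kept}
log_expr = {s["id"]: log2_transform(s["expr"]) for s in kept}
```

Here `s2` is removed (censored at 3 years), while `s3` is kept with label 0 because its event occurred after the 5-year horizon.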
Creating the gene sets
- The MSigDB lists are mapped to Affymetrix probesets
- For each set in each expression dataset, the set statistics listed in the Description are computed.
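
The mapping step can be illustrated with the following Python sketch. It assumes the gene set lists are in the tab-separated GMT layout (set name, description, then gene symbols on one line) and that a gene-to-probeset annotation is available as a dictionary; the parsing and mapping in make-msigdb.sh may work differently. All set names, gene names and probeset names below are placeholders.

```python
def parse_gmt(lines):
    """Parse GMT lines: set name, description, then gene symbols (tab-separated)."""
    sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines without any genes
        sets[fields[0]] = fields[2:]
    return sets

def map_to_probesets(gene_sets, gene_to_probesets, available):
    """Replace each gene by its probesets, keeping only those on the array."""
    mapped = {}
    for name, genes in gene_sets.items():
        probes = []
        for gene in genes:
            for probe in gene_to_probesets.get(gene, []):
                if probe in available and probe not in probes:
                    probes.append(probe)
        mapped[name] = probes
    return mapped

# Placeholder inputs.
gmt_lines = [
    "SET_ONE\tna\tGENE_A\tGENE_B\n",
    "SET_TWO\tna\tGENE_C\tGENE_X\n",
]
annotation = {
    "GENE_A": ["probe_1", "probe_2"],
    "GENE_B": ["probe_3"],
    "GENE_C": ["probe_4"],
}
available = {"probe_1", "probe_3", "probe_4"}

mapped = map_to_probesets(parse_gmt(gmt_lines), annotation, available)
```

Genes without an annotation entry (`GENE_X`) and probesets missing from the expression data (`probe_2`) are silently dropped, so a set can end up smaller than its MSigDB list.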
Running the classifier
The main classifier used is the centroid classifier (also called
nearest-centroid classifier). For the internal and external
validation, we also use a support vector machine with a linear kernel
(R package kernlab), the van 't Veer classifier, and the PAM shrunken
nearest-centroid classifier (R package pamr).
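
For readers unfamiliar with the nearest-centroid classifier, here is a minimal Python sketch of the general technique: each class is represented by the mean of its training samples, and a new sample is assigned to the class whose centroid is closest. Euclidean distance and the toy data are assumptions for the example; this is not the implementation in the GeneSets R package.

```python
from math import dist
from statistics import mean

def fit_centroids(X, y):
    """Compute one centroid (per-feature mean) for each class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label of the centroid nearest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Toy training data: two features per sample, two classes.
X_train = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_train = [0, 0, 1, 1]

model = fit_centroids(X_train, y_train)
predictions = [predict(model, x) for x in [[1.5, 2.0], [7.0, 7.0]]]
```

With gene set features, each column of `X_train` would hold one set statistic rather than one gene.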
Instructions
This section assumes you have all the input files.
The script make-msigdb.sh (part of GeneSetStats.tar.gz) performs the
preprocessing, creates the gene sets, and runs the classifiers.
The code for the classifiers is in the GeneSets R package.
Rights
GPLv3 License. There are no restrictions for use by non-academics.
Citation
Abraham, G; Kowalczyk, A; Loi, S; Haviv, I; Zobel, J. (2011)
Computational Model for Gene Set Analysis to predict breast cancer
prognosis based on microarray gene expression data. Computer
Science and Software Engineering, The University of Melbourne.
doi:10.4225/02/4E9F69C011BC8
Supplement to: Abraham, G; Kowalczyk, A; Loi, S; Haviv, I;
Zobel, J. (2010) Prediction of breast cancer prognosis using gene
set statistics provides signature stability and biological context.
BMC Bioinformatics 11:277. doi:10.1186/1471-2105-11-277
This page and its content are licensed under a Creative Commons Attribution 3.0 Australia License.