
GSS Model

Description

This computational model predicts breast cancer prognosis from microarray gene expression data. It uses prior knowledge in the form of pre-specified gene sets from the Molecular Signatures Database (MSigDB).

We compare features derived from the gene sets with features based on individual genes, with respect to the following criteria:

The purpose of the set statistic is to reduce the set's expression matrix to a single vector, which is then used as a feature for classification. The set statistic is intended to be a useful summary of the expression levels of the genes in the set. All five set statistics used in this work are unsupervised, in the sense that they do not take the metastatic class into account.

To measure the concordance between datasets, we perform internal and external validation. For internal validation, we use repeated random subsampling to estimate the classifier's generalisation error within each dataset, as measured by AUC (area under the receiver operating characteristic curve); the subsampling is also used to form a bagged classifier for each dataset. For external validation, the bagged classifier from each dataset is used to predict the metastatic class of patients from another dataset.
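As a rough illustration of what a set statistic does, the sketch below collapses a toy set's expression matrix (probesets by samples) into one value per sample. The function name `set_statistic`, the dictionary layout, and the choice of mean and median as example statistics are assumptions made for this sketch; they are not taken from the GeneSets package and may not match the statistics used in this work.

```python
from statistics import mean, median

def set_statistic(expr, probesets, stat=mean):
    """Reduce a gene set's expression matrix to one value per sample.

    expr: dict mapping probeset ID -> list of expression values (one per sample).
    probesets: IDs of the probesets that belong to the set.
    stat: unsupervised summary applied across the set's probesets, per sample.
    """
    rows = [expr[p] for p in probesets if p in expr]
    if not rows:
        raise ValueError("no probesets from this set are present in the data")
    # zip(*rows) walks over samples, i.e. the columns of the set's matrix.
    return [stat(col) for col in zip(*rows)]

# Toy data: 3 probesets x 4 samples (values are made up).
expr = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [3.0, 2.0, 1.0, 0.0],
    "p3": [5.0, 8.0, 2.0, 5.0],
}
set_mean = set_statistic(expr, ["p1", "p2", "p3"], stat=mean)
set_median = set_statistic(expr, ["p1", "p2", "p3"], stat=median)
print(set_mean)    # one feature value per sample
print(set_median)
```

Neither statistic looks at the class labels, which is what makes them unsupervised in the sense described above.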

Process

There are three main steps in the GSS analysis:

  1. preprocessing the input data to usable form
  2. creating the gene set datasets from the expression data based on MSigDB lists
  3. running the analysis on the gene set data

Preprocessing

  1. Affymetrix quality-control probesets are removed
  2. Samples are matched with annotation
  3. Data are log2-transformed if they are on the original scale
  4. Samples with non-informative censoring (censored before 5 years) are removed
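The four preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the code in make-msigdb.sh: the function `preprocess`, the annotation fields `time` and `event`, and the threshold used to guess whether data are on the original scale are all hypothetical. It also assumes that "censored before 5 years" means no observed event and less than 5 years of follow-up.

```python
import math

def preprocess(expr, samples, annot, cutoff_years=5.0, log_scale_max=30.0):
    """expr: dict probeset ID -> list of values, ordered as in `samples`.
    annot: dict sample ID -> {"time": follow-up in years, "event": bool}.
    """
    def subset(data, keep):
        return {p: [vals[i] for i in keep] for p, vals in data.items()}

    # 1. Remove Affymetrix quality-control probesets (IDs start with "AFFX").
    expr = {p: vals for p, vals in expr.items() if not p.startswith("AFFX")}

    # 2. Match samples with annotation: keep only annotated samples.
    keep = [i for i, s in enumerate(samples) if s in annot]
    samples = [samples[i] for i in keep]
    expr = subset(expr, keep)

    # 3. log2-transform if values look like the original scale.
    #    Assumption: log-scale microarray values do not exceed log_scale_max.
    if any(x > log_scale_max for vals in expr.values() for x in vals):
        expr = {p: [math.log2(x) for x in vals] for p, vals in expr.items()}

    # 4. Remove samples censored (no event) before the cutoff.
    keep = [i for i, s in enumerate(samples)
            if annot[s]["event"] or annot[s]["time"] >= cutoff_years]
    return subset(expr, keep), [samples[i] for i in keep]

# Toy input (made-up values); s4 has no annotation, s2 is censored at 3 years.
samples = ["s1", "s2", "s3", "s4"]
expr = {
    "AFFX-BioB-5_at": [100.0, 200.0, 300.0, 400.0],
    "1007_s_at": [8.0, 16.0, 32.0, 64.0],
    "1053_at": [4.0, 2.0, 1.0, 128.0],
}
annot = {
    "s1": {"time": 2.0, "event": True},
    "s2": {"time": 3.0, "event": False},
    "s3": {"time": 7.5, "event": False},
}
clean_expr, clean_samples = preprocess(expr, samples, annot)
print(clean_samples)
print(clean_expr)
```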

Creating the gene sets

  1. The MSigDB lists are mapped to Affymetrix probesets
  2. For each set in each expression dataset, the five set statistics are computed
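The mapping step might look like the sketch below. It assumes the MSigDB lists are in the tab-separated GMT layout (set name, description, then gene identifiers) and that genes are mapped to probesets through a lookup table; the helpers `parse_gmt` and `map_sets_to_probesets`, the toy gene names, and the lookup table are hypothetical and only illustrate the idea.

```python
def parse_gmt(lines):
    """Parse GMT-style lines: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        name, _description, *genes = fields
        sets[name] = [g for g in genes if g]
    return sets

def map_sets_to_probesets(gene_sets, gene_to_probesets, available):
    """Translate each gene set into the probesets present on the array."""
    mapped = {}
    for name, genes in gene_sets.items():
        probes = []
        for gene in genes:
            for p in gene_to_probesets.get(gene, []):
                if p in available and p not in probes:
                    probes.append(p)
        if probes:  # drop sets with no measured probesets
            mapped[name] = probes
    return mapped

# Toy inputs (made-up set names, genes, and probesets).
gmt_lines = ["SET_A\tna\tGENE1\tGENE2", "SET_B\tna\tGENE3"]
gene_to_probesets = {"GENE1": ["p1", "p2"], "GENE2": ["p3"], "GENE3": ["p9"]}
available = {"p1", "p2", "p3"}

mapped = map_sets_to_probesets(parse_gmt(gmt_lines), gene_to_probesets, available)
print(mapped)
```

Each mapped set can then be summarised per sample by a set statistic to give one feature per set.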

Running the classifier

The main classifier used is the centroid classifier (also called the nearest-centroid classifier). For the internal and external validation, we also use a support vector machine with a linear kernel (R package kernlab), the van 't Veer classifier, and the PAM shrunken nearest-centroid classifier (R package pamr).
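To make the centroid classifier and the internal validation concrete, here is a minimal pure-Python sketch. It is not the GeneSets implementation: the function names, the scoring rule (difference of squared Euclidean distances to the two class centroids), the test fraction, and the number of repetitions are assumptions chosen for illustration. Bagging and external validation are not shown.

```python
import random

def fit_centroids(X, y):
    """Mean feature vector per class. X: list of samples; y: 0/1 labels."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def score(centroids, x):
    """Positive when x is closer to the class-1 centroid than to class 0."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    return d0 - d1

def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subsample_auc(X, y, n_rep=20, test_frac=0.33, seed=0):
    """Repeated random subsampling: train on one part, compute AUC on the rest."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = max(2, int(test_frac * len(idx)))
    aucs = []
    for _ in range(n_rep):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        # Skip splits where either part lacks one of the classes.
        if len(set(y_train)) < 2 or len(set(y_test)) < 2:
            continue
        centroids = fit_centroids([X[i] for i in train], y_train)
        aucs.append(auc([score(centroids, X[i]) for i in test], y_test))
    return aucs

# Toy data: two well-separated classes (made-up values).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [10.0, 9.8], [9.9, 10.1], [10.2, 10.0], [9.7, 10.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
aucs = subsample_auc(X, y)
print(sum(aucs) / len(aucs))  # mean AUC over the valid splits
```

On real data, each row of `X` would hold the gene set features (or individual gene features) for one patient, and `y` the metastatic class.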

Instructions

This section assumes you have all the input files.

The script make-msigdb.sh (part of GeneSetStats.tar.gz) performs the preprocessing, creates the gene sets, and runs the classifiers.

The code for the classifiers is in the GeneSets R package.

Rights

GPLv3 License. There are no restrictions for use by non-academics.

Citation

Abraham, G; Kowalczyk, A; Loi, S; Haviv, I; Zobel, J. (2011) Computational Model for Gene Set Analysis to predict breast cancer prognosis based on microarray gene expression data. Computer Science and Software Engineering, The University of Melbourne. doi:10.4225/02/4E9F69C011BC8

Supplement to: Abraham, G; Kowalczyk, A; Loi, S; Haviv, I; Zobel, J. (2010) Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics 11:277. doi:10.1186/1471-2105-11-277


This page and its content are licensed under a Creative Commons Attribution 3.0 Australia License.