Description of the MetaCGH database

This website provides array CGH (comparative genomic hybridization) based copy number profiles of ~8,000 human cancer genomes. The profiles were generated on high-resolution platforms (Agilent 244K as well as Affymetrix 100K, 250K, 500K, and SNP6.0) and were obtained from the Gene Expression Omnibus (GEO), a public repository of microarray datasets. For a description of the methods used for data processing and segmentation, please refer to the article "Functional genomic analysis of chromosomal aberrations in a compendium of cancer genomes" (Kim et al., from Dr. Park's lab; in preparation).

Segmentation files are provided for visual inspection of copy number changes in 8,227 cancer genomes as well as for subsequent functional analyses (e.g., the identification of recurrent alterations or functional enrichment analyses). The segmentation files follow the CBS output style: a header line followed by rows with 6 columns (metaData ID, chromosome, start, end, num.mark, and seg.mean). Each row corresponds to an individual genomic segment, where chromosome/start/end are its genomic coordinates and num.mark is the number of probes in the segment. Seg.mean is the average log2 ratio of the probes in the segment and represents the extent of the copy number change (positive and negative values represent copy gains and losses, respectively). Note that the data contain only autosomal segments. Four segmentation files are provided, one for each combination of segmentation method (CBS (1) or GLAD (2)) and genome build (hg18/Build36 or hg19/Build37).
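As an illustration of the six-column layout described above, the following is a minimal Python sketch for reading a segmentation file and labeling segments as gains or losses. It rests on several assumptions that are not stated on this page: that the file is tab-delimited, that the header uses the usual DNAcopy column names, and that a cutoff of 0.2 on seg.mean is a reasonable threshold (it is purely illustrative, not the cutoff used by Kim et al.). The sample rows are invented.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Segment(NamedTuple):
    sample_id: str   # first column: sample identifier
    chrom: str       # chromosome (autosomes only in this database)
    start: int       # segment start coordinate
    end: int         # segment end coordinate
    num_mark: int    # number of probes in the segment
    seg_mean: float  # average log2 ratio of the probes


def read_segments(handle: TextIO, delimiter: str = "\t") -> Iterator[Segment]:
    """Yield one Segment per row of a CBS-style file with one header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 6:
            continue  # ignore blank or truncated lines
        yield Segment(row[0], row[1], int(row[2]), int(row[3]),
                      int(float(row[4])), float(row[5]))


def classify(seg: Segment, threshold: float = 0.2) -> str:
    """Label a segment; the 0.2 threshold is an illustrative assumption."""
    if seg.seg_mean > threshold:
        return "gain"
    if seg.seg_mean < -threshold:
        return "loss"
    return "neutral"


def fold_change(seg_mean: float) -> float:
    """Convert a log2 ratio back to a linear tumor/reference ratio."""
    return 2.0 ** seg_mean


# Invented rows; the header names follow the DNAcopy convention (assumed).
EXAMPLE = (
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
    "sample_A\t1\t1000\t50000\t120\t0.58\n"
    "sample_A\t1\t50001\t90000\t45\t-1.0\n"
    "sample_A\t2\t2000\t80000\t300\t0.03\n"
)

segments = list(read_segments(io.StringIO(EXAMPLE)))
labels = [classify(s) for s in segments]
```

To read a downloaded file instead of the inline example, pass an open file handle to read_segments. A seg.mean of -1.0 corresponds to a linear ratio of 0.5, i.e., half the reference signal.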

Metadata for the 8,227 samples is available here (.txt file). The columns are, in order: the MetaCGH ID; three GEO accession IDs for the corresponding sample (sample/GSM identifier, study/GSE identifier, and platform/GPL identifier); tumor type; tumor subtype (if available); GEO description; and primary/cell line status. For the Affymetrix 100K and 500K platforms, the two matched samples in a pair are given in the GEO columns. The Broad Institute's IGV can read this file to sort or filter the samples by category (e.g., tumor (sub)type, primary/cell line).
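The column order described above can also be used outside IGV. The sketch below reads the metadata into dictionaries and counts samples per category. The field names, the tab delimiter, and the presence of a header line are assumptions made for illustration, and the example rows (including the accession numbers) are invented.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Hypothetical field names, following the column order given in the text.
FIELDS = ["metacgh_id", "gsm", "gse", "gpl",
          "tumor_type", "tumor_subtype", "geo_description", "sample_class"]


def read_metadata(handle: TextIO, has_header: bool = True) -> List[Dict[str, str]]:
    """Read the metadata table into a list of dicts keyed by FIELDS."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        row = row + [""] * (len(FIELDS) - len(row))  # pad missing columns
        records.append(dict(zip(FIELDS, row)))
    return records


def count_by(records: List[Dict[str, str]], field: str) -> Counter:
    """Count samples per value of one metadata field."""
    return Counter(rec[field] for rec in records)


# Invented example rows for demonstration only.
EXAMPLE = (
    "id\tgsm\tgse\tgpl\ttype\tsubtype\tdescription\tclass\n"
    "M0001\tGSM0000001\tGSE00001\tGPL0001\tbreast\tluminal\texample 1\tprimary\n"
    "M0002\tGSM0000002\tGSE00001\tGPL0001\tbreast\t\texample 2\tcell line\n"
    "M0003\tGSM0000003\tGSE00002\tGPL0002\tglioma\t\texample 3\tprimary\n"
)

records = read_metadata(io.StringIO(EXAMPLE))
by_type = count_by(records, "tumor_type")
primary_only = [r for r in records if r["sample_class"] == "primary"]
```

The same count_by helper can be applied to the platform or study columns to see how samples are distributed across GEO series.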

This website also provides the analysis files related to the article by Kim et al.

  1. The peaks identified by the GISTIC algorithm (3; 4) across the entire dataset are provided here (.txt file). GISTIC2's default options were used for peak calling. For CNV filtering, we used 2,233 HapMap CNVs (.txt file) (hg18) made available in two publications (5; 6). The overlap of the peaks with known cancer-related genes may provide important clues about the potential functionality of the peaks. We obtained 278 cancer consensus genes (.txt file) (autosomal, hg18-mapped) from a previous publication (7). Some of the identified peaks may have arisen from increased genomic instability (e.g., genomic fragile regions). Bignell et al. (8) curated 37 high-resolution genomic fragile regions (.txt file), which can be used to filter such genomic peaks. To filter germline alterations that can be misidentified as genomic peaks, we collected a list of 57,706 CNVs (.txt file) from the DGV database, drawn from 17 studies each reporting at least 1,000 CNVs.
  2. Tumor type-specific alterations were investigated for 19 tumor types (>100 samples each). The set of tumor type-specific alterations (GISTIC output) is provided here. For tumor type-specific alterations, we used a platform effect-adjusted segmentation profile to account for potential platform-specific biases (see Kim et al.).
  3. One potential use of our database is to explore low-prevalence genomic events such as chromothripsis (9). We observed that 1.5% of the 8,227 cancer genomes show significant genomic signatures indicative of chromothripsis (i.e., the copy number oscillates rapidly between fixed levels), for a total of 209 events in 124 samples. The copy number profiles of the affected chromosomes are shown for chr1 to chr22 here (.pdf file).
  4. Four segmentation files are provided, covering the two methods (CBS, GLAD) and the two reference sequences (hg18, hg19). You may need to Right Click > Save As to download a file.
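The region lists mentioned in item 1 (HapMap CNVs, DGV CNVs, genomic fragile regions) can be used to flag peaks by simple coordinate overlap. The following Python sketch shows one way to do this. It is not the CNV filtering performed inside GISTIC, and it assumes closed intervals on the same genome build (e.g., hg18) for both peaks and mask regions. The Region type, the function names, and the coordinates are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if two closed intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def split_by_overlap(peaks: Iterable[Region],
                     mask: Iterable[Region]) -> Tuple[List[Region], List[Region]]:
    """Separate peaks into (kept, flagged) by overlap with any mask region."""
    mask = list(mask)  # allow the mask to be traversed once per peak
    kept, flagged = [], []
    for peak in peaks:
        if any(overlaps(peak, m) for m in mask):
            flagged.append(peak)
        else:
            kept.append(peak)
    return kept, flagged


# Invented coordinates for demonstration only.
peaks = [Region("1", 1000, 2000), Region("1", 5000, 6000), Region("2", 1000, 2000)]
cnvs = [Region("1", 1500, 1800), Region("3", 1000, 2000)]

kept, flagged = split_by_overlap(peaks, cnvs)
```

This linear scan compares every peak against every mask region, which is adequate for a few hundred peaks. For the 57,706 DGV CNVs, sorting the mask by chromosome and start position or using an interval tree would be faster.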

Reference list
  1. Olshen, A.B., Venkatraman, E.S., Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557-572.
  2. Hupe, P., Stransky, N., Thiery, J.P., Radvanyi, F. and Barillot, E. (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 3413-3422.
  3. Beroukhim, R., Getz, G., Nghiemphu, L., Barretina, J., Hsueh, T., Linhart, D., Vivanco, I., Lee, J.C., Huang, J.H., Alexander, S. et al. (2007) Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc. Natl. Acad. Sci. U. S. A., 104, 20007-20012.
  4. Mermel, C.H., Schumacher, S.E., Hill, B., Meyerson, M.L., Beroukhim, R. and Getz, G. (2011) GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol., 12, R41.
  5. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W. et al. (2006) Global variation in copy number in the human genome. Nature, 444, 444-454.
  6. McCarroll, S.A. et al. (2008) Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet., 40, 1166-1174.
  7. Futreal, P.A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N. and Stratton, M.R. (2004) A census of human cancer genes. Nat. Rev. Cancer, 4, 177-183.
  8. Bignell, G.R., Greenman, C.D., Davies, H., Butler, A.P., Edkins, S., Andrews, J.M., Buck, G., Chen, L., Beare, D., Latimer, C. et al. (2010) Signatures of mutation and selection in the cancer genome. Nature, 463, 893-898.
  9. Stephens, P.J., Greenman, C.D., Fu, B., Yang, F., Bignell, G.R., Mudie, L.J., Pleasance, E.D., Lau, K.W., Beare, D., Stebbings, L.A. et al. (2011) Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144, 27-40.