This website provides array CGH (comparative genomic hybridization) based copy number profiles of ~8,000 human cancer genomes. The copy number profiles from high-resolution array CGH platforms (Agilent 244K as well as Affymetrix 100K, 250K, 500K, and SNP6.0) were obtained from the Gene Expression Omnibus (GEO), a public repository of microarray datasets. For a description of the data processing and segmentation methods, please refer to the article "Functional genomic analysis of chromosomal aberrations in a compendium of cancer genomes" (Kim et al., Dr. Park's lab; in preparation).
Segmentation files are provided for visual inspection of copy number changes in 8,227 cancer genomes as well as for subsequent functional analyses (e.g., the identification of recurrent alterations or functional enrichment analyses). The segmentation files are in CBS-output style (with a header line) and have 6 columns (metaData ID, chromosome, start, end, num.mark, and seg.mean). Each row in the file corresponds to an individual genomic segment, where chromosome/start/end are genomic coordinates. Seg.mean is the average log2 ratio of the probes in the segment and represents the extent of copy number change (positive and negative values represent copy gains and losses, respectively). Note that the data contain only autosomal segments. Four segmentation files are provided, one for each combination of segmentation method (CBS (1) or GLAD (2)) and genome build (hg18/Build36 or hg19/Build37).
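As a rough illustration of how a six-column segmentation file of this kind might be read, the following Python sketch parses rows positionally and labels each segment by the sign and size of its seg.mean. It assumes a tab-delimited file; the exact header names, the sample rows, the ±0.2 cutoff, and the function names are all hypothetical and are not taken from the files on this site. The copy number conversion assumes a diploid reference and a pure tumor sample, which real samples rarely satisfy.

```python
import csv
import io
from typing import NamedTuple


class Segment(NamedTuple):
    sample_id: str
    chromosome: str
    start: int
    end: int
    num_mark: int
    seg_mean: float


def read_segments(handle, delimiter="\t"):
    """Yield Segment records from a six-column file with one header line.

    Assumption: the file is tab-delimited; adjust `delimiter` otherwise.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 6:
            continue  # ignore blank or truncated lines
        yield Segment(
            row[0],
            row[1],
            int(float(row[2])),  # tolerate values such as "1e+06"
            int(float(row[3])),
            int(float(row[4])),
            float(row[5]),
        )


def call_state(seg_mean, threshold=0.2):
    """Label a segment; the +/-0.2 cutoff is an illustrative assumption."""
    if seg_mean >= threshold:
        return "gain"
    if seg_mean <= -threshold:
        return "loss"
    return "neutral"


def approx_copy_number(seg_mean):
    """Convert a log2 ratio to copy number (diploid reference, pure sample assumed)."""
    return 2.0 * 2.0 ** seg_mean


# Made-up rows, used only to exercise the functions above.
EXAMPLE = (
    "ID\tchromosome\tstart\tend\tnum.mark\tseg.mean\n"
    "S0001\t1\t1000\t500000\t120\t0.58\n"
    "S0001\t1\t500001\t900000\t80\t-1.0\n"
    "S0001\t2\t2000\t750000\t200\t0.03\n"
)

segments = list(read_segments(io.StringIO(EXAMPLE)))
states = [call_state(s.seg_mean) for s in segments]
```

To read a downloaded file instead of the in-memory example, pass an open file handle, e.g. `read_segments(open(path, newline=""))`, where `path` is wherever the file was saved.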
Metadata for the 8,227 samples is available here (.txt file). The information includes, in column order, the MetaCGH ID, three types of GEO accession ID (sample/GSM identifier; study/GSE identifier; platform/GPL identifier) for the corresponding sample, tumor type, tumor subtype (if available), GEO description, and primary/cell line status. For the Affymetrix 100K and 500K platforms, the two matched samples of a pair are given in the GEO columns. The Broad's IGV can read this file to sort or filter the samples by certain categories (e.g., tumor (sub)types, primary/cell lines).
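Outside IGV, the same kind of grouping can be done with a few lines of Python. The sketch below follows the column order described above, but the field names, the tab delimiter, the presence of a header line, and the sample rows (including the placeholder accession IDs) are assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical field names, following the column order described in the text.
FIELDS = [
    "meta_id", "gsm", "gse", "gpl",
    "tumor_type", "tumor_subtype", "geo_description", "sample_source",
]


def read_metadata(handle, has_header=True, delimiter="\t"):
    """Return {meta_id: record}; header presence and delimiter are assumptions."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)
    records = {}
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank lines
        row = row + [""] * (len(FIELDS) - len(row))  # pad short rows
        records[row[0]] = dict(zip(FIELDS, row))
    return records


def group_by(records, field):
    """Group sample IDs by the value of one metadata field."""
    groups = defaultdict(list)
    for meta_id, record in records.items():
        groups[record[field]].append(meta_id)
    return dict(groups)


# Made-up rows with placeholder accession IDs, for demonstration only.
EXAMPLE = (
    "id\tgsm\tgse\tgpl\ttype\tsubtype\tdescription\tsource\n"
    "S0001\tGSM_A\tGSE_X\tGPL_P\tbreast\t\texample tumor\tprimary\n"
    "S0002\tGSM_B\tGSE_X\tGPL_P\tlung\tadenocarcinoma\texample line\tcell line\n"
    "S0003\tGSM_C\tGSE_Y\tGPL_Q\tbreast\tbasal\texample tumor\tprimary\n"
)

records = read_metadata(io.StringIO(EXAMPLE))
by_type = group_by(records, "tumor_type")
by_source = group_by(records, "sample_source")
```

The resulting sample ID lists can then be used to select the matching rows of a segmentation file by their first column.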
This website also provides the analysis files related to the article by Kim et al. (.txt file). GISTIC2's default options were used for peak calling. For CNV filtering, we used 2,233 HapMap CNVs (.txt file) (hg18) made available in two publications (5; 6). The overlap of the peaks with known cancer-related genes may provide important clues to the potential functionality of the peaks. We obtained 278 cancer consensus genes (.txt file) (autosomal, hg18-mapped) from a previous publication (7). Some of the identified peaks may have arisen from increased genomic instability (e.g., genomic fragile regions). Bignell et al. (8) curated 37 high-resolution genomic fragile regions (.txt file), which can be used to filter such genomic peaks. To filter germline alterations that can be misidentified as genomic peaks, we collected a list of 57,706 CNVs (.txt file) from the DGV database, drawn from 17 studies each reporting at least 1,000 CNVs.
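The filtering steps described above amount to interval-overlap tests between peaks and a list of regions (CNVs or fragile regions). The Python sketch below shows one simple way to do this. It is not the procedure used in the article: the overlap criterion (at least half of a peak covered by a single region), the closed 1-based coordinates, the function names, and the example regions are all assumptions chosen for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed inclusive


def overlap_length(a, b):
    """Number of bases shared by two regions (0 if on different chromosomes)."""
    if a.chromosome != b.chromosome:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def filter_peaks(peaks, blacklist, min_fraction=0.5):
    """Keep peaks whose best single-region overlap covers < min_fraction of the peak.

    The 0.5 default is a hypothetical cutoff. Overlaps with several blacklist
    regions are not summed; only the largest single overlap is considered.
    """
    kept = []
    for peak in peaks:
        peak_length = peak.end - peak.start + 1
        covered = max((overlap_length(peak, r) for r in blacklist), default=0)
        if covered / peak_length < min_fraction:
            kept.append(peak)
    return kept


# Made-up coordinates, for demonstration only.
peaks = [
    Region("1", 100, 199),
    Region("1", 1000, 1999),
    Region("2", 100, 199),
]
known_cnvs = [
    Region("1", 150, 400),
    Region("1", 1900, 2500),
]

kept = filter_peaks(peaks, known_cnvs)
```

This scan compares every peak against every region, which is adequate for lists of the sizes mentioned here against a modest number of peaks; an interval tree or a sorted sweep would be the usual choice for larger inputs.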
(.pdf file).