Supplementary Material for Published Papers

Apostolou et al, Cell Stem Cell, 2013

Genome-wide Chromatin Interactions of the Nanog Locus in Pluripotency, Differentiation, and Reprogramming

Abstract: The chromatin state of pluripotency genes has been studied extensively in embryonic stem cells (ESCs) and differentiated cells, but their potential interactions with other parts of the genome remain largely unexplored. Here, we identified a genome-wide, pluripotency-specific interaction network around the Nanog promoter by adapting circular chromosome conformation capture sequencing. This network was rearranged during differentiation and restored in induced pluripotent stem cells. A large fraction of Nanog-interacting loci were bound by Mediator or cohesin in pluripotent cells. Depletion of these proteins from ESCs resulted in a disruption of contacts and the acquisition of a differentiation-specific interaction pattern prior to obvious transcriptional and phenotypic changes. Similarly, the establishment of Nanog interactions during reprogramming often preceded transcriptional upregulation of associated genes, suggesting a causative link. Our results document a complex, pluripotency-specific chromatin “interactome” for Nanog and suggest a functional role for long-range genomic interactions in the maintenance and induction of pluripotency.

Xi et al, PNAS, 2011

Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion

Abstract: DNA copy number variations (CNVs) play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. Array comparative genomic hybridization (CGH) has been used widely to identify CNVs genome-wide, but the next-generation sequencing technology provides an opportunity to characterize CNVs genome-wide with unprecedented resolution. In this study, we developed an algorithm to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced glioblastoma genome with a matched control. This read-depth algorithm, called BIC-seq, can accurately and efficiently identify CNVs via minimizing the Bayesian information criterion (BIC). Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10X coverage, while we could only detect large CNVs (>15 Kb) in the array CGH profiles for the same genome. Eighty percent (14/16) of the small variants tested (110 bp to 14 Kb) were experimentally validated by quantitative PCR, demonstrating high sensitivity and true positive rate of the algorithm. We also extended the algorithm to detect recurrent CNVs in multiple samples as well as deriving error bars for breakpoints using a Gibbs sampling approach. We propose this statistical approach as a principled yet practical and efficient method to estimate CNVs in whole-genome sequencing data.
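
Illustrative sketch (not the published BIC-seq code): the idea of segmenting read-depth data by minimizing a BIC can be shown with a greedy merge of neighboring bins. The sketch assumes each bin holds a tumor read count and a matched-control read count, models each segment with a Bernoulli likelihood, and charges a penalty of `lam * log(total reads)` per segment. The function names, the `lam` parameter, and the greedy merge order are assumptions made for illustration only.

```python
import math

def seg_loglik(t, n):
    """Bernoulli log-likelihood of t tumor and n control reads in one segment."""
    total = t + n
    ll = 0.0
    if t > 0:
        ll += t * math.log(t / total)
    if n > 0:
        ll += n * math.log(n / total)
    return ll

def bic_segment(tumor, normal, lam=1.0):
    """Greedily merge adjacent bins while a merge lowers the BIC.

    Returns a list of (start_bin, end_bin_exclusive, tumor_reads, control_reads).
    `lam` is a hypothetical tuning parameter scaling the per-segment penalty.
    """
    segs = [(i, i + 1, t, n) for i, (t, n) in enumerate(zip(tumor, normal))]
    penalty = lam * math.log(sum(tumor) + sum(normal))  # BIC cost of one segment
    while len(segs) > 1:
        best_i, best_delta = None, 0.0
        for i in range(len(segs) - 1):
            a, b = segs[i], segs[i + 1]
            merged_ll = seg_loglik(a[2] + b[2], a[3] + b[3])
            split_ll = seg_loglik(a[2], a[3]) + seg_loglik(b[2], b[3])
            # Change in BIC if the two neighbors are merged into one segment.
            delta = -2.0 * (merged_ll - split_ll) - penalty
            if delta < best_delta:
                best_i, best_delta = i, delta
        if best_i is None:  # no merge lowers the BIC any further
            break
        a, b = segs[best_i], segs[best_i + 1]
        segs[best_i:best_i + 2] = [(a[0], b[1], a[2] + b[2], a[3] + b[3])]
    return segs

# Toy data: bins 3-5 carry three times as many tumor reads as their neighbors.
tumor = [10, 10, 10, 30, 30, 30, 10, 10]
normal = [10] * 8
segments = bic_segment(tumor, normal)
```

On this toy input the merge stops at three segments, with breakpoints before bin 3 and before bin 6. A larger `lam` favors fewer, longer segments.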

Larschan et al, Nature, 2011

X chromosome dosage compensation via enhanced transcriptional elongation in Drosophila

Abstract: The evolution of sex chromosomes has resulted in numerous species in which females inherit two X chromosomes but males have a single X, thus requiring dosage compensation. MSL (Male-specific lethal) complex increases transcription on the single X chromosome of Drosophila males to equalize expression of X-linked genes between the sexes. The biochemical mechanisms used for dosage compensation must function over a wide dynamic range of transcription levels and differential expression patterns. It has been proposed that the MSL complex regulates transcriptional elongation to control dosage compensation, a model subsequently supported by mapping of the MSL complex and MSL-dependent histone H4 lysine 16 acetylation to the bodies of X-linked genes in males, with a bias towards 3′ ends. However, experimental analysis of MSL function at the mechanistic level has been challenging owing to the small magnitude of the chromosome-wide effect and the lack of an in vitro system for biochemical analysis. Here we use global run-on sequencing (GRO-seq) to examine the specific effect of the MSL complex on RNA Polymerase II (RNAP II) on a genome-wide level. Results indicate that the MSL complex enhances transcription by facilitating the progression of RNAP II across the bodies of active X-linked genes. Improving transcriptional output downstream of typical gene-specific controls may explain how dosage compensation can be imposed on the diverse set of genes along an entire chromosome.

Kim et al, Cancer Research, 2011

A developmental taxonomy of glioblastoma defined and maintained by microRNAs

Abstract: mRNA expression profiling has suggested the existence of multiple glioblastoma subclasses, but their number and characteristics vary among studies and the etiology underlying their development is unclear. In this study, we analyzed 261 microRNA expression profiles from The Cancer Genome Atlas (TCGA), identifying five clinically and genetically distinct subclasses of glioblastoma that each related to a different neural precursor cell type. These microRNA-based glioblastoma subclasses displayed microRNA and mRNA expression signatures resembling those of radial glia, oligoneuronal precursors, neuronal precursors, neuroepithelial/neural crest precursors, or astrocyte precursors. Each subclass was determined to be genetically distinct, based on the significant differences they displayed in terms of patient race, age, treatment response, and survival. We also identified several microRNAs as potent regulators of subclass-specific gene expression networks in glioblastoma. Foremost among these is miR-9, which suppresses mesenchymal differentiation in glioblastoma by downregulating expression of JAK kinases and inhibiting activation of STAT3. Our findings suggest that microRNAs are important determinants of glioblastoma subclasses through their ability to regulate developmental growth and differentiation programs in several transformed neural precursor cell types. Taken together, our results define developmental microRNA expression signatures that both characterize and contribute to the phenotypic diversity of glioblastoma subclasses, thereby providing an expanded framework for understanding the pathogenesis of glioblastoma in a human neurodevelopmental context.

Tolstorukov et al, Nature Structural and Molecular Biology, 2011

Impact of chromatin structure on sequence variability in the human genome

Abstract: DNA sequence variations in individual genomes give rise to different phenotypes within the same species. One mechanism in this process is the alteration of chromatin structure due to sequence variation that impacts gene regulation. We composed a high-confidence collection of human SNPs and indels based on analysis of publicly available sequencing data and investigated whether the DNA loci associated with stable nucleosome positions are protected against mutations. We addressed how the sequence variation is reflected in the occupancy profiles of nucleosomes bearing different epigenetic modifications on genome scale. We find that indels are depleted around nucleosome positions of all considered types, while SNPs are enriched around the positions of bulk nucleosomes but depleted around the positions of epigenetically modified nucleosomes. These findings indicate an increased level of conservation for the sequences associated with epigenetically modified nucleosomes, highlighting complex organization of the human chromatin.

Kharchenko et al, Nature, 2011

Comprehensive analysis of the chromatin landscape in Drosophila melanogaster.

Abstract: Chromatin is composed of DNA and a variety of modified histones and non-histone proteins, which have an impact on cell differentiation, gene regulation and other key cellular processes. Here we present a genome-wide chromatin landscape for Drosophila melanogaster based on eighteen histone modifications, summarized by nine prevalent combinatorial patterns. Integrative analysis with other data (non-histone chromatin proteins, DNase I hypersensitivity, GRO-Seq reads produced by engaged polymerase, short/long RNA products) reveals discrete characteristics of chromosomes, genes, regulatory elements and other functional domains. We find that active genes display distinct chromatin signatures that are correlated with disparate gene lengths, exon patterns, regulatory functions and genomic contexts. We also demonstrate a diversity of signatures among Polycomb targets that include a subset with paused polymerase. This systematic profiling and integrative analysis of chromatin signatures provides insights into how genomic elements are regulated, and will serve as a resource for future experimental investigations of genome structure and function.

Kim et al, BMC Bioinformatics, 11:432, 2010

rSW-seq: Algorithm for detection of copy number alterations in deep sequencing data

Abstract: We develop a method for identification of copy number alterations in a tumor genome compared to its matched control, based on application of the Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region for 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq (recursive Smith-Waterman-seq) to identify alterations in a complex configuration, which are commonly observed in the human cancer genome. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results.
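
Illustrative sketch (not the published rSW-seq code): one way to apply a Smith-Waterman-style local score to read data is to sort tumor and control reads together by position, give each tumor read a positive weight and each control read a negative weight, and look for the stretch with the highest cumulative score. The weights, labels, and function names below are assumptions for illustration. The second helper reproduces the abstract's arithmetic relating read count, read length, and coverage to region size.

```python
def max_scoring_segment(labels, w_tumor=1.0, w_normal=-1.0):
    """Find the highest-scoring run in a position-sorted list of read labels.

    labels: 'T' for a tumor read, 'N' for a control read.
    Returns (score, start_index, end_index_exclusive).
    The weights are hypothetical; flipping their signs would look for losses.
    """
    best = (0.0, 0, 0)
    running, start = 0.0, 0
    for i, lab in enumerate(labels):
        running += w_tumor if lab == "T" else w_normal
        if running <= 0.0:
            # Local-alignment rule: never carry a non-positive score forward.
            running, start = 0.0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best

def region_size_bp(n_reads, read_length_bp, coverage):
    """Genomic span expected to hold n_reads at the given fold coverage."""
    return n_reads * read_length_bp / coverage

gain = max_scoring_segment(list("NTNTNTTTTNTTTNTNTN"))
span = region_size_bp(500, 100, 1.0)  # 50,000 bp, the ~50 kb quoted above
```

Here `gain` is `(6.0, 5, 13)`: reads 5 through 12 form the stretch most enriched for tumor reads.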

Peng et al, BMC Bioinformatics, 11:399, 2010

Quantized correlation coefficient for measuring reproducibility of ChIP-chip data

Abstract: Chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) is used to study protein-DNA interactions and histone modifications on a genome-scale. To ensure data quality, these experiments are usually performed in replicates, and a correlation coefficient between replicates is used often to assess reproducibility. However, the correlation coefficient can be misleading because it is affected not only by the reproducibility of the signal but also by the amount of binding signal present in the data. We develop the Quantized correlation coefficient (QCC) that is much less dependent on the amount of signal. This involves discretization of data into a set of quantiles (quantization), a merging procedure to group the background probes, and recalculation of the Pearson correlation coefficient. This procedure reduces the influence of the background noise on the statistic, which then properly focuses more on the reproducibility of the signal. The performance of this procedure is tested in both simulated and real ChIP-chip data. For replicates with different levels of enrichment over background and coverage, we find that QCC reflects reproducibility more accurately and is more robust than the standard Pearson or Spearman correlation coefficients. The quantization and the merging procedure can also suggest a proper quantile threshold for separating signal from background for further analysis.
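
Illustrative sketch (not the published QCC code): the three steps named in the abstract (quantization, merging of background probes, Pearson correlation) can be outlined as follows. The paper derives the merge point from the data; here it is simply passed in as `background_bins`, which is an assumption, as are all function and parameter names.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def quantize(values, n_bins, background_bins):
    """Map probe values to quantile levels, merging the lowest bins into level 0.

    background_bins is a hypothetical, user-supplied count of low quantile
    bins treated as background (assumed smaller than n_bins).
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        q = rank * n_bins // len(values)  # quantile bin, 0 .. n_bins - 1
        levels[idx] = max(q - background_bins + 1, 0)
    return levels

def qcc(rep1, rep2, n_bins=10, background_bins=5):
    """Pearson correlation computed on the quantized, background-merged replicates."""
    return pearson(quantize(rep1, n_bins, background_bins),
                   quantize(rep2, n_bins, background_bins))

probe_values = [5, 1, 3, 2, 4, 0, 7, 6]
levels = quantize(probe_values, n_bins=4, background_bins=2)
```

With four bins and the two lowest merged, `levels` is `[1, 0, 0, 0, 1, 0, 2, 2]`: the lower half of the probes collapses to a single background level, so noise among them no longer contributes to the correlation.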

Dreyfuss et al, Molecular Cancer, 8:71, 2009

Meta-analysis of Glioblastoma multiforme versus Anaplastic astrocytoma identifies robust gene markers

Abstract: BACKGROUND: Anaplastic astrocytoma (AA) and its more aggressive counterpart, glioblastoma multiforme (GBM), are the most common intrinsic brain tumors in adults and are almost universally fatal. A deeper understanding of the molecular relationship of these tumor types is necessary to derive insights into the diagnosis, prognosis, and treatment of gliomas. Although genomewide profiling of expression levels with microarrays can be used to identify differentially expressed genes between these tumor types, comparative studies so far have resulted in gene lists that show little overlap. RESULTS: To achieve a more accurate and stable list of the differentially expressed genes and pathways between primary GBM and AA, we performed a meta-analysis using publicly available genome-scale mRNA data sets. There were four data sets with sufficiently large sample sizes of both GBMs and AAs, all of which coincidentally used human U133 platforms from Affymetrix, allowing for easier and more precise integration of data. After scoring genes and pathways within each data set, we combined the statistics across studies using the nonparametric rank sum method to identify the features that differentiate GBMs and AAs. We found >900 statistically significant probe sets after correction for multiple testing from the >22,000 tested. We also used the rank sum approach to select >20 significant Biocarta pathways after correction for multiple testing out of >175 pathways examined. The most significant pathway was the hypoxia-inducible factor (HIF) pathway. Our analysis suggests that many of the most statistically significant genes work together in a HIF1A/VEGF-regulated network to increase angiogenesis and invasion in GBM when compared to AA. CONCLUSION: We have performed a meta-analysis of genome-scale mRNA expression data for 289 human malignant gliomas and have identified a list of >900 probe sets and >20 pathways that are significantly different between GBM and AA. These feature lists could be utilized to aid in diagnosis, prognosis, and grade reduction of high-grade gliomas and to identify genes that were not previously suspected of playing an important role in glioma biology. More generally, this approach suggests that combined analysis of existing data sets can reveal new insights and that the large amount of publicly available cancer data sets should be further utilized in a similar manner.
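
Illustrative sketch (not the paper's code): combining per-study statistics by rank sum can be outlined as ranking the features within each study and adding the ranks across studies. The scoring direction (higher score means more differential), the tie-breaking by name, and the function name are assumptions; the significance assessment and multiple-testing correction described in the abstract are not shown.

```python
def rank_sum(scores_by_study):
    """Combine per-study scores by summing within-study ranks.

    scores_by_study: list of dicts mapping feature name -> score, where a
    higher score is assumed to mean stronger differential expression.
    Returns (feature, summed_rank) pairs, smallest sum first.
    """
    # Keep only features measured in every study.
    shared = set(scores_by_study[0])
    for study in scores_by_study[1:]:
        shared &= set(study)
    totals = {g: 0 for g in shared}
    for study in scores_by_study:
        # Rank 1 goes to the top score; ties are broken by name for determinism.
        ordered = sorted(shared, key=lambda g: (-study[g], g))
        for rank, g in enumerate(ordered, start=1):
            totals[g] += rank
    return sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))

studies = [
    {"A": 3.0, "B": 1.0, "C": 2.0},
    {"A": 2.5, "B": 0.5, "C": 3.0},
    {"A": 4.0, "B": 2.0, "C": 1.0},
]
combined = rank_sum(studies)
```

For the toy input, `combined` is `[("A", 4), ("C", 6), ("B", 8)]`. Because only ranks are used, studies measured on different scales can be pooled without rescaling.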

Gelbart et al, Nature Structural and Molecular Biology, 2009

Drosophila MSL complex globally acetylates H4K16 on the male X chromosome for dosage compensation.

Abstract: The Drosophila melanogaster male-specific lethal (MSL) complex binds the single male X chromosome to upregulate gene expression to equal that from the two female X chromosomes. However, it has been puzzling that approximately 25% of transcribed genes on the X chromosome do not stably recruit MSL complex. Here we find that almost all active genes on the X chromosome are associated with robust H4 Lys16 acetylation (H4K16ac), the histone modification catalyzed by the MSL complex. The distribution of H4K16ac is much broader than that of the MSL complex, and our results favor the idea that chromosome-wide H4K16ac reflects transient association of the MSL complex, occurring through spreading or chromosomal looping. Our results parallel those of localized Polycomb repressive complex and its more broadly distributed chromatin mark, trimethylated histone H3 Lys27 (H3K27me3), suggesting a common principle for the establishment of active and silenced chromatin domains.

Park et al, Computational Statistics and Data Analysis, 2009

A Permutation Test for determining significance of clusters with applications to spatial and gene expression data

Abstract: Hierarchical clustering is a common procedure for identifying structure in a dataset, and this is frequently used for organizing genomic data. Although more advanced clustering algorithms are available, the simplicity and visual appeal of hierarchical clustering have made it ubiquitous in gene expression data analysis. Hence, even minor improvements in this framework would have significant impact. There is currently no simple and systematic way of assessing and displaying the significance of various clusters in a resulting dendrogram without making certain distributional assumptions or ignoring gene-specific variances. In this work, we introduce a permutation test based on comparing the within-cluster structure of the observed data with those of sample datasets obtained by permuting the cluster membership. We carry out this test at each node of the dendrogram using a statistic derived from the singular value decomposition of variance matrices. The p-values thus obtained provide insight into the significance of each cluster division. Given these values, one can also modify the dendrogram by combining non-significant branches. By adjusting the cut-off level of significance for branches, one can produce dendrograms with a desired level of detail for ease of interpretation. We demonstrate the usefulness of this approach by applying it to illustrative datasets.
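
Illustrative sketch (not the paper's method): the general shape of a permutation test at one dendrogram node can be shown with a simpler statistic. The paper uses a statistic derived from the singular value decomposition of variance matrices; this sketch swaps in the within-cluster sum of squares, which is an assumption made to keep the example short. Function names, the number of permutations, and the fixed seed are also illustrative.

```python
import random

def within_ss(group):
    """Sum of squared distances of the points in a group to the group centroid."""
    dim = len(group[0])
    center = [sum(p[d] for p in group) / len(group) for d in range(dim)]
    return sum((p[d] - center[d]) ** 2 for p in group for d in range(dim))

def split_p_value(left, right, n_perm=999, seed=0):
    """Permutation p-value for the split of one node into `left` and `right`.

    Cluster membership is shuffled while keeping the two cluster sizes fixed;
    the p-value is the share of shuffles that are at least as compact as the
    observed split.
    """
    rng = random.Random(seed)
    observed = within_ss(left) + within_ss(right)
    pooled = list(left) + list(right)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = within_ss(pooled[:len(left)]) + within_ss(pooled[len(left):])
        if stat <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Two well-separated toy clusters in two dimensions.
left = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.3, 0.1), (0.4, 0.3)]
right = [(5.0, 5.1), (5.1, 5.0), (5.2, 5.2), (5.3, 5.1), (5.4, 5.3)]
p = split_p_value(left, right)
```

Branches whose p-value exceeds a chosen cut-off could then be collapsed, which mirrors the dendrogram simplification described in the abstract.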

Sural et al, Nature Structural and Molecular Biology, 2008

The MSL3 chromodomain directs a key targeting step for dosage compensation of the Drosophila melanogaster X chromosome

Abstract: The male-specific lethal (MSL) complex upregulates the single male X chromosome to achieve dosage compensation in Drosophila melanogaster. We have proposed that MSL recognition of specific entry sites on the X is followed by local targeting of active genes marked by histone H3 lysine 36 trimethylation (H3K36me3). Here we analyze the role of the MSL3 chromodomain in the second targeting step. Using ChIP-chip analysis, we find that MSL3 chromodomain mutants retain binding to chromatin entry sites but show a clear disruption in the full pattern of MSL targeting in vivo, consistent with a loss of spreading. Furthermore, when compared to wild type, chromodomain mutants lack preferential affinity for nucleosomes containing H3K36me3 in vitro. Our results support a model in which activating complexes, similarly to their silencing counterparts, use the nucleosomal binding specificity of their respective chromodomains to spread from initiation sites to flanking chromatin.

Kharchenko et al, Nature Biotechnology, 2008

Design and analysis of ChIP-seq experiments for DNA-binding proteins

Abstract: Recent progress in massively parallel sequencing platforms has enabled genome-wide characterization of DNA-associated proteins using the combination of chromatin immunoprecipitation and sequencing (ChIP-seq). Although a variety of methods exist for analysis of the established alternative ChIP microarray (ChIP-chip), few approaches have been described for processing ChIP-seq data. To fill this gap, we propose an analysis pipeline specifically designed to detect protein-binding positions with high accuracy. Using previously reported data sets for three transcription factors, we illustrate methods for improving tag alignment and correcting for background signals. We compare the sensitivity and spatial precision of three peak detection algorithms with published methods, demonstrating gains in spatial precision when an asymmetric distribution of tags on positive and negative strands is considered. We also analyze the relationship between the depth of sequencing and characteristics of the detected binding positions, and provide a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.
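
Illustrative sketch (not the paper's pipeline): the strand asymmetry mentioned in the abstract arises because tags on the positive strand tend to fall upstream of a binding position and tags on the negative strand downstream of it. A minimal way to use this, assumed here for illustration, is to find the offset at which positive-strand and negative-strand tag positions coincide most often and to place the binding position halfway between paired tags. All names and the `max_shift` default are hypothetical.

```python
def best_strand_shift(plus_starts, minus_starts, max_shift=300):
    """Return the offset (in bp) that best aligns plus-strand tags to minus-strand tags."""
    minus = set(minus_starts)
    best_shift, best_count = 0, -1
    for shift in range(max_shift + 1):
        # Count plus-strand tags that land on a minus-strand tag after shifting.
        count = sum(1 for p in plus_starts if p + shift in minus)
        if count > best_count:
            best_shift, best_count = shift, count
    return best_shift

def binding_positions(plus_starts, shift):
    """Place each putative binding position halfway along the estimated offset."""
    return [p + shift // 2 for p in plus_starts]

# Toy data: every minus-strand tag sits 150 bp downstream of a plus-strand tag.
plus_tags = [100, 205, 310, 420]
minus_tags = [p + 150 for p in plus_tags]
shift = best_strand_shift(plus_tags, minus_tags)
sites = binding_positions(plus_tags, shift)
```

On the toy data the estimated offset is 150 bp and the inferred positions are `[175, 280, 385, 495]`. Real data would need tag densities rather than exact position matches.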

Alekseyenko et al, Cell, 2008

A sequence motif within chromatin entry sites directs MSL establishment on the Drosophila X chromosome.

Abstract: The Drosophila MSL complex associates with active genes specifically on the male X chromosome to acetylate histone H4 at lysine 16 and increase expression approximately 2-fold. To date, no DNA sequence has been discovered to explain the specificity of MSL binding. We hypothesized that sequence-specific targeting occurs at "chromatin entry sites," but the majority of sites are sequence independent. Here we characterize 150 potential entry sites by ChIP-chip and ChIP-seq and discover a GA-rich MSL recognition element (MRE). The motif is only slightly enriched on the X chromosome (approximately 2-fold), but this is doubled when considering its preferential location within or 3' to active genes (>4-fold enrichment). When inserted on an autosome, a newly identified site can direct local MSL spreading to flanking active genes. These results provide strong evidence for both sequence-dependent and -independent steps in MSL targeting of dosage compensation to the male X chromosome.

Orford et al, Developmental Cell, 2008

Differential H3K4 methylation identifies developmentally poised hematopoietic genes.

Abstract: Throughout development, cell fate decisions are converted into epigenetic information that determines cellular identity. Covalent histone modifications are heritable epigenetic marks and are hypothesized to play a central role in this process. In this report, we assess the concordance of histone H3 lysine 4 dimethylation (H3K4me2) and trimethylation (H3K4me3) on a genome-wide scale in erythroid development by analyzing pluripotent, multipotent, and unipotent cell types. Although H3K4me2 and H3K4me3 are concordant at most genes, multipotential hematopoietic cells have a subset of genes that are differentially methylated (H3K4me2+/me3-). These genes are transcriptionally silent, highly enriched in lineage-specific hematopoietic genes, and uniquely susceptible to differentiation-induced H3K4 demethylation. Self-renewing embryonic stem cells, which restrict H3K4 methylation to genes that contain CpG islands (CGIs), lack H3K4me2+/me3- genes. These data reveal distinct epigenetic regulation of CGI and non-CGI genes during development and indicate an interactive relationship between DNA sequence and differential H3K4 methylation in lineage-specific differentiation.

Larschan et al, Molecular Cell, 2007

MSL complex is attracted to genes marked by H3K36 trimethylation using a sequence-independent mechanism

Abstract: Dosage compensation equalizes the levels of transcripts encoded on the X chromosome between XY males and XX females. In Drosophila, dosage compensation requires the MSL (Male Specific Lethal) complex, which associates with actively transcribed genes on the single male X chromosome to upregulate transcription approximately two-fold. MSL complex targets genes, with increased binding toward 3’ ends. To search for chromatin modifications associated with MSL binding, we mapped H3K36 trimethylation (H3K36me3) in Drosophila SL2 cells, and found that it marks transcribed genes with a 3’ bias, as in yeast and humans. On the male X chromosome, or when MSL complex is ectopically localized to an autosome, H3K36me3 is a strong predictor of MSL binding. We isolated mutants lacking Set2, the H3K36me3 methyltransferase, and found that Set2 is an essential gene in both sexes of Drosophila. In set2 mutant males, MSL complex can still associate with high affinity sites, which are proposed to mark the X chromosome through DNA sequence elements. Yet, MSL complex exhibits reduced binding to target genes, suggesting a specific role for H3K36me3 in MSL targeting. In addition, we found that recombinant MSL3 protein preferentially binds nucleosomes marked by H3K36me3 in vitro. Our results support a model in which the MSL complex recognizes many of its targets through general features of transcribed genes.

Peng et al, BMC Bioinformatics, 2007

Normalization and experimental design for ChIP-chip data.

Abstract: BACKGROUND: Chromatin immunoprecipitation on tiling arrays (ChIP-chip) has been widely used to investigate the DNA binding sites for a variety of proteins on a genome-wide scale. However, several issues in the processing and analysis of ChIP-chip data have not been resolved fully, including the effect of background (mock control) subtraction and normalization within and across arrays.

RESULTS: The binding profiles of Drosophila male-specific lethal (MSL) complex on a tiling array provide a unique opportunity for investigating these topics, as it is known to bind on the X chromosome but not on the autosomes. These large bound and control regions on the same array allow clear evaluation of analytical methods. We introduce a novel normalization scheme specifically designed for ChIP-chip data from dual-channel arrays and demonstrate that this step is critical for correcting systematic dye-bias that may exist in the data. Subtraction of the mock (non-specific antibody or no antibody) control data is generally needed to eliminate the bias, but appropriate normalization obviates the need for mock experiments and increases the correlation among replicates. The idea underlying the normalization can be used subsequently to estimate the background noise level in each array for normalization across arrays. We demonstrate the effectiveness of the methods with the MSL complex binding data and other publicly available data.

CONCLUSION: Proper normalization is essential for ChIP-chip experiments. The proposed normalization technique can correct systematic errors and compensate for the lack of mock control data, thus reducing the experimental cost and producing more accurate results.

Alekseyenko et al, Genes & Development, 2006

High-resolution ChIP-chip analysis reveals that the Drosophila MSL complex selectively identifies active genes on the male X chromosome.

Abstract: X-chromosome dosage compensation in Drosophila requires the male-specific lethal (MSL) complex, which up-regulates gene expression from the single male X chromosome. Here, we define X-chromosome-specific MSL binding at high resolution in two male cell lines and in late-stage embryos. We find that the MSL complex is highly enriched over most expressed genes, with binding biased toward the 3′ end of transcription units. The binding patterns are largely similar in the distinct cell types, with ~600 genes clearly bound in all three cases. Genes identified as clearly bound in one cell type and not in another indicate that attraction of MSL complex correlates with expression state. Thus, sequence alone is not sufficient to explain MSL targeting. We propose that the MSL complex recognizes most X-linked genes, but only in the context of chromatin factors or modifications indicative of active transcription. Distinguishing expressed genes from the bulk of the genome is likely to be an important function common to many chromatin organizing and modifying activities.

Hamada et al, Genes & Development, 2005

Global regulation of X chromosomal genes by the MSL complex in Drosophila melanogaster.

Abstract: A long-standing model postulates that X-chromosome dosage compensation in Drosophila occurs by twofold up-regulation of the single male X, but previous data cannot exclude an alternative model, in which male autosomes are down-regulated to balance gene expression. To distinguish between the two models, we used RNA interference to deplete Male-Specific Lethal (MSL) complexes from male-like tissue culture cells. We found that expression of many genes from the X chromosome decreased, while expression from the autosomes was largely unchanged. We conclude that the primary role of the MSL complex is to up-regulate the male X chromosome.

Tian et al, Proc Natl Acad Sci USA, 2005

Discovering statistically significant pathways in expression profiling studies.

Abstract: Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-alpha/beta response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry.

Lai et al, Bioinformatics, 2005

Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data.

Abstract: MOTIVATION: Array Comparative Genomic Hybridization (CGH) can reveal chromosomal aberrations in the genomic DNA. These amplifications and deletions at the DNA level are important in the pathogenesis of cancer and other diseases. While a large number of approaches have been proposed for analyzing the large array CGH datasets, the relative merits of these methods in practice are not clear. RESULTS: We compare 11 different algorithms for analyzing array CGH data. These include both segment detection methods and smoothing methods, based on diverse techniques such as mixture models, Hidden Markov Models, maximum likelihood, regression, wavelets and genetic algorithms. We compute the Receiver Operating Characteristic (ROC) curves using simulated data to quantify sensitivity and specificity for various levels of signal-to-noise ratio and different sizes of abnormalities. We also characterize their performance on chromosomal regions of interest in a real dataset obtained from patients with Glioblastoma Multiforme. While comparisons of this type are difficult due to possibly sub-optimal choice of parameters in the methods, they nevertheless reveal general characteristics that are helpful to the biological investigator.
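
Illustrative sketch (not the paper's evaluation code): an ROC curve for a simulated dataset can be traced by sweeping a threshold over per-probe scores and recording the true and false positive rates against the known aberration labels. The scoring convention (higher score means more likely aberrant) and the function name are assumptions.

```python
def roc_points(scores, truth):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    scores: per-probe aberration scores (higher is assumed more aberrant).
    truth:  1 if the probe lies in a simulated aberration, else 0.
    """
    pos = sum(truth)
    neg = len(truth) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, truth) if s >= thr and t)
        fp = sum(1 for s, t in zip(scores, truth) if s >= thr and not t)
        points.append((fp / neg, tp / pos))
    return points

curve = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Here `curve` is `[(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]`. Repeating the sweep at several signal-to-noise ratios and aberration sizes would give one curve per setting, as in the comparison described above.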