modENCODE/ENCODE Chromatin Data for "Comparative analysis of metazoan chromatin organization"

This collection contains the data described in Ho et al., "Comparative analysis of metazoan chromatin organization", Nature, 2014.

Overall Description

The datasets were generated by the modENCODE (model organism Encyclopedia of DNA Elements) and ENCODE consortia in 2007-2012, funded by the National Human Genome Research Institute (NHGRI). Please see http://www.genome.gov/26524507 and http://www.genome.gov/Encode/ for more information about the projects.

These data consist of ChIP-seq and ChIP-chip profiles for histone modifications and chromosomal proteins in fly, worm, and human, as well as several related datasets. The ChIP-chip datasets were produced on Affymetrix (fly) or NimbleGen (worm) arrays. The ChIP-seq datasets were generated on Illumina sequencers.

Related papers and websites:
For background on the ChIP-seq workflow by the consortium, please see Landt et al., ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia, Genome Research, 2012.

ENCODE-X Browser: We have developed a web application for these chromatin datasets. Its main advantage is that it lets one quickly see which chromatin-related data are available using faceted browsing, and view the data in the IGV browser, for all three organisms. The chromatin state maps generated in Ho et al., 2014 are automatically loaded in the ENCODE-X Browser.

Antibody Validation Database: Antibodies used in the project were rigorously tested, and this database contains the validation data. Please see Egelhofer et al., An assessment of histone-modification antibody quality, Nature Structural & Molecular Biology, 2011.

modENCODE data portal: This website also allows one to use faceted browsing to select datasets of interest (fly and worm only).

modMine: This warehouse by the modENCODE Data Coordinating Center contains a flexible query interface with access to extensive intermediate data and metadata (fly and worm only).

ENCODE data portal: This contains human and mouse ENCODE data.

Gene Expression Omnibus (GEO) and Short Read Archive (SRA): Raw data are available from these two sites. Links to specific datasets are available from the above sites.

Available Data:

ChIP-seq and ChIP-chip data
Input normalized ChIP-seq and ChIP-chip fold enrichment profiles
GC-content
PhastCons scores
Genomic sequence mappability tracks
Coordinates of unassembled genomic sequences
Worm TSS definition based on capRNA-seq (capTSS)
Hi-C defined topological domains
hiHMM chromatin state tracks
Cross-species chromatin browser
Protein-coding gene annotation and RNA-seq gene expression data
Human-Worm-Fly ortholog lists

ChIP-seq and ChIP-chip data

Chromatin data sheet (Excel table, 287k)

This table contains detailed metadata for all chromatin datasets, including links to the source files.

Input normalized ChIP-seq and ChIP-chip fold enrichment profiles

ChIP-seq

To enable the cross-species comparisons described in this paper, we reprocessed all data using MACS. Due to slight differences in the peak-calling and input normalization steps, there may be slight discrepancies between the fly profiles analyzed here and the profiles available at the modENCODE data portal or modMine.

For every pair of aligned ChIP and matching input-DNA data, we used MACS version 2 to generate fold enrichment signal tracks for every position in a genome:
macs2 callpeak -t ChIP.bam -c Input.bam -B --nomodel --shiftsize 73 --SPMR -g hs -n ChIP
macs2 bdgcmp -t ChIP_treat_pileup.bdg -c ChIP_control_lambda.bdg -o ChIP_FE.bedgraph -m FE
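
The two commands above are written with the human genome-size shortcut (-g hs). As a minimal sketch, assuming the same options were applied to fly and worm with the MACS shortcuts dm and ce (an assumption; only the human form appears above), the command lines for one ChIP/input pair could be assembled as follows. The function name and the organism labels are hypothetical, and the snippet only builds the commands without running them.

```python
# Hypothetical helper: builds (does not run) the two MACS2 command lines
# shown above for one ChIP/input pair.
# Assumption: fly and worm use the MACS genome-size shortcuts "dm" and "ce";
# the text above only shows "-g hs" (human).
GENOME_SHORTCUT = {"human": "hs", "fly": "dm", "worm": "ce"}


def macs2_commands(organism, chip_bam, input_bam, name):
    """Return (callpeak, bdgcmp) argument lists for one sample."""
    callpeak = [
        "macs2", "callpeak",
        "-t", chip_bam, "-c", input_bam,
        "-B", "--nomodel", "--shiftsize", "73", "--SPMR",
        "-g", GENOME_SHORTCUT[organism],
        "-n", name,
    ]
    # callpeak -B writes <name>_treat_pileup.bdg and <name>_control_lambda.bdg,
    # which bdgcmp turns into a fold-enrichment (FE) track.
    bdgcmp = [
        "macs2", "bdgcmp",
        "-t", name + "_treat_pileup.bdg",
        "-c", name + "_control_lambda.bdg",
        "-o", name + "_FE.bedgraph",
        "-m", "FE",
    ]
    return callpeak, bdgcmp


if __name__ == "__main__":
    for command in macs2_commands("fly", "ChIP.bam", "Input.bam", "ChIP"):
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run on a machine where macs2 is installed.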

ChIP-chip

For the fly data, genomic DNA Tiling Arrays v2.0 (Affymetrix) were used to hybridize ChIP and input DNA. We obtained the log-intensity ratio values (M-values) for all perfect match (PM) probes: M = log2(ChIP intensity) - log2(input intensity), and performed a whole-genome baseline shift so that the mean of M in each microarray is equal to 0. The smoothed log-intensity ratios were calculated using LOWESS with a smoothing span corresponding to 500 bp, combining normalized data from two replicate experiments. For the worm data, a custom NimbleGen two-channel whole-genome microarray platform was used to hybridize both ChIP and input DNA. MA2C was used to preprocess the data to obtain a normalized and median-centered log2 ratio for each probe. All data are publicly accessible through the modENCODE data portal or modMine.
The input normalized profiles are available at the ENCODE-X Browser.
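
The M-value definition and the baseline shift described for the fly arrays can be illustrated with a short sketch. It covers only those two steps (not the LOWESS smoothing or replicate merging), the function name is hypothetical, and the intensities are made-up toy numbers.

```python
import math


def baseline_shifted_m_values(chip_intensity, input_intensity):
    """Per-probe M = log2(ChIP) - log2(input), shifted so that mean(M) == 0."""
    m = [math.log2(c) - math.log2(i)
         for c, i in zip(chip_intensity, input_intensity)]
    mean_m = sum(m) / len(m)
    # Whole-array baseline shift: subtract the array-wide mean from every probe.
    return [value - mean_m for value in m]


if __name__ == "__main__":
    # Toy intensities (illustrative only, not real array data).
    print(baseline_shifted_m_values([2.0, 4.0, 8.0], [1.0, 1.0, 1.0]))
    # prints [-1.0, 0.0, 1.0]
```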

DNase-seq data

Aligned DNase-seq data were downloaded from the modENCODE data portal and the ENCODE UCSC download page. Additional Drosophila embryo DNase-seq data were downloaded. After confirming consistency, reads from biological replicates were merged. We calculated minimally-smoothed signals (by a Gaussian kernel smoother with bandwidth of 10 bp in fly and 50 bp in human) along the genome in 10 bp (fly) or 50 bp (human) non-overlapping bins.
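
As a rough illustration of the smoothing and binning step, the sketch below applies a Gaussian kernel to a per-base signal and then averages it in non-overlapping bins. It makes several assumptions that are not stated above: the bandwidth is treated as the standard deviation of the kernel, the kernel is truncated at three bandwidths, weights are renormalized at the sequence ends, and each bin is summarized by its mean. The function names are hypothetical.

```python
import math


def gaussian_smooth(signal, bandwidth):
    """Gaussian-weighted average of a per-base signal.

    Assumptions: bandwidth = kernel standard deviation (in bp), kernel
    truncated at 3 * bandwidth, weights renormalized near the ends.
    """
    radius = int(3 * bandwidth)
    weights = [math.exp(-0.5 * (d / bandwidth) ** 2)
               for d in range(-radius, radius + 1)]
    n = len(signal)
    smoothed = []
    for i in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for k, w in enumerate(weights):
            j = i + k - radius
            if 0 <= j < n:
                weighted_sum += w * signal[j]
                weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


def bin_means(values, bin_size):
    """Mean of the values in consecutive non-overlapping bins."""
    bins = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


if __name__ == "__main__":
    # Toy per-base read-start counts (illustrative only).
    counts = [0.0] * 30 + [5.0] * 10 + [0.0] * 30
    fly_like = bin_means(gaussian_smooth(counts, bandwidth=10), bin_size=10)
    print([round(v, 3) for v in fly_like])
```

For human data the same two functions would be called with a bandwidth and bin size of 50.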

MNase-seq data

The MNase-seq data were analyzed as described previously (ref. 38). In brief, tags were mapped to the corresponding reference genome assemblies. Positions at which the number of mapped tags had a Z-score > 7 were considered anomalous due to potential amplification bias, and the tags mapped to such positions were discarded. To compute profiles of nucleosomal frequency around TSSs, the centers of the fragments were used in the case of paired-end data. In the case of single-end data, tag positions were shifted by half of the estimated fragment size (estimated using cross-correlation analysis, ref. 39) toward the fragment 3'-ends, and tags mapping to positive and negative DNA strands were combined. Loess smoothing in an 11-bp window, which does not affect the positions of the major minima and maxima on the plots, was applied to reduce the high-frequency noise in the profiles.
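
The tag filtering and single-end shifting steps can be sketched as follows. This is an illustration under assumptions, not the pipeline from the cited reference: the Z-score is computed here from the mean and population standard deviation of tag counts over positions that have at least one tag, and the half fragment size is rounded down. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def drop_anomalous_positions(tag_positions, z_cutoff=7.0):
    """Discard every tag mapped to a position whose tag count has Z > z_cutoff.

    Assumption: Z-scores are computed over positions with at least one tag.
    """
    counts = Counter(tag_positions)
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(tag_positions)  # all positions equally covered
    anomalous = {pos for pos, n in counts.items()
                 if (n - mu) / sigma > z_cutoff}
    return [pos for pos in tag_positions if pos not in anomalous]


def shift_single_end_tag(position, strand, fragment_size):
    """Move a tag by half the fragment size toward the fragment 3'-end."""
    half = fragment_size // 2
    return position + half if strand == "+" else position - half


if __name__ == "__main__":
    # Toy data: 100 positions with one tag each, one position with 200 tags.
    tags = list(range(100)) + [500] * 200
    print(len(drop_anomalous_positions(tags)))
    print(shift_single_end_tag(100, "+", 146), shift_single_end_tag(300, "-", 146))
```

After shifting, plus- and minus-strand tags refer to the same estimated fragment centers and can be pooled into one profile.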

GC-content

We downloaded the 5bp GC% data from the UCSC genome browser annotation download page (http://hgdownload.cse.ucsc.edu/downloads.html) for

Centering at every 5 bp bin, we calculated the running median of the GC% of the surrounding 100 bp (i.e., 105 bp in total). GC scores were then binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.
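
The window arithmetic works out as 10 bins of 5 bp on each side of the central bin, that is 21 bins or 105 bp. A minimal sketch of the running median and the subsequent re-binning is shown below. It assumes that the 100 bp are split evenly between the two sides, that the window is simply truncated at chromosome ends, and that the coarser bins take the mean of the 5 bp values they contain; none of these details is stated above, and the function names are hypothetical.

```python
from statistics import median


def running_median(gc_percent, flank_bins=10):
    """Running median over the central 5 bp bin plus flank_bins on each side.

    With 5 bp bins and flank_bins=10 the window spans 21 bins = 105 bp.
    Assumption: the window is truncated at the ends of the sequence.
    """
    n = len(gc_percent)
    return [median(gc_percent[max(0, i - flank_bins):i + flank_bins + 1])
            for i in range(n)]


def rebin_mean(values, factor):
    """Combine consecutive groups of `factor` bins by taking their mean.

    factor=2 turns 5 bp bins into 10 bp bins; factor=10 gives 50 bp bins.
    """
    out = []
    for start in range(0, len(values), factor):
        chunk = values[start:start + factor]
        out.append(sum(chunk) / len(chunk))
    return out


if __name__ == "__main__":
    # Toy GC% values per 5 bp bin with one isolated spike (illustrative only).
    gc = [40.0] * 30
    gc[15] = 90.0
    smoothed = running_median(gc)
    print(smoothed[15])            # the spike is removed by the median
    print(rebin_mean(smoothed, 2)[:3])
```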

PhastCons scores

PhastCons scores were binned into 10 bp (fly and worm) or 50 bp (human) non-overlapping bins.

Genomic sequence mappability tracks

We generated empirical genomic sequence mappability tracks using input-DNA sequencing data. After merging input reads up to 100M, reads were extended to 149 bp, which corresponds to a shift of 74 bp in the signal tracks. The union set of empirically mapped regions was obtained. They are available here:
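
To illustrate the extension and union steps described above: a 74 bp shift on each side of a position gives 2 x 74 + 1 = 149 bp, so each read is extended to 149 bp from its 5' end and the extended intervals are merged. The sketch below assumes half-open coordinates, strand-aware extension, and clipping at position 0; these conventions and the function names are hypothetical.

```python
EXTENDED_LENGTH = 149  # 2 * 74 + 1, as stated in the text


def extend_read(five_prime, strand, length=EXTENDED_LENGTH):
    """Half-open interval of `length` bp starting at the read's 5' base.

    Assumption: reads are extended in their own direction of sequencing.
    """
    if strand == "+":
        return (five_prime, five_prime + length)
    return (max(0, five_prime + 1 - length), five_prime + 1)


def union_intervals(intervals):
    """Merge overlapping or touching half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Toy reads as (5' position, strand); illustrative only.
    reads = [(100, "+"), (200, "+"), (700, "-")]
    print(union_intervals([extend_read(pos, strand) for pos, strand in reads]))
```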

Coordinates of unassembled genomic sequences

We downloaded the "Gap" table from the UCSC genome browser download page (http://hgdownload.cse.ucsc.edu/downloads.html):

human (hg19)
fly (dm3) (search for chr*_gap.txt.gz)
worm (ce10) (There are no known unassembled regions in worm)
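
Once downloaded, the gap files can be reduced to simple interval records. The sketch below assumes the standard tab-separated column order of the UCSC gap table (bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge); please check it against the .sql file that accompanies each download. The function name is hypothetical and the sample row is for illustration only.

```python
def gap_rows_to_bed(lines):
    """Convert UCSC gap-table rows to (chrom, start, end, gap_type) tuples.

    Assumed columns: bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        records.append((fields[1], int(fields[2]), int(fields[3]), fields[7]))
    return records


if __name__ == "__main__":
    # Illustrative row in the assumed column order.
    sample = ["585\tchr1\t0\t10000\t1\tN\t10000\ttelomere\tno\n"]
    print(gap_rows_to_bed(sample))
```

For a real file, the rows can be read with gzip.open(path, "rt") from the Python standard library and passed directly to the function.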

Worm TSS definition based on capRNA-seq (capTSS)

We obtained the worm TSS definition based on capRNA-seq from Chen et al., "The landscape of RNA polymerase II transcription initiation in C. elegans reveals promoter and enhancer architectures". Briefly, short 5'-capped RNA from total nuclear RNA of mixed-stage embryos was sequenced (i.e., capRNA-seq) on an Illumina GAIIA (SE36) with two biological replicates. Reads from capRNA-seq were mapped to the WS220 reference genome using BWA (ref. 29). Transcription initiation regions (TICs) were identified by clustering of capRNA-seq reads. In this analysis we used TICs that overlap with WormBase TSSs within -199:100 bp. We refer to these capRNA-seq defined TSSs as capTSS in this study.
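
The TIC selection rule can be sketched as an interval overlap test. The code below assumes that the -199:100 bp window is oriented along the direction of transcription, that coordinates are inclusive, and that a TIC must be on the same chromosome and strand as the WormBase TSS; these details are assumptions, and the function names and toy coordinates are hypothetical.

```python
def tss_window(tss, strand, upstream=199, downstream=100):
    """Inclusive window from -199 to +100 bp around a TSS, strand-aware."""
    if strand == "+":
        return (tss - upstream, tss + downstream)
    return (tss - downstream, tss + upstream)


def overlaps(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_cap_tss(tics, annotated_tss):
    """Keep TICs (chrom, start, end, strand) that overlap the window of at
    least one annotated TSS (chrom, position, strand) on the same strand."""
    kept = []
    for chrom, start, end, strand in tics:
        for tss_chrom, tss_pos, tss_strand in annotated_tss:
            if (chrom, strand) == (tss_chrom, tss_strand) and overlaps(
                    (start, end), tss_window(tss_pos, tss_strand)):
                kept.append((chrom, start, end, strand))
                break
    return kept


if __name__ == "__main__":
    # Toy coordinates (illustrative only).
    tics = [("chrI", 1090, 1120, "+"), ("chrI", 1101, 1150, "+"),
            ("chrI", 900, 950, "-")]
    print(select_cap_tss(tics, [("chrI", 1000, "+")]))
```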

Hi-C defined topological domains

The data were downloaded from published paper XXX and YYY. Here are the genomic coordinates used in our study.

human (hg19)
fly (dm3)
worm (ce10) (There are no Hi-C data in worm)

hiHMM chromatin state tracks

The code and instructions for running hiHMM can be accessed here.

Cross-species chromatin browser

The chromatin state definition can be accessed via the ENCODE-X Browser.

Protein-coding gene annotation and RNA-seq gene expression data

Gene expression data can be accessed from the modENCODE/ENCODE transcription page.

Human protein-coding gene annotation, in gtf format, from GENCODE v10: gen10_CDS+exons_only_protein-coding_only.gtf.gz
Worm protein-coding gene annotation, in gtf format, from modENCODE June 2012 freeze: AG1201.integrated_transcripts_strictly_coding.ws220.gtf.gz
Fly protein-coding gene annotation, in gtf format, from modENCODE June 2012 freeze: coding_Celniker_Drosophila_Annotation_20120616_1428.gtf.gz

Human-Worm-Fly ortholog lists

MIT Human-Worm-Fly Orthologs: Modencode.merged.orth20120611_wfh_comm_all.csv