Enhancer identification using TSS-distal DHSs and p300 and CBP-1 binding sites.

=== Introduction ===

Listed positions are a subset of TSS-distal DHSs (human, fly), p300 (human) and CBP-1 (worm) binding sites that are classified as enhancers. The classification was optimized to obtain a high-confidence set that is not necessarily very inclusive.

=== Method Summary ===

We used a supervised machine learning approach to identify putative enhancers among DNase I hypersensitive sites (DHSs) and p300 or CBP-1 binding sites, hereafter referred to collectively as "regulatory sites". The basic idea is to train a supervised classifier to identify H3K4me1/3 enrichment patterns that distinguish TSS-distal regulatory sites (i.e., candidate enhancers) from TSS-proximal regulatory sites (i.e., candidate promoters). TSS-distal sites that carry these patterns are classified as putative enhancers.

=== Extended Methods ===

Human DHS and p300 binding site coordinates were downloaded from the ENCODE UCSC download page (http://genome.ucsc.edu/ENCODE/downloads.html). When replicates were available, only peaks identified in both replicates were retained. DHSs and p300 peaks wider than 1 kb were removed. DHS positions in fly cell lines were defined as the 'high-magnitude' positions in DNase I hypersensitivity identified by Kharchenko et al. We applied the same method to identify similar positions in DNase-seq data from fly embryonic stage 14, which roughly corresponds to the LE stage. Worm MXEMB CBP-1 peaks were determined by SPP with default parameters. CBP-1 peaks that were identified within broad enrichment regions wider than 1 kb were removed. For fly and human cell lines, DHS and p300 data from matching cell types were used. For fly late embryos (14-16h), the DHS data from embryonic stage 14 (10:20–11:20h) were used. For worm EE and L3, CBP-1 data from mixed embryos were used.

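The peak filtering described above (keep peaks found in both replicates, drop peaks wider than 1 kb) can be illustrated with a minimal sketch. The source does not say how "identified in both replicates" was decided; the sketch assumes that an overlap of at least one base on the same chromosome counts. The names `Peak`, `overlaps` and `filter_peaks` are hypothetical and are not taken from the original pipeline.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; a hypothetical layout.
Peak = Tuple[str, int, int]

MAX_WIDTH = 1000  # peaks wider than 1 kb are removed, as stated in the text


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks(rep1: List[Peak], rep2: List[Peak]) -> List[Peak]:
    """Keep replicate-1 peaks that are at most 1 kb wide and overlap
    some replicate-2 peak (ASSUMPTION: >= 1 bp overlap means 'in both')."""
    kept = []
    for peak in rep1:
        if peak[2] - peak[1] > MAX_WIDTH:
            continue  # too broad to be treated as a point-like regulatory site
        if any(overlaps(peak, other) for other in rep2):
            kept.append(peak)
    return kept


rep1 = [("chr1", 100, 400), ("chr1", 5000, 6500), ("chr2", 100, 300)]
rep2 = [("chr1", 350, 600), ("chr1", 5100, 5200), ("chr3", 100, 300)]
filtered = filter_peaks(rep1, rep2)
```

In this toy input only the first peak survives: the second is 1.5 kb wide, and the third has no counterpart in the second replicate. The quadratic scan is adequate for illustration; a real peak set would more likely be handled with a sorted sweep or an interval tool.
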
To define the TSS-proximal and TSS-distal sites, inclusive TSS lists were obtained by merging Ensembl v66 TSSs with GENCODE version 10 for human, and modENCODE transcript annotations for fly and worm, including all alternative sites. Different machine learning algorithms were trained to classify genomic positions as a TSS-distal regulatory site, a TSS-proximal regulatory site, or neither, based on a pool of TSS-distal (>1 kb) and TSS-proximal (<250 bp) regulatory sites and a random set of positions from elsewhere in the genome. The random set included twice as many positions as the TSS-distal site set for each cell type. Five features from each of the two marks, H3K4me1 and H3K4me3, were used for the classification: the maximum fold enrichment within +/-500 bp, and four average fold enrichment values in 250 bp bins within +/-500 bp. The pool of positions was split into two equal halves, one used as the training set and the other as the test set. The performance of the different classifier algorithms was compared using the area under the receiver operating characteristic (ROC) curve. For human and fly samples, the best performance was obtained using the model-based boosting (mboost) algorithm [12], whereas for the worm data sets, the Support Vector Machine (SVM) algorithm showed superior performance. TSS-distal sites that the trained classifier in turn labels as “TSS-distal” make up our enhancer set. In worm, the learned model was also used to classify sites 500-1000 bp from the closest TSS, and those classified as TSS-distal were included in the final enhancer set to increase the number of identified sites. Our sets of putative enhancers (hereafter referred to as ‘enhancers’) include roughly 2000 sites in fly cell lines and fly embryos, 600 sites in worm embryos, and 50,000 sites in human cell lines.
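
The labelling thresholds and the five per-mark features described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: it assumes a per-base fold-enrichment track for one chromosome stored as a Python sequence, a non-empty sorted list of TSS coordinates, and a half-open window `[pos - 500, pos + 500)` tiled by four 250 bp bins. All function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence

DISTAL_MIN = 1000    # TSS-distal: more than 1 kb from the nearest TSS
PROXIMAL_MAX = 250   # TSS-proximal: less than 250 bp from the nearest TSS
HALF_WINDOW = 500    # features are computed within +/-500 bp
BIN_SIZE = 250       # four 250 bp bins tile the 1 kb window


def nearest_tss_distance(pos: int, sorted_tss: Sequence[int]) -> int:
    """Distance from pos to the closest TSS (sorted_tss must be non-empty)."""
    i = bisect_left(sorted_tss, pos)
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min(abs(pos - t) for t in candidates)


def label_site(pos: int, sorted_tss: Sequence[int]) -> Optional[str]:
    """Return 'distal', 'proximal', or None for intermediate sites."""
    d = nearest_tss_distance(pos, sorted_tss)
    if d > DISTAL_MIN:
        return "distal"
    if d < PROXIMAL_MAX:
        return "proximal"
    return None  # between 250 bp and 1 kb: not part of the training pool


def mark_features(signal: Sequence[float], pos: int) -> List[float]:
    """Maximum fold enrichment plus four 250 bp bin means around pos."""
    if pos < HALF_WINDOW or pos + HALF_WINDOW > len(signal):
        raise ValueError("window extends past the end of the signal track")
    window = signal[pos - HALF_WINDOW: pos + HALF_WINDOW]
    bins = [window[i: i + BIN_SIZE] for i in range(0, len(window), BIN_SIZE)]
    return [max(window)] + [sum(b) / len(b) for b in bins]


def site_features(h3k4me1: Sequence[float], h3k4me3: Sequence[float],
                  pos: int) -> List[float]:
    """Concatenate the five features of each mark into a 10-value vector."""
    return mark_features(h3k4me1, pos) + mark_features(h3k4me3, pos)


# Toy tracks: flat background of 1.0 with one enriched 250 bp stretch in H3K4me1.
me1 = [1.0] * 2000
me1[1000:1250] = [3.0] * 250
me3 = [1.0] * 2000
features = site_features(me1, me3, 1000)
```

The resulting 10-value vectors, together with the distal, proximal and random-position labels, are the kind of input a classifier such as mboost or an SVM would be trained on; the training and ROC comparison steps are not shown here.
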