SPP Tutorial

Since processing of ChIP-seq data can require considerable amount of CPU time, it is often necessary to make use of parallel processing. Many of the methods provided by the package can make use of the parallel processing if the cluster configuration is passed to them. The cluster configuration is based on the snow package, which can be used to allocate either multiple local CPUs or distributed nodes. The example below loads the library and initializes a cluster with 8 processing nodes:

# load the library
library(spp);

# The following section shows how to initialize a cluster of 8 nodes for parallel processing
# see "snow" package manual for details.
library(snow)
cluster <- makeCluster(8);

If the parallel processing is not used, please omit cluster parameter in the examples below, or assigned cluster variable to NULL.

Loading tag data, selecting choosing alignment quality, removing anomalies

The initial steps described in this section can be used to read in tag data from Eland files and select a subset of tags that will be used for further processing. The subsequent steps of the processing pipeline will all work with tag data, stored in a form of lists, with each element of the list corresponding to a chromosome (or whatever the target sequences were) and contains a vector of coordinates to which the 5' ends of tags aligned.

Read in Eland alignment

chip.data <- read.eland.tags("chip.eland.file");
input.data <- read.eland.tags("input.eland.file",max.eland.tag.length=32);

Both variables (chip.data and input.data) are lists with tag coordinates ($tag element) and alignment quality information ($quality), which for Eland is just the number of mismatches.
Note that the input data call specifies max.eland.tag.length parameter, which indicates that the Eland alignment was based on the first 32bp of the tag sequence, not on entire sequence length (which may be longer). Specifying such parameter is necessary to determine 5' end positions of tags mapping to the negative strand.

The next step uses cross-correlation profile to calculate binding peak separation distance, and assess whether inclusion of tags with non-perfect alignment quality improves the cross-correlation peak:

# get binding info from cross-correlation profile
# srange gives the possible range for the size of the protected region;
# srange should be higher than tag length; making the upper boundary too high will increase calculation time
#
# bin - bin tags within the specified number of basepairs to speed up calculation;
# increasing bin size decreases the accuracy of the determined parameters
binding.characteristics <- get.binding.characteristics(chip.data,srange=c(50,500),bin=5,cluster=cluster);

The binding.characteristics structure provides the estimate of the binding peak separation distance, cross-correlation profile itself and tag quality bin acceptance information. If you would like to accept all aligned tags, specify accept.all.tags=T argument to save time.

# print out binding peak separation distance
print(paste("binding peak separation distance =",binding.characteristics$peak$x))

# plot cross-correlation profile
pdf(file="example.crosscorrelation.pdf",width=5,height=5)
par(mar = c(3.5,3.5,1.0,0.5), mgp = c(2,0.65,0), cex = 0.8);
plot(binding.characteristics$cross.correlation,type='l',xlab="strand shift",ylab="cross-correlation");
abline(v=binding.characteristics$peak$x,lty=2,col=2)
dev.off();

# select informative tags based on the binding characteristics
chip.data <- select.informative.tags(chip.data,binding.characteristics);
input.data <- select.informative.tags(input.data,binding.characteristics);

The chip.data and input.data is now a simple list of tag coordinate vectors. The analysis of binding characteristics described above could have been skipped by accepting all of the tag alignments after reading the Eland files (i.e. chip.data <- chip.data$tags). Since version 1.2, the steps can also be skipped by simply not passing the binding.characteristics argument to the select.informative.tags call.

The method calls below will scan along the chromosomes calculating local density of region (can be specified using window.size parameter, default is 200bp), removing or restricting singular positions with extremely high tag count relative to the neighborhood.

# restrict or remove singular positions with very high tag counts
chip.data <- remove.local.tag.anomalies(chip.data);
input.data <- remove.local.tag.anomalies(input.data);

Calculating genome-wide tag density and tag enrichment/depletion profiles

The following calls will calculate smoothed tag density and output it into a WIG file that can be read with genome browsers, such as IGB.

# output smoothed tag density (subtracting re-scaled input) into a WIG file
# note that the tags are shifted by half of the peak separation distance
tag.shift <- round(binding.characteristics$peak$x/2)
smoothed.density <- get.smoothed.tag.density(chip.data,control.tags=input.data,bandwidth=200,step=100,tag.shift=tag.shift);
writewig(smoothed.density,"example.density.wig","Example smoothed, background-subtracted tag density");
rm(smoothed.density);

Use scale.by.dataset.size=T option to normalize the tag density by the total dataset size (to make it comparable across samples sequenced at different sequencing depth. note: this option is typically used without the background subtraction).

The code above can be used to analyze individual samples (i.e. ChIP or input runs). To provide a rough estimate of the enrichment profile (i.e. ChIP signal over input), one can use the get.smoothed.enrichment.mle() method:

smoothed.enrichment.estimate <- get.smoothed.enrichment.mle(chip.data,input.data,bandwidth=200,step=100,tag.shift=tag.shift)
writewig(smoothed.enrichment.estimate,"example.enrichment.wig","Example smoothed maximum likelihood log2 enrichment estimate");

The following calls will scan ChIP and signal tag density to estimate lower bounds of tag enrichment (and upper bound of tag depletion if it is significant) along the genome. The resulting profile gives conservative statistical estimates of log2 fold-enrichment ratios along the genome.
The example below uses a window of 500bp (and background windows of 1, 5, 25 and 50 times that size) and a confidence interval corresponding to 1%.

# output conservative enrichment estimates
# alpha specifies significance level at which confidence intervals will be estimated
enrichment.estimates <- get.conservative.fold.enrichment.profile(chip.data,input.data,fws=500,step=100,alpha=0.01);
writewig(enrichment.estimates,"example.enrichment.estimates.wig","Example conservative fold-enrichment/depletion estimates shown on log2 scale");
rm(enrichment.estimates);

Broad regions of enrichment for a specified scale can be quickly identified using the following command:

broad.clusters <- get.broad.enrichment.clusters(chip.data,input.data,window.size=1e3,z.thr=3,tag.shift=round(binding.characteristics$peak$x/2))
# write out in broadPeak format
write.broadpeak.info(broad.clusters,"example.broadPeak")

Detecting point binding positions

The example below will use WTD method to call binding positions, using FDR of 1% and a window size estimated by the binding.characteristics:

# binding detection parameters
# desired FDR (1%). Alternatively, an E-value can be supplied to the method calls below instead of the fdr parameter
fdr <- 1e-2; 
# the binding.characteristics contains the optimized half-size for binding detection window
detection.window.halfsize <- binding.characteristics$whs;
  
# determine binding positions using wtd method
bp <- find.binding.positions(signal.data=chip.data,control.data=input.data,fdr=fdr,whs=detection.window.halfsize,cluster=cluster)

print(paste("detected",sum(unlist(lapply(bp$npl,function(d) length(d$x)))),"peaks"));
  
# output detected binding positions
output.binding.results(bp,"example.binding.positions.txt");

Alternatively, the binding positions can be determined using MTC method (referred here as lwcc):

bp <- find.binding.positions(signal.data=chip.data,control.data=input.data,fdr=fdr,method=tag.lwcc,whs=detection.window.halfsize,cluster=cluster)

By default, the methods exclude large (>1e4bp) regions that exhibit significantly higher input tag density than expected. To prevent such exclusion use tec.filter=F option.

Broader regions of enrichment associated with the determined peaks can be determined using the following:

bp <- add.broad.peak.regions(chip.data,input.data,bp,window.size=1000,z.thr=3)
# output using narrowPeak format
write.narrowpeak.binding(bp,"example.narrowPeak")

Assessing saturation properties

# determine MSER
# note: this will take approximately 10-15x the amount of time the initial binding detection did
# The saturation criteria here is 99% consistency in the set of binding positions when adding 1e5 tags.
# To ensure convergence the number of subsampled chains (n.chains) should be higher (80)
mser <- get.mser(chip.data,input.data,step.size=1e5,test.agreement=0.99,n.chains=8,cluster=cluster,fdr=fdr,method=tag.wtd,whs=detection.window.halfsize)

print(paste("MSER at a current depth is",mser));

An MSER value of 1 or very near one implies that the set of detected binding positions satisfies saturation criteria without additional selection by fold-enrichment ratios. In other words, the dataset has reached saturation in a traditional sense (absolute saturation).

Finally, the example below will calculate multiple chained subsamples of the current dataset to estimate the dependence between MSER and sequencing depth:

# interpolate MSER dependency on tag count
# note: this requires considerably more calculations than the previous steps (~ 3x more than the first MSER calculation)
# Here we interpolate MSER dependency to determine a point at which MSER of 2 is reached
# The interpolation will be based on the difference in MSER at the current depth, and a depth at 5e5 fewer tags (n.steps=6);
# evaluation of the intermediate points is omitted here to speed up the calculation (excluded.steps parameter)
# A total of 7 chains is used here to speed up calculation, whereas a higher number of chains (50) would give good convergence
msers <- get.mser.interpolation(chip.data,input.data,step.size=1e5,test.agreement=0.99, target.fold.enrichment=2,
 n.chains=7,n.steps=6,cluster=cluster,fdr=fdr,method=tag.wtd,whs=detection.window.halfsize)

print(paste("predicted sequencing depth =",round(unlist(lapply(msers,function(x) x$prediction))/1e6,5)," million tags"))

The interpolation will return NA prediction if the dataset has reached absolute saturation at the current depth. One can use return.chains=T parameter to also return calculated random chains (under msers$chains field) - these can be passed back as "get.mser.interpolation( ..., chains=msers$chains)" to calculate predictions for another target.fold.enrichment value without having to recalculate the random chain predictions.

Initialization and parallel processing

Loading tag data, selecting choosing alignment quality, removing anomalies

Read in Eland alignment

Calculating genome-wide tag density and tag enrichment/depletion profiles

Detecting point binding positions

Assessing saturation properties