BIC-seq2
This site contains information on the BIC-seq2 software, described in
Xi et al, Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants, Nucleic Acids Research, 2016.
For non-academic use, please email Dr. Tatiana Demidova-Rice at Harvard University Office of Technology Development (tatiana_demidova-rice@harvard.edu)
Introduction
BICseq2 is an algorithm developed for the normalization of
high-throughput sequencing (HTS) data and the detection of copy number
variations (CNVs) in the genome. BICseq2 can be used for detecting CNVs
with or without a control genome. There are two main components in the
algorithm:
- BICseq2-norm is for normalizing potential biases in the sequencing data. Download code
- BICseq2-seg is for detecting CNVs based on the normalized data given by BICseq2-norm. Download code
The general pipeline for CNV detection with BICseq2 is as follows.
- Only a case genome is sequenced and no control genome is available
  - Get the uniquely mapped reads from the bam file (you may use the modified samtools as provided here).
  - Use BICseq2-norm to remove the biases in the data.
  - Use BICseq2-seg to detect CNVs based on the normalized data.
- Both a case genome and a control genome are available (in a cancer study, the case genome is a tumor genome and the control genome can be the matched normal genome)
  - Get the uniquely mapped reads from the case and the control genome bam files, respectively.
  - Normalize the case and control genomes individually using BICseq2-norm.
  - Detect CNVs in the case genome based on the normalized data of the case genome and the control genome.
BICseq2-norm usage
Before using BICseq2-norm, you first have to compile the C code. To
compile, simply type
make clean
make
After compilation, you can use the perl script
BICseq2-norm.pl for normalization.
Usage:
BICseq2-norm.pl [options] <configFile> <output>
Options:
--help
-l=<int>: read length
-s=<int>: fragment size
-p=<float>: a subsample percentage; default 0.0002.
-b=<int>: bin the expected and observed as <int> bp bins; default 100.
--gc_bin: if specified, report the GC-content in the bins
--NoMapBin: if specified, do NOT bin the reads according to the mappability
--bin_only: only bin the reads without normalization
--fig=<string>: plot the read count VS GC figure in the specified file (in pdf format)
--title=<string>: title of the figure
--tmp=<string>: the tmp directory
<configFile> specifies the location of the configuration file that
has the necessary information for normalization (see below for the
format of the configuration file)
<output> is the file that stores the
parameter estimates in the GAM model. This is not useful for general
users.
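As an illustration of the usage line above, the following Python sketch assembles a BICseq2-norm.pl argument list. The helper name `build_norm_command`, the file names, and the read length and fragment size values are hypothetical; the flags are written exactly as listed in the options above. It assumes the script is executable and on your PATH (otherwise, prefix the list with `perl`).

```python
import shlex


def build_norm_command(config_file, output_file, read_length, fragment_size,
                       bin_size=100, tmp_dir=None, script="BICseq2-norm.pl"):
    """Return an argument list following: BICseq2-norm.pl [options] <configFile> <output>."""
    cmd = [
        script,
        f"-l={read_length}",    # read length
        f"-s={fragment_size}",  # fragment size
        f"-b={bin_size}",       # bin size in bp (documented default: 100)
    ]
    if tmp_dir is not None:
        cmd.append(f"--tmp={tmp_dir}")
    # Positional arguments come last: the config file, then the output file.
    cmd += [config_file, output_file]
    return cmd


# Hypothetical file names and library parameters, for illustration only.
cmd = build_norm_command("norm.config.txt", "norm.params.out",
                         read_length=50, fragment_size=300, tmp_dir="tmp")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the normalization.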
The <configFile> has the following format:
chromName | faFile | MapFile | readPosFile | binFileNorm
chr1 | chr1.fa | hg18.CRC.50mer.chr1.txt | chr1.seq | chr1.norm.bin
chr2 | chr2.fa | hg18.CRC.50mer.chr2.txt | chr2.seq | chr2.norm.bin
In the <configFile>, the columns should be tab-delimited. The first row
of this file is assumed to be the header and will be ignored by
BICseq2-norm.
The 1st column (chromName) is the chromosome name
The 2nd column (faFile) is the reference sequence of this chromosome
(human hg18 and hg19 are available for download)
The 3rd column (MapFile) is the mappability file of this chromosome
(human hg18 (50bp) and hg19 (50bp and 75bp) are available for download)
The 4th column (readPosFile) is the file that stores the mapping
positions of all reads that uniquely mapped to this chromosome
The 5th column (binFileNorm) is the file that stores the normalized data.
The data will be binned with the bin size specified by the option -b
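Since the configuration file is a plain tab-delimited table with a header row, it can be generated rather than typed by hand. The sketch below is one way to do this; the function name `norm_config_text` is hypothetical, and the file-naming pattern simply mirrors the example rows above and is an assumption about your own directory layout.

```python
HEADER = ["chromName", "faFile", "MapFile", "readPosFile", "binFileNorm"]


def norm_config_text(chromosomes, map_prefix="hg18.CRC.50mer"):
    """Build the tab-delimited BICseq2-norm config as a string.

    Assumes files are named as in the example table (chr1.fa, chr1.seq, ...);
    adjust the patterns to match your own paths.
    """
    rows = [HEADER]
    for chrom in chromosomes:
        rows.append([
            chrom,
            f"{chrom}.fa",                  # reference sequence
            f"{map_prefix}.{chrom}.txt",    # mappability file
            f"{chrom}.seq",                 # uniquely mapped read positions
            f"{chrom}.norm.bin",            # normalized output bins
        ])
    # One row per line, columns separated by tabs; the first row is the header.
    return "".join("\t".join(row) + "\n" for row in rows)


config_text = norm_config_text(["chr1", "chr2"])
print(config_text, end="")
```

Writing `config_text` to a file (for example with `open("norm.config.txt", "w")`) gives a file that can be passed as <configFile>.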
Mappability files:
BICseq2-seg usage
As with BICseq2-norm, first compile BICseq2-seg by typing
make clean
make
After compilation, you can detect CNVs with the perl script BICseq2-seg.pl
Usage:
BICseq2-seg.pl [options] <configFile> <output>
Options:
--lambda=<float>: the (positive) penalty used for BICseq2
--tmp=<string>: the tmp directory
--help: print this message
--fig=<string>: plot the CNV profile in a png file
--title=<string>: the title of the figure
--nrm: do not remove likely germline CNVs (with a matched normal) or segments with bad mappability (without a matched normal)
--bootstrap: perform bootstrap test to assign confidence (only for the one-sample case)
--noscale: do not automatically adjust the lambda parameter according to the noise level in the data
--strict: if specified, use a more stringent method to adjust the lambda parameter
--control: the data has a control genome
--detail: if specified, print the detailed segmentation result (for multiSample only)
As with the original BIC-seq algorithm, the --lambda parameter is the
main parameter used for tuning the smoothness of the CNV profile. The
larger the parameter is, the fewer segments the final profile will have.
The default value is 2.
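Because --lambda controls how smooth the profile is, it can be useful to run the segmentation at a few penalty values and compare the results. The sketch below only builds the command lines; the helper name, the candidate lambda values, and the output file naming are all hypothetical, and it assumes BICseq2-seg.pl is executable and on your PATH.

```python
def seg_commands_for_lambdas(config_file, lambdas, script="BICseq2-seg.pl"):
    """Return one BICseq2-seg.pl argument list per lambda value."""
    commands = []
    for lam in lambdas:
        if lam <= 0:
            # The documentation states that the penalty must be positive.
            raise ValueError(f"lambda must be positive, got {lam}")
        # Hypothetical naming scheme so each run writes its own output file.
        output_file = f"cnv.lambda{lam:g}.txt"
        commands.append([script, f"--lambda={lam:g}", config_file, output_file])
    return commands


# Illustrative values around the documented default of 2.
commands = seg_commands_for_lambdas("seg.config.txt", [1.2, 2, 4])
for command in commands:
    print(" ".join(command))
```

Larger values in the list should yield fewer segments, so comparing the outputs gives a feel for how sensitive the calls are to the penalty.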
<configFile> stores the necessary information for
BICseq2-seg to detect CNV
<output> stores the final CNV detection results.
<configFile> has the following format (it should be
tab-delimited and the first row will be treated as header and ignored).
- If there is no control, the format is
chromName | binFileNorm
chr1 | chr1.norm.bin
chr2 | chr2.norm.bin
The 1st column (chromName) is just the chromosome name
The 2nd column (binFileNorm) is the normalized bin file as obtained from BICseq2-norm
- If there is a control, the format is
chromName | binFileNorm.Case | binFileNorm.Control
chr1 | CaseChr1.norm.bin | ControlChr1.norm.bin
chr2 | CaseChr2.norm.bin | ControlChr2.norm.bin
The 2nd column (binFileNorm.Case) is the normalized bin file of the case genome as obtained from BICseq2-norm
The 3rd column (binFileNorm.Control) is the normalized bin file of the control genome as obtained from BICseq2-norm
Note: If you have a control, you must specify "--control" to let
BICseq2 know that the data come from a case/control study.
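The two configuration layouts and the --control requirement can be tied together in a small helper, sketched below. The function names and file names are hypothetical; the column headers and the --control flag are taken from the description above, and the script is assumed to be executable and on your PATH.

```python
def seg_config_text(case_bins, control_bins=None):
    """Build the tab-delimited BICseq2-seg config as a string.

    case_bins and control_bins map chromName -> normalized bin file.
    """
    if control_bins is None:
        rows = [("chromName", "binFileNorm")]
        rows += [(chrom, path) for chrom, path in case_bins.items()]
    else:
        missing = [chrom for chrom in case_bins if chrom not in control_bins]
        if missing:
            raise ValueError(f"no control bin file for: {', '.join(missing)}")
        rows = [("chromName", "binFileNorm.Case", "binFileNorm.Control")]
        rows += [(chrom, path, control_bins[chrom])
                 for chrom, path in case_bins.items()]
    return "".join("\t".join(row) + "\n" for row in rows)


def seg_command(config_file, output_file, has_control, script="BICseq2-seg.pl"):
    """Build the argument list, adding --control only for case/control data."""
    cmd = [script]
    if has_control:
        cmd.append("--control")
    cmd += [config_file, output_file]
    return cmd


case = {"chr1": "CaseChr1.norm.bin", "chr2": "CaseChr2.norm.bin"}
control = {"chr1": "ControlChr1.norm.bin", "chr2": "ControlChr2.norm.bin"}
paired_text = seg_config_text(case, control)
paired_cmd = seg_command("seg.config.txt", "cnv.txt", has_control=True)
print(paired_text, end="")
print(" ".join(paired_cmd))
```

Deriving `has_control` from whether control bin files were supplied keeps the configuration layout and the --control flag from getting out of sync.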
How to cite BIC-seq2:
Xi, R.*, Lee, S., Xia, Y., Kim, T. and Park, P.* (2016) Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants, Nucleic Acids Research, 44(13):6274-86.
Xi, R., Hadjipanayis, A.G., Luquette, L.J., Kim, T.M., Lee, E., Zhang, J.H., Johnson, M.D., Muzny, D.M., Wheeler, D.A., Kucherlapati, R., and Park, P.* (2011). Copy number alteration detection in sequencing data using the Bayesian information criterion, Proceedings of the National Academy of Sciences, USA, 108(46):E1128-36.
Frequently Asked Questions.
Please see this document