Getting Started with Capture Hi-C Analysis Engine

2019-09-29

Quickstart

The main function for running the CHiCANE method is chicane(). This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.

The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.

library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#> 
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#> 
#>     sleep
#> Loading required package: nlme
#> Loading required package: parallel
#>  **********   GAMLSS Version 5.1-4  **********
#> For more on GAMLSS look at http://www.gamlss.org/
#> Type gamlssNews() to see new features/changes/bug fixes.

# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2

if( bedtools.installed() ) {
  chicane.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments
    );

    print( chicane.results[ 1:10 ] );
}

The chicane function returns a data.table object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.

Pre-Processing Data

The chicane function can operate directly on BAM files. To convert the BAM file into a text file, R calls the shell script prepare_bam.sh. This script relies on bedtools to select reads overlapping the captured fragments and identify the restriction fragments corresponding to each read. The resulting data is read into a data.table object, and the number of reads linking any two fragments is calculated.

This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 13GB RAM), and only needs to be done once for each BAM. To speed up model fitting, it is possible to pre-process the data with the prepare.data() function and run chicane directly on the resulting data table.

if( bedtools.installed() ) {
  interaction.data <- prepare.data(
    bam = bam,
    baits = baits,
    fragments = fragments
    );
}

The interaction data object contains details of the fragments, and the number of reads linking the two fragments.

if( bedtools.installed() ) print(interaction.data);

By default, only combinations of fragments that are detected at least once are included in the data. To also include zero counts, use the include.zeros argument. Setting include.zeros = 'cis' includes all zero counts for bait/ target combinations on the same chromosome, while include.zeros = 'all' includes all possible combinations.

if( bedtools.installed() ) {
  cis.zero.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments,
    include.zeros = 'cis'
    ); 

    print( cis.zero.results[ 1:10 ] );
}

For most experiments including zeros will result in impractically large files. Another option for accounting for zero counts is to specify a truncated distribution through the distribution argument. However, this will slow down model fitting.

The interactions argument of the chicane() function can take either a data.table object from prepare.data or the path to such a table.

if( bedtools.installed() ) {
  file.name <- tempfile('interaction_data');
  write.table(interaction.data, file.name, row.names = FALSE);

  chicane.results <- chicane(interactions = file.name); 
}

Statistical Model

The CHiCANE method models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the “interactibility” of the bait fragment, that is, its inherent propensity to interact with other fragments.

Let \(Y_{ij}\) denote the number of reads linking bait \(i\) and target fragment \(j\), \(d_{ij}\) be the distance between the two fragments, and \(t_i\) denote the number of reads linking the bait \(i\) with another fragment in trans. The CHiCANE model assumes that \(Y_{ij}\) follows a distribution with mean parameter \(\mu_{ij}\), where

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) \end{equation}\]

If the other end fragment \(j\) is also a bait, terms are included to adjust for the trans counts of the fragment \(j\), as well as the product of the trans counts of the two fragments as shown below

\[\begin{equation} \mu_{ij} = \beta_0 + \beta_1\log(d_{ij}) + \beta_2\log(t_i + 1) + \beta_3\log(t_j + 1) + \beta_4\log(t_i + 1)\log(t_j + 1) \end{equation}\]

Once an expected value \(\mu_{ij}\) is obtained for each combination of fragments, a \(p\)-value for the observed counts \(y_{ij}\) exceeding what’s expected by random chance is calculated as

\[\begin{equation} p = P(Y_{ij} \geq y_{ij}) \end{equation}\]

The probability \(P(Y_{ij} \geq y_{ij})\) depends on how the distribution \(Y_{ij} \mid \mu_{ij}\) of the counts conditional on the expected value is modelled.

Distribution of the Counts

The CHiCANE method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution with \(E(Y_{ij}) = exp (\mu_{ij})\). The negative binomial is similar to the Poisson model for counts, but includes an overdispersion term that makes it more appropriate for datasets with high variance. After fitting \(\beta_0, \beta_1, \beta_2\) and the overdispersion term by a maximum likelihood over all pairs of fragments, we use the fitted model to estimate a \(p\)-value for each pair. The Poisson, truncated Poisson, or truncated negative binomial distribution can be specified through the distribution argument.

Negative Binomial

The negative binomial distribution models the counts as \(Y \sim NB(\mu, \theta)\), where \(\mu\) is the mean and \(\theta\) is a dispersion parameter. Under this model, the variance of the counts conditional on the mean \(\mu\) is given by

\[\begin{equation} Var(Y) = \mu + \frac{\mu^2}{\theta} \end{equation}\]

The model is fit using the glm.nb() function in the MASS package. The maximum likelihood estimates (MLEs) of the coefficients \(\hat{\beta_i}\) are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter \(\theta\) is used as the dispersion parameter (size) in the negative binomial model.

Occasionally, the model fitting procedure will result in numerical errors as the result of lack of convergence. This is most often due to lack of overdispersion in the data leading to an unstable estimate of \(\theta\). If the variance is approximately equal to the mean, the \(\theta\) term goes to infinity and the model fails to converge.

When the model throws a numerical warning or fails to converge, the chicane() function will attempt to diagnose if the problem was due to lack of overdispersion. This is done by fitting a negative binomial regerssion model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value is less than 0.1% greater than the corresponding estimate of the Poisson variance (i.e. the fitted value at that point), a Poisson distribution is fit instead.

Poisson

Unlike the negative binomial, the Poisson distribution does not allow for the variance of the counts to exceed their mean. This model assumes that \(Y \sim \text{Poisson}(\mu)\), giving \(Var(Y) = \mu\). In practice, capture Hi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.

Truncated Negative Binomial

When fitting the model without including unobserved combinations of fragments (see Pre-processing Data), there are no counts of zero in the data. To accommodate this in the statistical model, a truncated Poisson distribution can be fit using the GAMLSS package (Stasinopoulos and Rigby 2016). This model assigns \(P(Y = 0) = 0\), and assumes the probabilities are proportional to the negative binomial probabilities for values \(y > 0\).

Truncated Poisson

Similarly to the truncated negative binomial, the truncated Poisson distribution is supported by setting distribution = 'truncated-poisson'. This model assumes the probabilities are proportional to the Poisson probabilities for values \(y>0\).

Adjusting for Covariates

Adjusting for distance is done in a piecewise linear fashion. Cis-data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.

By default, CHiCANE tries to split the data based on distance quantiles, i.e. into 100 equally sized groups. In some cases, particularly when dealing with smaller datasets, the resulting datasets are too small for model fitting. When this occurs, the number of distance bins are halved until all resulting datasets are large enough for the model to be fit. To override the default behaviour, pass the desired number of distance bins to distance.bins. For example, setting distance.bins = 4 results in the count model being fit separately in each quartile. Trans interactions are fit separately from cis interactions.

Filtering Baits and Targets

By default, all baits and targets are included when fitting the CHiCANE model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility”, as inferred through the trans counts. CHiCANE also supports this model through the bait.filters and target.filters arguments.

Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in bait.filters = c(0.3, 0.9) means that the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model.

Filtering fragments may affect multiple testing correction by changing the number of tests performed.

Multiple Testing Correction

By default, CHiCANE performs multiple testing correction separately for each bait. To change this to a global multiple testing correction, set multiple.testing.correction = 'global' in the main chicane() function.

Merging Biological Replicates

CHiCANE can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as vector to the bam argument of the chicane() and prepare.data() functions. The replicate.merging.method determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum’.

Adjusting for Custom Covariates

CHiCANE allows users to specify additional covariates that should be added to the model with the adjustment.terms argument. For example, we can add a term to adjust for the chromosome of the target fragment.

data(bre80);
adjusted.results <- chicane(
  interactions = bre80, 
  adjustment.terms = 'target.chr'
  );

print( adjusted.results[ 1:5 ] );
#>                   target.id                  bait.id target.chr
#> 1:    chr19:1155128-1228414 chr2:217035649-217042355      chr19
#> 2: chr1:117125276-117135137 chr2:217035649-217042355       chr1
#> 3:   chr1:14950583-14951791 chr2:217035649-217042355       chr1
#> 4:   chr1:49076960-49080684 chr2:217035649-217042355       chr1
#> 5: chr3:145680459-145682483 chr2:217035649-217042355       chr3
#>    target.start target.end bait.chr bait.start  bait.end bait.to.bait
#> 1:      1155128    1228414     chr2  217035649 217042355        FALSE
#> 2:    117125276  117135137     chr2  217035649 217042355        FALSE
#> 3:     14950583   14951791     chr2  217035649 217042355        FALSE
#> 4:     49076960   49080684     chr2  217035649 217042355        FALSE
#> 5:    145680459  145682483     chr2  217035649 217042355        FALSE
#>    bait.trans.count target.trans.count distance count expected   p.value
#> 1:             9494                  2       NA     2 1.003963 0.2656989
#> 2:             9494                  2       NA     2 1.003992 0.2657096
#> 3:             9494                  2       NA     2 1.003992 0.2657096
#> 4:             9494                  2       NA     2 1.003992 0.2657096
#> 5:             9494                  2       NA     2 1.004014 0.2657179
#>      q.value
#> 1: 0.6417435
#> 2: 0.6417435
#> 3: 0.6417435
#> 4: 0.6417435
#> 5: 0.6417435

The adjustment.terms argument also supports more complex expressions such as log-transformations. Multiple adjustment terms can be added by passing a vector.

Any adjustment term must correspond to a column already present in the data. If you would like to adjust for anything that is not already present in the CHiCANE output (e.g. GC content), there’s a three step approach:

  1. Run prepare.data() on your BAM file
  2. Add extra columns to the resulting data frame
  3. Run chicane() with the interactions argument

Helper Functions

The convert.hicup.digest.bed() function can be used to convert a digested genome in HiCUP format to a BED file.

Session Info

sessionInfo();
#> R Under development (unstable) (2019-09-27 r77229)
#> Platform: x86_64-apple-darwin16.7.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS:   /Users/shaider/Desktop/R-devel/lib/libRblas.dylib
#> LAPACK: /Users/shaider/Desktop/R-devel/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] parallel  splines   stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.12.2 chicane_0.1.1     gamlss.tr_5.1-0   gamlss_5.1-4     
#> [5] nlme_3.1-141      gamlss.data_5.1-4 gamlss.dist_5.1-4 MASS_7.3-51.4    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2        knitr_1.25        magrittr_1.5     
#>  [4] lattice_0.20-38   foreach_1.4.7     stringr_1.4.0    
#>  [7] tools_3.7.0       grid_3.7.0        xfun_0.9         
#> [10] iterators_1.0.12  htmltools_0.3.6   yaml_2.2.0       
#> [13] survival_2.44-1.1 digest_0.6.21     Matrix_1.2-17    
#> [16] codetools_0.2-16  evaluate_0.14     rmarkdown_1.15   
#> [19] stringi_1.4.3     compiler_3.7.0

Acknowledgements

Development of CHiCANE was supported by Breast Cancer Now.

References

Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-c Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” Nature Communications 9 (1). Nature Publishing Group: 1028.

Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: Robust Detection of Dna Looping Interactions in Capture Hi-c Data.” Genome Biology 17 (1). BioMed Central: 127.

Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-c.” Genome Research 24 (11). Cold Spring Harbor Lab: 1854–68.

Stasinopoulos, Mikis, and Bob Rigby. 2016. Gamlss.tr: Generating and Fitting Truncated “Gamlss.family” Distributions. https://CRAN.R-project.org/package=gamlss.tr.