Calculating the Proportionality Coefficients of Compositional Data

Thomas Quinn

2016-10-31

Introduction

The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, as demonstrated thoroughly in a recent publication, this approach lacks statistical validity when applied to relative data (Lovell 2015). This includes, for example, some of the most frequently studied biological count data, such as those produced by microarray assays or high-throughput RNA-sequencing. As an alternative to correlation, Lovell et al propose a proportionality metric, \(\phi\), as derived from compositional data (CoDa) analysis. A subsequent publication expounded this work by elaborating on another proportionality metric, \(\rho\) (Erb 2016). This package introduces a programmatic framework for the calculation of feature dependence through proportionality, as discussed in the cited publications.

Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector (e.g., the transformed values of one of \(d\) genes, measured across \(n\) conditions). We then define the metrics \(\phi\) and \(\rho\) accordingly:

\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]

\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]
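As a minimal sketch of these two definitions, the base R snippet below computes \(\phi\) and \(\rho\) for a single pair of vectors that we assume have already been log-ratio transformed. The helper names phi_pair and rho_pair are hypothetical and do not belong to the package.

```r
# Hypothetical helpers (not part of propr): phi and rho for two
# log-ratio transformed feature vectors, following the definitions above.
phi_pair <- function(Ai, Aj) var(Ai - Aj) / var(Ai)
rho_pair <- function(Ai, Aj) 1 - var(Ai - Aj) / (var(Ai) + var(Aj))

set.seed(1)
Ai <- rnorm(50)                  # stand-in for one transformed feature
Aj <- Ai + rnorm(50, sd = 0.05)  # a second feature that tracks Ai closely

phi_pair(Ai, Aj)  # small: the difference Ai - Aj barely varies
rho_pair(Ai, Aj)  # near 1 for the same reason
rho_pair(Ai, Ai)  # 1, since var(Ai - Ai) is 0
```

Because \(var(A_i - A_j) = var(A_i) + var(A_j) - 2cov(A_i, A_j)\), the value of \(\rho\) also equals \(2cov(A_i, A_j) / (var(A_i) + var(A_j))\).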

In these definitions, we use the log-ratio transformation in order to normalize the data in a manner that respects the nature of relative data. In other words, log-ratio transformation yields the same result whether applied to absolute or relative data. In this package, we consider two log-ratio transformations of the subject vector \(x\), the centered log-ratio transformation (clr) and the additive log-ratio transformation (alr). We define the transformations \(clr(x)\) and \(alr(x)\) accordingly:

\[\textrm{clr(x)} = \left[\ln\frac{x_1}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]

\[\textrm{alr(x)} = \left[\ln\frac{x_1}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]

In clr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_1...x_D}\). In alr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and a chosen reference feature. Although these transformations differ in definition, we will sometimes refer to them jointly by the acronym *lr.
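The two transformations can be sketched in a few lines of base R. The functions clr and alr below are hypothetical stand-ins written for illustration, not the package's internal routines. They operate on a single sample vector and show that rescaling the sample to proportions leaves both transformations unchanged.

```r
# Hypothetical helpers (not the package's internal functions).
# clr: log of each element over the geometric mean of the vector.
clr <- function(x) log(x) - mean(log(x))
# alr: log of each element over a reference element (default: the last).
alr <- function(x, ref = length(x)) log(x[-ref] / x[ref])

x_abs <- c(2, 4, 8, 16)       # one sample of absolute abundances
x_rel <- x_abs / sum(x_abs)   # the same sample expressed as proportions

all.equal(clr(x_abs), clr(x_rel))   # TRUE
all.equal(alr(x_abs), alr(x_rel))   # TRUE
sum(clr(x_abs))                     # 0, up to rounding error
```

Note that alr returns one fewer element than its input, because the reference feature is consumed by the transformation.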

Calculating proportionality

We provide two principal functions for calculating proportionality. The first function, phit, implements the calculation of \(\phi\) described in Lovell et al (2015). This function makes use of clr-transformation exclusively. The second function, perb, implements the calculation of \(\rho\) described initially in Lovell et al (2015) and expounded by Erb and Notredame (2016). This function makes use of either clr- or alr-transformation.

The first difference between \(\phi\) and \(\rho\) is scale. The values of \(\phi\) range from \([0, \infty)\), with lower \(\phi\) values indicating more proportionality. The values of \(\rho\) range from \([-1, 1]\), with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality. A second difference is that \(\phi\) lacks symmetry. However, one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). A third difference is that \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features as in \(\phi\).
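To illustrate the asymmetry of \(\phi\) and one way of forcing symmetry, here is a rough base R sketch. The helpers clr_rows and phi_matrix are hypothetical, and we assume the convention that element \([i, j]\) holds \(var(A_i - A_j) / var(A_i)\); the package's own implementation and orientation may differ.

```r
# Hypothetical base R sketch; propr's own implementation may differ.
clr_rows <- function(counts) {
  logs <- log(as.matrix(counts))
  logs - rowMeans(logs)   # subtract each sample's mean log (clr per row)
}

phi_matrix <- function(counts) {
  A <- clr_rows(counts)
  d <- ncol(A)
  out <- matrix(0, d, d, dimnames = list(colnames(A), colnames(A)))
  for (i in seq_len(d)) for (j in seq_len(d)) {
    out[i, j] <- var(A[, i] - A[, j]) / var(A[, i])
  }
  out
}

set.seed(12345)
N <- 100
X <- data.frame(a = (1:N), b = (1:N) * rnorm(N, 10, 0.1),
                c = (N:1), d = (N:1) * rnorm(N, 10, 1.0))

phi <- phi_matrix(X)
isSymmetric(phi)        # FALSE: the denominator differs for [i, j] and [j, i]

# Reflect the lower left triangle across the diagonal.
phi_sym <- phi
phi_sym[upper.tri(phi_sym)] <- t(phi_sym)[upper.tri(phi_sym)]
isSymmetric(phi_sym)    # TRUE
```

The double loop keeps the sketch readable; it is not meant to be efficient for large numbers of features.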

For now, we will focus on the implementations that use clr-transformation, saving a discussion of alr-transformation for later. Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example dataset, the feature pair “a” and “b” will show proportional change, as will the feature pair “c” and “d”.

set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))

Let \(d\) represent any number of features measured across \(n\) observations undergoing a binary or continuous event \(E\). For example, \(n\) could represent subjects differing in case-control status, treatment status, treatment dose, or time. The phit and perb functions ultimately convert a “count matrix” with \(n\) rows and \(d\) columns into a proportionality matrix of \(d\) rows and \(d\) columns containing a \(\phi\) or \(\rho\) measurement for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the class propr. This object contains four slots: @counts (the input counts), @logratio (the *lr-transformed counts), @matrix (the proportionality matrix), and @pairs (the indices of selected feature pairs).

library(propr)
phi <- phit(X, symmetrize = TRUE)
rho <- perb(X, ivar = 0)

Subsetting propr objects

We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (@matrix) based on an inequality operator and a reference value. By design, this method never modifies the proportionality matrix, making it scale well with large datasets.

In this first example, we use [ to index the matrix by \(\rho > .99\). This indexes the location of all values (i.e., in the lower left triangle of the matrix) satisfying that inequality, and saves those indices to the @pairs slot. Indexing helps guide bundled visualization methods in lieu of copy-on-modify subsetting.

rho99 <- rho[">", .99]
rho99@pairs
## [1]  2 12
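To read these two values, it may help to convert them into row and column coordinates. The sketch below assumes that @pairs stores ordinary column-major linear indices into the 4 by 4 proportionality matrix, which is consistent with the lower left triangle described above but is not confirmed here against the package source.

```r
# Assumption: the saved values are column-major linear indices into a
# 4 x 4 matrix whose rows and columns follow the order a, b, c, d.
idx <- c(2, 12)
coords <- arrayInd(idx, .dim = c(4, 4))
features <- c("a", "b", "c", "d")
data.frame(row = features[coords[, 1]], col = features[coords[, 2]])
##   row col
## 1   b   a
## 2   d   c
```

Under that assumption, the two indexed pairs are “a” with “b” and “c” with “d”, matching the pairs built to be proportional in the example dataset.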

Alternatively, using the subset method, we can subset an entire propr object by a vector of feature indices or names. The subset method also provides a convenient way to re-order feature and subject vectors for downstream visualization tools (e.g., image). However, this method does copy-on-modify the proportionality matrix, making it unsuitable for large datasets.

In this second example, we subset by the feature names “a” and “b”.

rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
##           [,1]      [,2]
## [1,] 1.0000000 0.9999151
## [2,] 0.9999151 1.0000000

The convenience function, simplify, can subset an entire propr object based on the index saved to its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it unsuitable for large datasets. Unlike subset, simplify returns an object with the @pairs slot updated.

simplify(rho99)
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: 2 feature pairs

Visualizing pairs

The two features of a highly proportional pair should show approximately linearly related *lr-transformed expression across all subjects. The method plot provides a means by which to visually inspect whether this holds true. Since this function will plot all pairs unless indexed with the [ method, we recommend the user first index or subset the propr object before plotting. “Noisy” correlation between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.

plot(rho99)

Computational burden

Both microarray technology and high-throughput genomic sequencing have the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix of \(d^2\) elements, this method uses a lot of RAM when applied to real biological datasets. To address this issue, the newest version of propr harnesses the power of C++ (via the Rcpp package) to achieve a near 100-fold increase in computational speed and an 80% reduction in RAM overhead. Below, we provide a small table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should allow for up to 25% additional RAM for subsequent [ indexing and visualization.

Features    Peak RAM (MiB)
    1000                 8
    2000                31
    4000               123
    8000               491
   16000              1959
   24000              4405
   32000              7829
   64000             31301
  100000             76406
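For feature counts not listed in the table, a rough lower bound can be computed directly. The sketch below assumes the proportionality matrix is stored as a dense \(d\) by \(d\) matrix of 8-byte doubles; the function name matrix_mib is hypothetical.

```r
# Lower bound on memory for one dense d x d matrix of doubles
# (8 bytes per element), expressed in MiB.
matrix_mib <- function(d) d^2 * 8 / 2^20

features <- c(1000, 2000, 4000, 8000, 16000, 24000, 32000, 64000, 100000)
data.frame(features, matrix_mib = round(matrix_mib(features)))
```

These estimates fall slightly below the tabulated peaks, which suggests that the matrix itself accounts for nearly all of the memory used.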

An in-depth look at clr

We recognize that this package builds off concepts that are not necessarily intuitive. Since the log-ratio transformation of relative data comprises a major portion of proportionality analysis, we decided to dedicate some extra space to this topic specifically. In this section, we discuss the centered log-ratio (clr) and its limitations in the context of proportionality analysis. To this end, we begin by simulating count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.

N <- 100
a <- seq(from = 5, to = 15, length.out = N)
b <- a * rnorm(N, mean = 1, sd = 0.1)
c <- rnorm(N, mean = 10)
d <- rnorm(N, mean = 10)
e <- rep(10, N)
X <- data.frame(a, b, c, d, e)

Let us assume that these data \(X\) represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, \(Y\), by distorting \(X\) accordingly:

Y <- X / rowSums(X) * abs(rnorm(N))

As a “sanity check”, we will confirm that these new feature vectors do in fact contain relative quantities. We do this by calculating the ratio of the second feature vector to the first for both the absolute and relative datasets.

all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE

The following figures compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.

pairs(X)
pairs(Y)

Next, we will see that when we do calculate correlation, the coefficients differ for the absolute and relative datasets. This further demonstrates the spurious correlation.

cor(X)
##             a          b           c          d  e
## a  1.00000000  0.9495487 -0.08429201 -0.1284406 NA
## b  0.94954870  1.0000000 -0.17278967 -0.1183455 NA
## c -0.08429201 -0.1727897  1.00000000 -0.1271698 NA
## d -0.12844062 -0.1183455 -0.12716985  1.0000000 NA
## e          NA         NA          NA         NA  1
cor(Y)
##           a         b         c         d         e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000

However, by calculating the variance of the log-ratios (vlr), defined as the variance of the logarithm of the ratio of two feature vectors, we can arrive at a single measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the vlr, constituting the numerator portion of the \(\phi\) metric and a portion of the \(\rho\) metric as well, is sub-compositionally coherent. Yet, while vlr yields valid results for compositional data, it lacks a meaningful scale.

propr:::proprVLR(Y[, 1:4])
##             a           b          c          d
## a 0.000000000 0.009007394 0.11273963 0.11192702
## b 0.009007394 0.000000000 0.12431341 0.11769259
## c 0.112739635 0.124313413 0.00000000 0.01986009
## d 0.111927021 0.117692593 0.01986009 0.00000000
propr:::proprVLR(X)
##             a           b           c           d           e
## a 0.000000000 0.009007394 0.112739635 0.111927021 0.097960496
## b 0.009007394 0.000000000 0.124313413 0.117692593 0.104219359
## c 0.112739635 0.124313413 0.000000000 0.019860086 0.009516737
## d 0.111927021 0.117692593 0.019860086 0.000000000 0.008167461
## e 0.097960496 0.104219359 0.009516737 0.008167461 0.000000000
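The two properties of the vlr can also be checked with a short base R sketch. The function vlr below is a hypothetical re-implementation written for illustration, not the package's internal routine, and the small dataset is simulated anew with its own seed.

```r
# Hypothetical base R version of the vlr (not propr's internal function).
vlr <- function(counts) {
  logs <- log(as.matrix(counts))
  d <- ncol(logs)
  out <- matrix(0, d, d, dimnames = list(colnames(logs), colnames(logs)))
  for (i in seq_len(d)) for (j in seq_len(d)) {
    out[i, j] <- var(logs[, i] - logs[, j])   # var(log(x_i / x_j))
  }
  out
}

set.seed(1)
N <- 100
X <- data.frame(a = seq(5, 15, length.out = N),
                c = rnorm(N, mean = 10),
                e = rep(10, N))
Y <- X / rowSums(X) * abs(rnorm(N))   # relative version of X

all.equal(vlr(X), vlr(Y))                    # TRUE: absolute vs. relative
all.equal(vlr(Y)[1:2, 1:2], vlr(Y[, 1:2]))   # TRUE: dropping "e" changes nothing
```

Both checks hold because each entry depends only on the ratio of two features within a sample, and any per-sample scaling factor cancels in that ratio.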

Similarly, transformation of a counts matrix by clr also makes the data sub-compositionally coherent. In the calculation of proportionality coefficients, we use the variance about the clr-transformed data to normalize the variance of the log-ratios (vlr). In other words, we adjust the arbitrarily defined vlr by the variance of its individual constituents. In this way, the use of clr-transformed data shifts the vlr-matrix onto a “standardized” scale that compares across all feature pairs.

In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. While the two sets of plots are equivalent, we see a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This relationship is ultimately reflected (at least partially) in the results of phit and perb alike.

pairs(propr:::proprCLR(Y[, 1:4]))
pairs(propr:::proprCLR(X))

However, division of the vlr by the variance of the clr lacks sub-compositional coherence. As such, neither \(\phi\) nor \(\rho\), at least when calculated via clr, yields the same result for absolute and relative data. This may explain why these methods do not, per se, prevent the possible discovery of spurious proportionality.

phit(Y[, 1:4])@matrix
## Calculating phi from "count matrix".
##          [,1]     [,2]      [,3]      [,4]
## [1,] 0.000000 0.328171 4.1075015 4.0778951
## [2,] 0.328171 0.000000 3.9114296 3.7031104
## [3,] 4.107501 3.911430 0.0000000 0.5971697
## [4,] 4.077895 3.703110 0.5971697 0.0000000
phit(X)@matrix
## Calculating phi from "count matrix".
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.0000000 0.2388549 2.9895895 2.9680409 2.5976815
## [2,] 0.2388549 0.0000000 2.9298206 2.7737810 2.4562436
## [3,] 2.9895895 2.9298206 0.0000000 0.8050362 0.3857646
## [4,] 2.9680409 2.7737810 0.8050362 0.0000000 0.3564512
## [5,] 2.5976815 2.4562436 0.3857646 0.3564512 0.0000000
perb(Y[, 1:4])@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]
## [1,]  1.0000000  0.8479235 -0.8571942 -0.9020354
## [2,]  0.8479235  1.0000000 -0.9113638 -0.8627917
## [3,] -0.8571942 -0.9113638  1.0000000  0.6928331
## [4,] -0.9020354 -0.8627917  0.6928331  1.0000000
perb(X)@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  1.0000000  0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,]  0.8876058  1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537  1.0000000  0.5826229  0.7622388
## [4,] -0.8462492 -0.8011329  0.5826229  1.0000000  0.7865827
## [5,] -0.8459643 -0.8035079  0.7622388  0.7865827  1.0000000
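One way to see why the clr-based coefficients change with the set of features is to look at the denominator alone. The sketch below uses a hypothetical helper clr_rows and a freshly simulated dataset to compare the variance of the clr-transformed feature “c” before and after dropping the feature “e”.

```r
# Hypothetical helper: clr applied to each row (sample) of a count matrix.
clr_rows <- function(counts) {
  logs <- log(as.matrix(counts))
  logs - rowMeans(logs)
}

set.seed(1)
N <- 100
X <- data.frame(a = seq(5, 15, length.out = N),
                c = rnorm(N, mean = 10),
                d = rnorm(N, mean = 10),
                e = rep(10, N))

var_full <- var(clr_rows(X)[, "c"])                     # all four features
var_sub  <- var(clr_rows(X[, c("a", "c", "d")])[, "c"]) # "e" removed
c(full = var_full, sub = var_sub)                       # the two values differ
```

Since the geometric mean of each sample depends on every feature included, removing a feature alters the clr variance of those that remain, and with it the denominators of \(\phi\) and \(\rho\).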

Still, in comparing the dependence between “c” and “d” as calculated by \(cor(Y)\) with that of \(\rho(Y)\), it appears that proportionality analysis does offer at least partial protection against spurious results.

cor(Y)
##           a         b         c         d         e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
perb(Y)@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  1.0000000  0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,]  0.8876058  1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537  1.0000000  0.5826229  0.7622388
## [4,] -0.8462492 -0.8011329  0.5826229  1.0000000  0.7865827
## [5,] -0.8459643 -0.8035079  0.7622388  0.7865827  1.0000000

Finally, the reader should note that in this contrived example, \(\phi(X) = \phi(Y)\) and \(\rho(X) = \rho(Y)\), but only because the sum of the feature parts in the relative dataset can explain the whole of the absolute dataset. In other words, this comes from the fact that in crafting the relative dataset, we used information spanning the entire absolute dataset (i.e., rowSums). This is usually not the case when studying biological count data and alone does not imply sub-compositional coherence.

An in-depth look at alr

Unlike the centered log-ratio (clr), which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature \(D\) with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data \(X\), we included “e” as this fixed value.

The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., \(alr(Y)\) with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.

pairs(propr:::proprALR(Y, ivar = 5))
pairs(X[, 1:4])

Again, this gets reflected in the results of perb when we select “e” as the reference.

perb(Y, ivar = 5)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3]        [,4] [,5]
## [1,]  1.00000000  0.95544861 -0.04896295 -0.05464219    0
## [2,]  0.95544861  1.00000000 -0.09299877 -0.04720992    0
## [3,] -0.04896295 -0.09299877  1.00000000 -0.12304138    0
## [4,] -0.05464219 -0.04720992 -0.12304138  1.00000000    0
## [5,]  0.00000000  0.00000000  0.00000000  0.00000000    1
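The “back-calculation” can be made explicit with a small base R sketch. The helper alr_rows is hypothetical and the data are simulated anew; the point is only that, with a reference fixed at 10 in every subject, the alr of the relative data equals the log of the absolute data divided by 10.

```r
# Hypothetical helper: alr applied to each row, with column ivar as reference.
alr_rows <- function(counts, ivar) {
  logs <- log(as.matrix(counts))
  (logs - logs[, ivar])[, -ivar, drop = FALSE]
}

set.seed(1)
N <- 100
X <- data.frame(a = seq(5, 15, length.out = N),
                c = rnorm(N, mean = 10),
                d = rnorm(N, mean = 10),
                e = rep(10, N))            # "e" is fixed in absolute terms
Y <- X / rowSums(X) * abs(rnorm(N))        # relative version of X

# alr(Y) with reference "e" recovers log(X / 10) for the other features.
all.equal(alr_rows(Y, ivar = 4), log(as.matrix(X[, 1:3]) / 10),
          check.attributes = FALSE)        # TRUE
```

Because the recovered values differ from the log of the absolute data only by a constant, their variances and covariances match those of the log-transformed absolute data.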

Now, let us assume these same data, \(X\), actually measure relative counts. In other words, \(X\) is already relative and we do not know the real quantities which correspond to \(X\) absolutely. If we knew that “a” represented a known fixed quantity, we could use alr-transformation again to “back-calculate” the absolute abundances. In this case, we will see that “c”, “d”, and “e” actually do have proportional expression under these conditions. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does change. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decreased together, they act as a highly proportional module.

pairs(propr:::proprALR(X, ivar = 1))

Again, this gets reflected in the results of perb when we select “a” as the reference.

perb(X, ivar = 1)@matrix
## Calculating rho from "count matrix".
##      [,1]        [,2]        [,3]       [,4]       [,5]
## [1,]    1  0.00000000  0.00000000 0.00000000 0.00000000
## [2,]    0  1.00000000 -0.02107964 0.02680645 0.02569491
## [3,]    0 -0.02107964  1.00000000 0.91160199 0.95483279
## [4,]    0  0.02680645  0.91160199 1.00000000 0.96108648
## [5,]    0  0.02569491  0.95483279 0.96108648 1.00000000

We can visualize this module using the bundled visualization method dendrogram.

dendrogram(perb(X, ivar = 1))
## Calculating rho from "count matrix".
## Alert: Generating plot using all feature pairs.

## 
## Call:
## fastcluster::hclust(d = dist)
## 
## Cluster method   : complete 
## Number of objects: 5
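As a rough stand-in for the lost figure, the sketch below clusters the \(\rho\) matrix printed above using base R. It assumes a distance of \(1 - \rho\) and uses stats::hclust rather than fastcluster::hclust; the distance actually used by the dendrogram method is not stated here and may differ.

```r
# The rho matrix printed above for perb(X, ivar = 1), entered by hand.
rho <- matrix(c(
  1,  0.00000000,  0.00000000, 0.00000000, 0.00000000,
  0,  1.00000000, -0.02107964, 0.02680645, 0.02569491,
  0, -0.02107964,  1.00000000, 0.91160199, 0.95483279,
  0,  0.02680645,  0.91160199, 1.00000000, 0.96108648,
  0,  0.02569491,  0.95483279, 0.96108648, 1.00000000),
  nrow = 5, byrow = TRUE,
  dimnames = list(letters[1:5], letters[1:5]))

# Assumption: treat 1 - rho as a dissimilarity between features.
hc <- hclust(as.dist(1 - rho), method = "complete")
groups <- cutree(hc, k = 3)
groups   # "c", "d", and "e" share one group; "a" and "b" stand alone
```

Calling plot(hc) draws the corresponding tree, in which “c”, “d”, and “e” merge at small heights while “a” and “b” join only near a height of 1.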

Returning to our initial assumption that the matrix \(X\) contains absolute count data while the matrix \(Y\) contains relative count data, we can show that alr-transformation not only corrects for spurious proportionality, but also serves as a sub-compositionally coherent metric of dependence. However, unlike the aforementioned vlr, \(\rho\) has a meaningful scale. In the example below, we calculate \(\rho\) using the alr-transformation about the reference “e” for a four-feature sub-composition of the relative count matrix, \(Y\), as well as for the absolute count matrix, \(X\). We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric \(\rho\) yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects. Still, at this point, one might also consider “back-calculating” the absolute abundances and measuring dependence through more conventional means.

perb(Y[, 2:5], ivar = 4)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3] [,4]
## [1,]  1.00000000 -0.09299877 -0.04720992    0
## [2,] -0.09299877  1.00000000 -0.12304138    0
## [3,] -0.04720992 -0.12304138  1.00000000    0
## [4,]  0.00000000  0.00000000  0.00000000    1
perb(X, ivar = 5)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3]        [,4] [,5]
## [1,]  1.00000000  0.95544861 -0.04896295 -0.05464219    0
## [2,]  0.95544861  1.00000000 -0.09299877 -0.04720992    0
## [3,] -0.04896295 -0.09299877  1.00000000 -0.12304138    0
## [4,] -0.05464219 -0.04720992 -0.12304138  1.00000000    0
## [5,]  0.00000000  0.00000000  0.00000000  0.00000000    1

Limitations

Although we developed this package with biological count data in mind, many of the ostensibly compositional biological datasets do not behave in a truly compositional manner. For example, in the setting of gene expression data, measuring the expression of “Gene A” as 1 in one subject and the expression of “Gene B” as 2 in another subject (i.e., the feature vector \([1, 2]\)), does not carry the same information as measuring the expression of “Gene A” as 1000 in one subject and the expression of “Gene B” as 2000 in another subject (i.e., the feature vector \([1000, 2000]\)). As such, these data do not strictly meet the criteria for compositional data. Unfortunately, we do not yet have a model to adequately address this drawback. Therefore, we advise the investigator to proceed with caution when working with such “count compositional” data.

References

  1. Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1–2 (June 2016): 21–36.

  2. Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075.