| Title: | Bayesian Hierarchical Modeling for Label-Free Proteomics |
| Version: | 0.0.3 |
| Description: | Statistical decision in proteomics data using a hierarchical Bayesian model. There are two regression models for describing the mean-variance trend, a gamma regression or a latent gamma mixture regression. The regression model is then used as an Empirical Bayes estimator for the prior on the variance in a peptide. Further, it assumes that each measurement has an uncertainty (increased variance) associated with it that is also inferred. Finally, it tries to estimate the posterior distribution (by Hamiltonian Monte Carlo) for the differences in means for each peptide in the data. Once the posterior is inferred, it integrates the tails to estimate the probability of error from which a statistical decision can be made. See Berg and Popescu for details (<doi:10.1101/2023.05.11.540411>). |
| License: | MIT + file LICENSE |
| URL: | https://github.com/PhilipBerg/baldur |
| Encoding: | UTF-8 |
| Language: | en-US |
| RoxygenNote: | 7.2.3 |
| Biarch: | true |
| Depends: | R (≥ 4.2.0) |
| Imports: | dplyr (≥ 1.0.9), magrittr (≥ 2.0.3), methods, purrr (≥ 0.3.4), Rcpp (≥ 0.12.0), RcppParallel (≥ 5.0.1), rstan (≥ 2.26.0), rstantools (≥ 2.2.0), stats, stringr (≥ 1.0.4), tidyr (≥ 1.2.0), rlang (≥ 1.0.2), Rdpack (≥ 2.4), multidplyr (≥ 0.1.1), ggplot2 (≥ 3.3.6), tibble (≥ 3.1.7), viridisLite (≥ 0.4.1), lifecycle |
| LinkingTo: | BH (≥ 1.66.0), Rcpp (≥ 0.12.0), RcppEigen (≥ 0.3.3.3.0), RcppParallel (≥ 5.0.1), rstan (≥ 2.26.0), StanHeaders (≥ 2.26.0) |
| SystemRequirements: | GNU make |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| LazyData: | true |
| RdMacros: | Rdpack |
| BugReports: | https://github.com/PhilipBerg/baldur/issues |
| NeedsCompilation: | yes |
| Packaged: | 2023-09-12 15:21:02 UTC; Pberg |
| Author: | Philip Berg |
| Maintainer: | Philip Berg <pb1015@msstate.edu> |
| Repository: | CRAN |
| Date/Publication: | 2023-09-18 08:50:05 UTC |
The 'baldur' package.
Description
baldur is a Bayesian hierarchical model for statistical decision
in proteomics data. It models the mean-variance trend with the option of
two different regression models, a gamma regression or a latent gamma
mixture regression. It then the regression model as en Empirical Bayes
estimator for the prior on the variance. Further, it assumes that
each measurement has an uncertainty (increased variance) associated with it
that it also infers. Finally, it tries to estimate the posterior
distribution (by Hamiltonian Monte Carlo) for the differences in means for
each peptide in the data. Once the posterior is inferred, it integrates the
tails to estimate the probability of error from which a statistical
decision can be made.
References
Berg and Popescu (2023). Baldur: Bayesian Hierarchical Modeling For Label-Free Proteomics Exploiting Gamma Dependent Mean-Variance Trends. Under review. Stan Development Team (2022). RStan: the R interface to Stan. R package version 2.21.5. https://mc-stan.org
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Calculate the Mean-Variance trend
Description
Calculates the mean and standard deviation of each row (peptide) and adds them as new columns. Assumes that the condition names are the names in the design matrix.
Usage
calculate_mean_sd_trends(data, design_matrix)
Arguments
data |
A |
design_matrix |
A design matrix for the data (see example). |
Value
A tibble or data.frame with the mean and sd vectors
Examples
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast %>%
# Normalize data
psrn("identifier") %>%
# Get mean-variance trends
calculate_mean_sd_trends(design)
Baldur's empirical Bayes Prior For The Mean In Conditions
Description
Here we assume that the sample mean of each condition is an estimator for the center of the mean prior. In addition, it assumes that the confidence in the prior is proportional to the variance of the peptide.
\boldsymbol{\mu}_0\sim\mathcal{N}(\boldsymbol{\bar{y}},\sigma\boldsymbol{n}_R)
\boldsymbol{n}_R=[\frac{1}{\sqrt{n_1}},\frac{1}{\sqrt{n_2}},\dots,\frac{1}{\sqrt{n_C}}]
Value
A stanmodel that can be used in infer_data_and_decision_model.
Code
The Stan code for this model is given by:
empirical_bayes
S4 class stanmodel 'empirical_bayes' coded as follows:
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of conditions
int C; // number of comparisons to perform
matrix[N, K] x; // design matrix
vector[N] y; // data
matrix[K, C] c; // contrast matrix
real alpha; // alpha prior for gamma
real beta; // beta prior for gamma
vector[N] u; // uncertainty
vector[K] mu_not; // prior mu
}
transformed data{
vector[K] n_k; // per condition reciprocal measurements
row_vector[C] n_c; // per comparison measurements
matrix[K, C] abs_c; // abs of C for n_c calculation
for (i in 1:K) {
for (j in 1:C) {
abs_c[i, j] = abs(c[i, j]);
}
}
for (i in 1:K) {
n_k[i] = 1/sum(x[,i]);
}
n_c = n_k' * abs_c;
n_c = sqrt(n_c);
n_k = sqrt(2*n_k);
}
parameters {
vector[K] mu; // mean vector
real<lower=0> sigma; // error scale
array[C] real y_diff; // difference in coefficients
vector[K] eta; // Error in mean
vector[K] prior_mu_not; // Estimation error
}
transformed parameters{
row_vector[C] mu_diff = mu' * c; // differences in means
vector[K] sigma_mu_not = sigma * n_k; // variance of ybars
vector[C] sigma_lfc = sigma * n_c'; // variance of y_diff
}
model {
sigma ~ gamma(alpha, beta); // variance
eta ~ normal(0, 1); // NCP auxilary variable
prior_mu_not ~ normal(mu_not, sigma_mu_not); // Prior mean
mu ~ normal(prior_mu_not + sigma * eta, sigma); // mean
y ~ normal(x * mu, sigma * u); // data model
y_diff ~ normal(mu_diff, sigma_lfc); // difference statistic
}
Estimate Gamma hyperparameters for sigma
Description
Estimates the hyperparameters for the Bayesian data and decision
model. estimate_gamma_hyperparameters is a wrapper that adds new columns
to the data (one for alpha and one for betas). Note that for lgmr
objects, the estimate_beta function assumes that the data is ordered as
when the model was fitted. If this is not the case, theta's will be
incorrectly matched with peptides—resulting in wrong estimates of beta
parameters. On the other hand, estimate_gamma_hyperparameters will
temporarily sort the data as when fitted and the sort it back as it
was input to the function.
Usage
estimate_gamma_hyperparameters(reg, data, ...)
## S3 method for class 'glm'
estimate_gamma_hyperparameters(reg, data, ...)
## S3 method for class 'lgmr'
estimate_gamma_hyperparameters(reg, data, id_col, ...)
estimate_beta(reg, mean, ...)
## S3 method for class 'glm'
estimate_beta(reg, mean, alpha, ...)
## S3 method for class 'lgmr'
estimate_beta(reg, mean, m, s, ...)
Arguments
reg |
A |
data |
A |
... |
Currently not in use |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
mean |
The mean value of the peptide |
alpha |
The alpha parameter of the peptide |
m |
The mean of the means |
s |
The sd of the means |
Value
estimate_gamma_hyperparameters returns a tibble or data.frame
with the alpha,beta hyperparameters estimates as new columns.
estimate_beta returns estimates of the beta parameter(s)
Examples
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
# Normalize data
yeast_norm <- yeast %>%
psrn("identifier") %>%
# Get mean-variance trends
calculate_mean_sd_trends(design)
# Fit gamma regression (could also have been a lgmr model)
gam_reg <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate priors
gam_reg %>%
estimate_gamma_hyperparameters(yeast_norm)
# Can also explicitly estimate the beta parameters
# Note this is order sensitive.
estimate_beta(gam_reg, yeast_norm$mean, 1/summary(gam_reg)$dispersion)
Estimate measurement uncertainty
Description
Estimates the measurement uncertainty for each data point using a Gamma regression. Calculated as the expected standard deviation for each measurement:
\text{E}[s_i|\omega,y_{ij}]=\exp({f(y_{ij},\omega)})
where \omega are the regression parameters and f is a function
describing the mean relationship between s_i and y_{ij}.
Usage
estimate_uncertainty(reg, data, id_col, design_matrix)
## S3 method for class 'glm'
estimate_uncertainty(reg, data, id_col, design_matrix)
## S3 method for class 'lgmr'
estimate_uncertainty(reg, data, id_col, design_matrix)
Arguments
reg |
A |
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
design_matrix |
Cell mean design matrix for the data |
Value
A matrix with the uncertainty
Examples
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast_norm <- yeast %>%
# Remove missing data
tidyr::drop_na() %>%
# Normalize data
psrn("identifier") %>%
# Add mean-variance trends
calculate_mean_sd_trends(design)
# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate each data point's uncertainty
estimate_uncertainty(gam, yeast_norm, 'identifier', design)
Function for Fitting the Mean-Variance Gamma Regression Models
Description
fit_gamma_regression returns a glm object containing the
gamma regression for the mean-variance trend.
Usage
fit_gamma_regression(data, formula = sd ~ mean, ...)
Arguments
data |
a |
formula |
a formula describing the model |
... |
only for compatibility with other functions |
Value
fit_gamma_regression returns a glm object
Examples
# Generate a design matrix for the data
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
# Set correct colnames, this is important for calculate_mean_sd_trends
colnames(design) <- paste0("ng", c(50, 100))
# Normalize and log-transform the data
yeast_norm <- psrn(yeast, "identifier") %>%
# Add row means and variances
calculate_mean_sd_trends(design)
# Fit gamma regression model for the mean-variance trends
gamma_model <- fit_gamma_regression(yeast_norm, sd ~ mean)
Fit Latent Gamma Mixture Regression
Description
See lgmr_model for model details.
Usage
fit_lgmr(
data,
id_col,
model = lgmr_model,
iter = 6000,
warmup = 1500,
chains = 5,
cores = 1,
return_stanfit = FALSE,
simplify = FALSE,
...
)
## S3 method for class 'lgmr'
print(
x,
simplify = x$simplify,
pars = c("auxiliary", "coefficients"),
digits = 3,
...
)
## S3 method for class 'lgmr'
coef(object, simplify = FALSE, pars = c("coefficients", "auxiliary"), ...)
Arguments
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.). Has to be a unique identifier for each feature. |
model |
Defaults to lgmr_model (see it for details on the model), can
also be an user supplied |
iter |
Total number of samples to draw |
warmup |
Number of warm-up samples to draw |
chains |
Number of chains to run |
cores |
Number of cores to use per chain |
return_stanfit |
Should the |
simplify |
Should only the mean estimates of the posterior be returned? |
... |
Additional arguments to |
x, object |
An |
pars |
If you want to print/extract the regression coefficients, theta, auxiliary (alpha and NRMSE), or all |
digits |
Number of digits to print |
Value
A fitted lgmr model.
Examples
# Define design matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
# Normalize data, calculate M-V trend, and fit LGMR model
yeast_lgmr <- yeast %>%
# Remove missing values
tidyr::drop_na() %>%
# Normalize
psrn("identifier") %>%
# Add the mean-variance trends
calculate_mean_sd_trends(design) %>%
# Fit the model
fit_lgmr("identifier")
# Print everything except thetas
print(yeast_lgmr, pars = c("coefficients", "auxiliary"))
# Extract the mean of the model parameters posterior
yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE)
Sample the Posterior of the data and decision model and generate point estimates
Description
Function to sample the posterior of the Bayesian data and
decision model. It first produces the needed inputs for Stan's sampling()
for each peptide (or protein, PTM, etc.). It then runs the sampling for the
data and decision model. From the posterior, it then collects estimates and
sampling statistics from the posterior of data model and integrates the
decision distribution D. It then returns a tibble() with all the
information for each peptide's posterior (see Value below). There are
major time gains to be made by running this procedure in parallel.
infer_data_and_decision_model has an efficient wrapper around
multidplyr. This will let you to evenly distribute all peptides evenly to
each worker. E.g., two workers will each run half of the peptides in
parallel.
Usage
infer_data_and_decision_model(
data,
id_col,
design_matrix,
contrast_matrix,
uncertainty_matrix,
stan_model = empirical_bayes,
clusters = 1,
h_not = 0,
...
)
Arguments
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
design_matrix |
A design matrix for the data. For the empirical_bayes prior only mean models are allowed (see example). For the weakly_informative prior more complicated design can be used. |
contrast_matrix |
A contrast matrix of the decisions to test. Columns
should sum to |
uncertainty_matrix |
A matrix of observation specific uncertainties |
stan_model |
Which Bayesian model to use. Defaults to empirical_bayes but also allows weakly_informative, or an user supplied function see []. |
clusters |
The number of parallel threads/workers to run on. |
h_not |
The value of the null hypothesis for the difference in means |
... |
Additional arguments passed to
|
Details
The data model of Baldur assumes that the observations of a peptide,
\boldsymbol{Y}, is a normally distributed with a standard deviation,
\sigma, common to all measurements. In addition, it assumes that each
measurement has a unique uncertainty u. It then models all
measurements in the same condition with a common mean \mu. It then
assumes that the common variation of the peptide is caused by the variation
in the \mu As such, it models \mu with the common variance
\sigma and a non-centered parametrization for condition level
effects.
\boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{X}\boldsymbol{\mu},\sigma\boldsymbol{u})\quad
\boldsymbol{\mu}\sim\mathcal{N}(\mu_0+\boldsymbol{\eta}\sigma,\sigma)
It then assumes \sigma to be gamma distributed with hyperparameters
infered from either a gamma regression fit_gamma_regression or a latent
gamma mixture regression fit_lgmr.
\sigma\sim\Gamma(\alpha,\beta)
For details on the two priors for \mu_0 see empirical_bayes or
weakly_informative. Baldur then builds a posterior distribution of the
difference(s) in means for contrasts of interest. In addition, Baldur
increases the precision of the decision as the number of measurements
increase. This is done by weighting the sample size with the contrast
matrix. To this end, Baldur limits the possible contrast of interest such
that each column must sum to zero, and the absolute value of each column
must sum to two. That is, only mean comparisons are allowed.
\boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{i}=\sqrt{\sum_{c=1}^{C}|k_{cm}|n_c^{-1}}
where \boldsymbol{K} is a contrast matrix of interest and
k_{cm} is the contrast of the c:th condition in the m:th contrast of
interest, and n_c is the number of measurements in the c:th
condition. Baldur then integrates the tails of \boldsymbol{D} to
determine the probability of error.
P(\text{\textbf{error}})=2\Phi(-\left|\boldsymbol{\mu}_{\boldsymbol{D}}-H_0\right|\odot\boldsymbol{\tau}_{\boldsymbol{D}})
where H_0 is the null hypothesis for the difference in means,
\Phi is the CDF of the standard normal,
\boldsymbol{\mu}_{\boldsymbol{D}} is the posterior mean of
\boldsymbol{D}, \boldsymbol{\tau}_{\boldsymbol{D}} is the
posterior precision of \boldsymbol{D}, and \odot is the
Hadamard product.
Value
A tibble() or data.frame() annotated with statistics of the
posterior and p(error). err is the probability of error, i.e., the two
tail-density supporting the null-hypothesis, lfc is the estimated
\log_2-fold change, sigma is the common variance, and lp is the
log-posterior. Columns without suffix shows the mean estimate from the
posterior, while the suffixes _025, _50, and _975, are the 2.5, 50.0,
and 97.5, percentiles, respectively. The suffixes _eff and _rhat are
the diagnostic variables returned by Stan. In general, a larger _eff
indicates effective sample size and _rhat indicates the mixing within
chains and between the chains and should be smaller than 1.05 (please see
the Stan manual for more details). In addition it returns a column
warnings with potential warnings given by the sampler.
Examples
# (Please see the vignette for a tutorial)
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast_norm <- yeast %>%
# Remove missing data
tidyr::drop_na() %>%
# Normalize data
psrn('identifier') %>%
# Add mean-variance trends
calculate_mean_sd_trends(design)
# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design)
yeast_norm <- gam %>%
# Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm)
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)
yeast_norm %>%
head() %>% # Just running a few for the example
infer_data_and_decision_model(
'identifier',
design,
contrast,
unc,
clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
# this will greatly reduce the speed of running baldur
)
Latent Gamma Regression Model
Description
Baldur has the option to use the Latent Gamma Regression Model
(LGMR). This model attempts to estimate local Mean-Variance (M-V) trend for
each peptide (or protein, PTM, etc.). To this end is describes the M-V
trend as a regression with a common link function and a latent one. While
there may not be a latent trend in real data, it allows for a mathematical
description that can estimate the local trends of each peptide. The model
describes the standard deviation (S) as a gamma random variable with mean
dependency described by the sample mean (\bar{y}):
\boldsymbol{S}\sim\Gamma(\alpha,\beta),\quad\boldsymbol{\beta}=\frac{\alpha}{\boldsymbol{\mu}}
\boldsymbol{\mu}=\exp(\gamma_0-\gamma_{\bar{y}}\,f(\boldsymbol{\bar{y}}))
+ \kappa
\exp(\boldsymbol{\theta}(\gamma_{0L}-\gamma_{\bar{y}L}\,f(\boldsymbol{\bar{y}})),\quad
f(\boldsymbol{x})=\frac{\boldsymbol{x}-\mu_{\bar{y}}}{\sigma_{\bar{y}}}
Here, \exp(\gamma_0-\gamma_{\bar{y}}f(\boldsymbol{\bar{y}})) is the common trend of the
regression. Then, the mixing variable, \boldsymbol{\theta}, describes
the proportion of the mixture that each peptide has, and \kappa is
just some small constant such that when \theta is zero the latent
term is small.
Details
Next we will describe the hierarchical prior setup for the
regression variables.
For \alpha baldur uses a half-Cauchy as a weakly informative prior:
\alpha\sim\text{Half-Cauchy}(25)
For the latent contribution to the i:th peptide, \theta_i, has an uniform distribution:
\theta_i\sim\mathcal{U}(0, 1)
Further, it can be seen that Baldur assumes
that S always decreases (on average) with the sample mean. To this end, both
slopes (\gamma_{\bar{y}},\gamma_{\bar{y}L}) are defined on the real
positive line. Hence, we used Half-Normal (HN)
distributions to describe the slope parameters:
\gamma_{\bar{y}}\sim
HN(1)
\gamma_{\bar{y}L}\sim HN(1)
For the intercepts, we assume a
standard normal prior for the common intercept. Then, we use a skew-normal to
model the latent intercept. The reason for this is two-fold, one,
\kappa will decrease the value of the latent term and, two, to push the
latent trend upwards. The latter reason is so that the latent intercept is
larger than the common and so that the priors prioritize a shift in intercept
over a increase in slope.
For the intercepts, Baldur uses a standard normal prior for the common intercept.
\gamma_0\sim\mathcal{N}(0,1)
While for the latent trend, it uses a skew-normal (SN) to push up the second
trend and to counteract the shrinkage of \kappa.
\gamma_{0L}\sim\mathcal{SN}(2,15,35)
Value
A stanmodel that can be used in fit_lgmr.
Code
The Stan code for this model is given by:
lgmr_model
S4 class stanmodel 'lgmr_model' coded as follows:
functions {
// mu
vector reg_function(
vector x,
vector theta,
real I,
real I_L,
real S,
real S_L,
int N
) {
vector[N] exp_beta = .001*exp(theta .* (I_L - S_L*x));
exp_beta += exp(I - S*x);
return exp_beta;
}
}
data {
int<lower=0> N; // Number of observations
vector<lower=0>[N] y; // sd
vector[N] x; // mean
}
transformed data {
real v_y = variance(y); // for NRMSE
vector[N] x_star = (x - mean(x))/sd(x); // f(y_bar)
vector[3] lb = [0, 0, negative_infinity()]'; // Boundaries normal coefs.
}
parameters {
real<lower = 0> alpha; // Shape
vector<lower = lb>[3] eta; // For S, S_L, I
vector<lower = 0, upper = 1>[N] theta; // Mixture parameter
real I_L; // Intercept Latent
}
transformed parameters {
real<lower = 0> S = eta[1]; // Slope common
real<lower = 0> S_L = eta[2]; // Slope Latent
real I = eta[3]; // Intercep common
}
model {
//Priors
alpha ~ cauchy(0, 25);
eta ~ std_normal();
I_L ~ skew_normal(2, 15, 35);
theta ~ beta(1, 1);
{
vector[N] exp_beta = reg_function(x_star, theta, I, I_L, S, S_L, N);
// Likelihood
y ~ gamma(alpha, alpha ./ exp_beta);
}
}
generated quantities {
// NRMSE calculations
real nrmse;
{
vector[N] se = reg_function(x_star, theta, I, I_L, S, S_L, N);
se -= y;
se = square(se);
nrmse = mean(se) / v_y;
}
nrmse = sqrt(nrmse);
}
Visualization of LGMR models
Description
Options to plot the LGMR model.
plot_lgmr_regression will plot the data
colored by the amount of latent trend they have as well as the two extreme
regression cases when \theta is zero or one. plot_regression_field
will plot the local regression trend for each data point as a vector field and
color the vector based on the derivative at the mean of the peptide.
Usage
plot_lgmr_regression(model)
plot_regression_field(model, n = 10, rng = 10)
Arguments
model |
An LGMR model object |
n |
Number of interpolation points for each peptides regression |
rng |
The proportional range of each peptides regression. E.g., a value of 10 will make each line span 1 % of the x-axis. |
Value
A ggplot object
Examples
#' # Define design matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
# Normalize data, calculate M-V trend, and fit LGMR model
yeast_lgmr <- yeast %>%
# Remove missing values
tidyr::drop_na() %>%
# Normalize
psrn("identifier") %>%
# Add the mean-variance trends
calculate_mean_sd_trends(design) %>%
# Fit the model
fit_lgmr("identifier")
# Print everything except thetas
print(yeast_lgmr, pars = c("coefficients", "auxiliary"))
# Extract the mean of the model parameters posterior
yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE)
plot_lgmr_regression(yeast_lgmr)
plot_regression_field(yeast_lgmr)
Function for plotting the gamma regression for the mean-variance trend
Description
Generates a scatter plot with the gamma regressions of the mean-variance trends without partitioning
Usage
plot_gamma(data)
Arguments
data |
The data to use for producing the plots. |
Value
a plot with the estimated mean-variance trend
Examples
# Produce a design matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
# Normalize and log transform the data
yeast_norm <- psrn(yeast, "identifier")
# Generate the plot
yeast_norm %>%
calculate_mean_sd_trends(design) %>%
plot_gamma()
Function for plotting the mean-variance gamma regressions
Description
Generates a scatter plot with the gamma regressions of the mean-variance
trend. Can either be ran directly with plot_gamma_regression and inputing
a design matrix, or with plot_gamma if the M-V trend has been added to the
data with calculate_mean_sd_trends().
Usage
plot_gamma_regression(data, design)
Arguments
data |
The data to use for producing the plots. |
design |
A design matrix as produced by
|
Value
a plot with the mean-variance trend before partitioning on the left side, and the right side after.
Examples
# Produce a design matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
# Normalize and log transform the data
yeast %>%
# Remove missing data
# Note that this could be replaced with imputation
tidyr::drop_na() %>%
# Normalize
psrn("identifier") %>%
plot_gamma_regression(design)
Plot the trend between the log fold-change and sigma, coloring significant hits
Description
plot_sa returns a ggplot with a graphical representation between the log
fold-change and sigma.
Usage
plot_sa(results, alpha = 0.05, lfc = NULL)
Arguments
results |
Output generated by
|
alpha |
Significance cut-off; used to draw a line indicating where significance starts |
lfc |
LFC cut-off; used to draw lines for |
Value
plot_sa returns a ggplot object
Examples
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast_norm <- yeast %>%
# Remove missing data
tidyr::drop_na() %>%
# Normalize data
psrn('identifier') %>%
# Add mean-variance trends
calculate_mean_sd_trends(design)
# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, "identifier", design)
yeast_norm <- gam %>%
# Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm)
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)
results <- yeast_norm %>%
head() %>% # Just run a few for the example
infer_data_and_decision_model(
'identifier',
design,
contrast,
unc,
clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
# this will greatly reduce the speed of running baldur
)
# Plot with alpha = 0.05
plot_sa(results, alpha = 0.05)
# Plot with alpha = 0.01 and show LFC = 1
plot_sa(results, alpha = 0.01, 1)
Plot the -log10(err) against the log fold-change
Description
plot_volcano returns a ggplot with a graphical representation between the
-log10(err) and LFC.
Usage
plot_volcano(results, alpha = 0.05, lfc = NULL)
Arguments
results |
Output generated by
|
alpha |
Significance cut-off; used to draw a line indicating where significance starts |
lfc |
LFC cut-off; used to draw lines for |
Value
plot_volcano returns a ggplot object
Examples
# Setup model matrix
design <- model.matrix(~ 0 + factor(rep(1:2, each = 3)))
colnames(design) <- paste0("ng", c(50, 100))
yeast_norm <- yeast %>%
# Remove missing data
tidyr::drop_na() %>%
# Normalize data
psrn('identifier') %>%
# Add mean-variance trends
calculate_mean_sd_trends(design)
# Fit the gamma regression
gam <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Estimate each data point's uncertainty
unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design)
yeast_norm <- gam %>%
# Add hyper-priors for sigma
estimate_gamma_hyperparameters(yeast_norm)
# Setup contrast matrix
contrast <- matrix(c(-1, 1), 2)
results <- yeast_norm %>%
head() %>% # Just run a few for the example
infer_data_and_decision_model(
'identifier',
design,
contrast,
unc,
clusters = 1 # I highly recommend increasing the number of parallel workers/clusters
# this will greatly reduce the speed of running baldur
)
# Plot indicating where alpha = 0.05
plot_volcano(results, alpha = 0.05)
# Plot indicating where alpha = 0.01 and where LFC = 1
plot_volcano(results, alpha = 0.01, 1)
Normalize data to a pseudo-reference
Description
This function generates a pseudo-reference by taking the geometric mean of each peptide across all samples. Each peptide in each sample is then divided by the pseudo-reference. Then, the median ratio of all ratios is used as an estimate to use for normalizing differences in loading concentration. All features in each sample is then divided by their corresponding estimate. All estimates are based on features without missing values. For details see Anders and Huber (2010).
Usage
psrn(data, id_col, log = TRUE, load_info = FALSE, target = NULL)
Arguments
data |
data.frame |
id_col |
a character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
log |
boolean variable indicating if the data should be log transformed after normalization |
load_info |
logical; should the load information be output? |
target |
target columns to normalize, supports
|
Value
data frame with normalized values if load_info=FALSE, if it is TRUE
then it returns a list with two tibbles. One tibble containing the
normalized data and one containing the loading info as well as the
estimated normalization factors.
Source
https://www.nature.com/articles/npre.2010.4282.1
References
Simon Anders, Wolfgang Huber (2010). “Differential expression analysis for sequence count data.” Nature Precedings, 1–1.
Examples
yeast_psrn <- psrn(yeast, "identifier")
yeast_psrn_with_load <- psrn(yeast, "identifier", load_info = TRUE)
yeast_ng50_only <- psrn(yeast, "identifier", target = matches('ng50'))
Spiked-in data set of peptides
Description
A dataset containing quantification of peptides using Progenesis. True
positives peptides spiked-in from the Universal Proteomics Standard Set 1
(UPS1) at three different concentrations and true negatives from
Chlamydomonas reinhardtii with the same concentration in all samples.
You can find true positives with stringr::str_detect(ups$identifier, 'UPS'). For details see Berg et al. (2019) and
if you use this dataset please cite the same paper.
Usage
ups
Format
A data frame with 10599 rows and 13 variables:
- identifier
id column for features, true positives contains UPS and true negatives contains Cre
- fmol25_1,fmol25_2,fmol25_3,fmol25_4
Technical replicates with true positives spiked-in from 25 fmol UPS1 peptides
- fmol50_1,fmol50_2,fmol50_3,fmol50_4
Technical replicate with true positives spiked-in from 50 fmol UPS1 peptides
- fmol100_1,fmol100_2,fmol100_3,fmol100_4
Technical replicate with true positives spiked-in from 100 fmol UPS1 peptides
Source
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2619-6
References
Philip Berg, Evan W McConnell, Leslie M Hicks, Sorina C Popescu, George V Popescu (2019). “Evaluation of linear models and missing value imputation for the analysis of peptide-centric proteomics.” BMC bioinformatics, 20(2), 7–16.
Baldur's weakly informative prior for the mean in conditions
Description
Here we will model the mean of the prior with a weakly
informative (WI) prior. We will assume that, in essence, nothing is know
about the mean. As such, for the WI prior, we use a normal prior on
\boldsymbol{\mu}_0 centered at zero and with a very large variance.
\boldsymbol{\mu}_0\sim\mathcal{N}(0,100)
Value
A stanmodel that can be used in infer_data_and_decision_model.
Code
The Stan code for this model is given by:
weakly_informative
S4 class stanmodel 'weakly_informative' coded as follows:
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of conditions
int C; // number of comparisons to perform
matrix[N, K] x; // design matrix
vector[N] y; // data
matrix[K, C] c; // contrast matrix
real alpha; // alpha prior for gamma
real beta; // beta prior for gamma
vector[N] u; // uncertainty
}
transformed data{
vector[K] n_k; // per condition measurements
row_vector[C] n_c; // per comparison measurements
matrix[K, C] abs_c; // abs of C for n_c calculation
for (i in 1:K) {
for (j in 1:C) {
abs_c[i, j] = abs(c[i, j]);
}
}
for (i in 1:K) {
n_k[i] = 1/sum(x[,i]);
}
n_c = n_k' * abs_c;
n_c = sqrt(n_c);
}
parameters {
vector[K] mu; // coefficients for predictors
real<lower=0> sigma; // error scale
array[C] real y_diff; // difference in coefficients
vector[K] eta; // Error in mean
vector[K] prior_mu_not; // Estimation error
}
transformed parameters{
row_vector[C] mu_diff = mu' * c; // differences in means
vector[C] sigma_lfc = sigma * n_c'; // variance of y_diff
}
model {
sigma ~ gamma(alpha, beta); // variance
eta ~ normal(0, 1); // NCP auxilary variable
prior_mu_not ~ normal(0, 10); // prior mean
mu ~ normal(prior_mu_not + sigma*eta, sigma); // mean
y ~ normal(x * mu, sigma*u); // data model
y_diff ~ normal(mu_diff, sigma_lfc); // difference statistic
}
Spiked-in data set of reversibly oxidized cysteines
Description
A dataset containing quantification of reversibly oxidized cysteines using
Progenesis. True positives cysteines spiked-in from yeast at two different
concentrations and true negatives from Chlamydomonas reinhardtii with
the same concentration in all samples. To identify true positives one can use
stringr::str_detect(yeast$identifier, 'YEAST'). For details see
Berg et al. (2019) and if you use this dataset
please cite the same paper.
Usage
yeast
Format
A data frame with 2235 rows and 7 variables:
- identifier
id column for features, true positives contains YEAST and true negatives contains Cre
- ng50_1,ng50_2,ng50_3
Biological replicates with true positives spiked-in from 50 ng yeast cells
- ng100_1,ng100_2,ng100_3
Biological replicates with true positives spiked-in from 100 ng yeast cells
Source
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2619-6
References
Philip Berg, Evan W McConnell, Leslie M Hicks, Sorina C Popescu, George V Popescu (2019). “Evaluation of linear models and missing value imputation for the analysis of peptide-centric proteomics.” BMC bioinformatics, 20(2), 7–16.