Title: | Individual Polygenic Risk Score Uncertainty Estimation |
Version: | 1.0.0 |
Description: | Provides tools for estimating uncertainty in individual polygenic risk scores (PRSs) using both sampling-based and analytical methods, as well as the Best Linear Unbiased Estimator (BLUE). These methods quantify variability in PRS estimates for both binary and quantitative traits. See Henderson (1975) <doi:10.2307/2529430> for more details. |
URL: | https://github.com/DoviniJ/iPRSue |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 2.10) |
Imports: | data.table, bigstatsr, logistf, stats, utils |
NeedsCompilation: | no |
Packaged: | 2025-09-05 08:02:47 UTC; jaydy007 |
Author: | Dovini Jayasinghe [aut, cre, cph], Hong Lee [aut, cph] |
Maintainer: | Dovini Jayasinghe <dovini.jayasinghe@mymail.unisa.edu.au> |
Repository: | CRAN |
Date/Publication: | 2025-09-10 07:30:02 UTC |
BLUE_estimates_BT function
Description
Estimates individual-level polygenic risk scores (PRS) with uncertainty using a frequentist approach for binary traits. This implementation applies Firth's bias-reduced logistic regression on the discovery sample, computes the coefficient covariance matrix, and uses the delta method to derive PRS variance and confidence intervals.
Usage
BLUE_estimates_BT(
discovery_pheno,
discovery_geno_mat,
target_pheno,
target_geno_mat,
significance_level = 0.05,
max_iterations = 100
)
Arguments
discovery_pheno |
Character. Path to the phenotype file for the discovery dataset. Assumes no header and that the binary trait is in the third column. |
discovery_geno_mat |
Character. Path to the genotype matrix file for the discovery dataset. Assumes no header. |
target_pheno |
Character. Path to the phenotype file for the target dataset. Assumes no header and individual IDs in the second column. |
target_geno_mat |
Character. Path to the genotype matrix file for the target dataset. Assumes no header. |
significance_level |
Numeric. Significance level for confidence intervals (e.g., 0.05 for 95% CI). Default is 0.05. |
max_iterations |
Integer. Maximum number of iterations allowed in Firth logistic regression. Default is 100. |
Details
The function fits a Firth logistic regression model using the logistf
package to reduce small-sample bias in the discovery set.
It extracts SNP effect estimates and their covariance matrix, and propagates this uncertainty through to the individual-level
PRS in the target dataset via the delta method. Confidence intervals are derived assuming normality.
Missing or non-estimable coefficients and variances are set to zero.
Value
A data frame with the following columns:
- IID
Individual identifier (from the target phenotype file).
- PRS
Estimated polygenic risk score for each individual.
- Variance
Estimated variance of the PRS.
- Lower_Limit
Lower bound of the confidence interval.
- Upper_Limit
Upper bound of the confidence interval.
Examples
bpd <- system.file("Bpd_0_1.txt", package = "iPRSue", mustWork = TRUE)
bpt <- system.file("Bpt.txt", package = "iPRSue", mustWork = TRUE)
gd <- system.file("Gd.txt", package = "iPRSue", mustWork = TRUE)
gt <- system.file("Gt.txt", package = "iPRSue", mustWork = TRUE)
results <- BLUE_estimates_BT(
discovery_pheno = bpd,
discovery_geno_mat = gd,
target_pheno = bpt,
target_geno_mat = gt,
significance_level = 0.05,
max_iterations = 100
)
head(results)
BLUE_estimates_QT function
Description
Estimates individual-level polygenic risk scores (PRS) with uncertainty using a frequentist approach for quantitative traits. This implementation fits a multiple linear regression model in the discovery dataset, computes the coefficient covariance matrix, and applies the delta method to propagate uncertainty to the target dataset.
Usage
BLUE_estimates_QT(
discovery_pheno,
discovery_geno_mat,
target_pheno,
target_geno_mat,
significance_level = 0.05
)
Arguments
discovery_pheno |
Character. Path to the phenotype file for the discovery dataset. Assumes no header and that the quantitative trait is in the third column. |
discovery_geno_mat |
Character. Path to the genotype matrix file for the discovery dataset. Assumes no header. |
target_pheno |
Character. Path to the phenotype file for the target dataset. Assumes no header and individual IDs in the second column. |
target_geno_mat |
Character. Path to the genotype matrix file for the target dataset. Assumes no header. |
significance_level |
Numeric. Significance level for confidence intervals (e.g., 0.05 for 95% CI). Default is 0.05. |
Details
The function fits a multiple linear regression model (lm
) using the discovery data. The estimated SNP effects
and their covariance matrix are used to compute PRS and associated uncertainty for each individual in the target dataset.
Confidence intervals are constructed using the normal approximation.
Missing or non-estimable coefficients and variances are set to zero.
Value
A data frame with the following columns:
- IID
Individual identifier (from the target phenotype file).
- PRS
Estimated polygenic risk score for each individual.
- Variance
Estimated variance of the PRS.
- Lower_Limit
Lower bound of the confidence interval.
- Upper_Limit
Upper bound of the confidence interval.
Examples
qpd <- system.file("Qpd.txt", package = "iPRSue", mustWork = TRUE)
qpt <- system.file("Qpt.txt", package = "iPRSue", mustWork = TRUE)
gd <- system.file("Gd.txt", package = "iPRSue", mustWork = TRUE)
gt <- system.file("Gt.txt", package = "iPRSue", mustWork = TRUE)
results <- BLUE_estimates_QT(
discovery_pheno = qpd,
discovery_geno_mat = gd,
target_pheno = qpt,
target_geno_mat = gt,
significance_level = 0.05
)
head(results)
GWAS_BT function
Description
Performs genome-wide association study (GWAS) using plink2 logistic model and outputs the GWAS summary statistics with additive SNP effects (beta) and standard errors (se)
Usage
GWAS_BT(plink_path, b_file, discovery_pheno, discovery_cov, thread = 20)
Arguments
plink_path |
Path to the PLINK executable application |
b_file |
Prefix of the binary files, where all .fam, .bed and .bim files have a common prefix |
discovery_pheno |
Name (with file extension) of the phenotype file containing family ID, individual ID and phenotype of the discovery dataset as columns, without heading |
discovery_cov |
Name (with file extension) of the covariate file containing family ID, individual ID, and covariate(s) of the discovery dataset as columns, without heading. If no covariate is used, have a constant column (e.g. vector of 1s) |
thread |
Number of threads used (default = 20) |
Details
The function uses logistic regression to regress the binary phenotype (1/2 coding for controls/cases) on each SNP separately using plink 2. Then the estimated effects and standard errors are adjusted for standardization. It is optional to employ covariates in the model. If no covariate is used, create your covariate file with a constant in the 3rd column (e.g. vector of 1s).
Value
A data frame with two columns:
- beta
Estimated effect size (log odds) for each SNP.
- se
Standard error of the estimated effect size.
Examples
## Not run:
results <- GWAS_BT(
plink_path = "./plink2",
b_file = "./binary_file_prefix",
discovery_pheno = "./discovery_phenotype_file",
discovery_cov = "./discovery_covariate_file",
thread = 48
)
head(gwas_results)
## End(Not run)
GWAS_QT function
Description
Performs genome-wide association study (GWAS) using plink2 linear model and outputs the GWAS summary statistics with additive SNP effects (beta) and standard errors (se)
Usage
GWAS_QT(plink_path, b_file, discovery_pheno, discovery_cov, thread = 20)
Arguments
plink_path |
Path to the PLINK executable application |
b_file |
Prefix of the binary files, where all .fam, .bed and .bim files have a common prefix |
discovery_pheno |
Name (with file extension) of the phenotype file containing family ID, individual ID and phenotype of the discovery dataset as columns, without heading |
discovery_cov |
Name (with file extension) of the covariate file containing family ID, individual ID, and covariate(s) of the discovery dataset as columns, without heading. If no covariate is used, have a constant column (e.g. vector of 1s) |
thread |
Number of threads used (default = 20) |
Details
The function uses linear regression to regress the quantitative phenotype on each SNP separately using plink 2. Then the estimated effects and standard errors are adjusted for standardization. The phenotype is standardized prior to analysis. It is optional to employ covariates in the model. If no covariate is used, create your covariate file with a constant in the 3rd column (e.g. vector of 1s).
Value
A data frame with two columns:
- beta
Estimated effect size from linear regression.
- se
Standard error of the effect size estimate.
Examples
## Not run:
results <- GWAS_QT(
plink_path = "./plink2",
b_file = "./binary_file_prefix",
discovery_pheno = "./discovery_phenotype_file",
discovery_cov = "./discovery_covariate_file",
thread = 48
)
head(results)
## End(Not run)
iPRSue_estimates_BT function
Description
Computes individual-level polygenic risk scores (PRS) with uncertainty estimates using a simulation-based approach for binary traits. This implementation follows the iPRSue framework, simulating multiple PRSs by sampling from the GWAS effect size distribution and deriving individual-level confidence intervals.
Usage
iPRSue_estimates_BT(
gwas,
target_pheno,
target_geno_mat,
no_of_PRSs = 500,
significance_level = 0.05,
seed = NULL
)
Arguments
gwas |
A data frame with GWAS summary statistics for binary traits. Must contain |
target_pheno |
Character. Path to the target phenotype file. Assumes no header and individual IDs in the second column. |
target_geno_mat |
Character. Path to the genotype matrix of target individuals. No header is expected; columns correspond to SNPs. |
no_of_PRSs |
Integer. Number of simulations used to construct PRS uncertainty intervals. Default is 500. |
significance_level |
Numeric. Significance level for confidence intervals (e.g., 0.05 gives 95% CI). Default is 0.05. |
seed |
Integer or NULL. Random seed for reproducibility. If NULL, results may vary across runs. Default is NULL. |
Details
For each SNP, the function simulates no_of_PRSs
effect sizes from a normal distribution defined by its GWAS beta and SE.
These sampled betas are multiplied by the genotype matrix to generate PRS replicates for each individual.
Confidence intervals are then calculated using the specified significance level.
This function is designed for binary traits and should be used with GWAS summary statistics obtained from logistic regression.
Value
A data frame containing the following columns:
- IID
Individual identifier (from target phenotype file).
- PRS
Mean of simulated PRSs for each individual.
- Variance
Variance across simulated PRSs.
- Lower_Limit
Lower bound of the confidence interval.
- Upper_Limit
Upper bound of the confidence interval.
Examples
## Not run:
# Step 1: Run GWAS on binary trait
results <- GWAS_BT(
plink_path = "./plink2",
b_file = "./binary_file_prefix",
discovery_pheno = "./discovery_phenotype_file",
discovery_cov = "./discovery_covariate_file",
thread = 48
)
# Step 2: Estimate individual PRS with uncertainty
bpt <- system.file("Bpt.txt", package = "iPRSue", mustWork = TRUE)
gt <- system.file("Gt.txt", package = "iPRSue", mustWork = TRUE)
prs_estimates <- iPRSue_estimates_BT(
gwas = results,
target_pheno = bpt,
target_geno_mat = gt,
no_of_PRSs = 500,
significance_level = 0.05,
seed = 123
)
head(prs_estimates)
## End(Not run)
iPRSue_estimates_QT function
Description
Computes individual-level polygenic risk scores (PRS) with uncertainty estimates using a simulation-based approach for quantitative traits. This implementation follows the iPRSue framework, simulating multiple PRSs by sampling from the GWAS effect size distribution and deriving individual-level confidence intervals.
Usage
iPRSue_estimates_QT(
gwas,
target_pheno,
target_geno_mat,
no_of_PRSs = 500,
significance_level = 0.05,
seed = NULL
)
Arguments
gwas |
A data frame with GWAS summary statistics for a quantitative trait. Must contain |
target_pheno |
Character. Path to the target phenotype file. Assumes no header and individual IDs in the second column. |
target_geno_mat |
Character. Path to the genotype matrix of target individuals. No header is expected; columns correspond to SNPs. |
no_of_PRSs |
Integer. Number of simulations used to construct PRS uncertainty intervals. Default is 500. |
significance_level |
Numeric. Significance level for confidence intervals (e.g., 0.05 gives 95% CI). Default is 0.05. |
seed |
Integer or NULL. Random seed for reproducibility. If NULL, results may vary across runs. Default is NULL. |
Details
For each SNP, the function simulates no_of_PRSs
effect sizes from a normal distribution defined by its GWAS beta and SE.
These sampled betas are multiplied by the genotype matrix to generate PRS replicates for each individual.
Confidence intervals are then calculated using the specified significance level.
This function is designed for quantitative traits and should be used with GWAS summary statistics obtained from linear regression.
Value
A data frame containing the following columns:
- IID
Individual identifier (from target phenotype file).
- PRS
Mean of simulated PRSs for each individual.
- Variance
Variance across simulated PRSs.
- Lower_Limit
Lower bound of the confidence interval.
- Upper_Limit
Upper bound of the confidence interval.
Examples
## Not run:
# Step 1: Run GWAS on quantitative trait
results <- GWAS_QT(
plink_path = "./plink2",
b_file = "./binary_file_prefix",
discovery_pheno = "./discovery_phenotype_file",
discovery_cov = "./discovery_covariate_file",
thread = 48
)
# Step 2: Estimate individual PRS with uncertainty
qpt <- system.file("Qpt.txt", package = "iPRSue", mustWork = TRUE)
gt <- system.file("Gt.txt", package = "iPRSue", mustWork = TRUE)
prs_estimates <- iPRSue_estimates_QT(
gwas = results,
target_pheno = qpt,
target_geno_mat = gt,
no_of_PRSs = 500,
significance_level = 0.05,
seed = 123
)
head(prs_estimates)
## End(Not run)