The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Chapters 03-10 describe the infrastructure of
nmathopencl: the shim layer, the nmath
library, the kernel format, and the runner API. This chapter shows how
to put all of it together by walking through a concrete downstream
use-case: the ex_glmbayes example built into the
package.
ex_glmbayes implements GPU-accelerated evaluation of the
log-posterior and gradient (f2/f3) for
Bayesian GLMs (Gaussian, Gamma, binomial with logit/probit/cloglog
links, and Poisson). It is the type of component a package author would
build on top of nmathopencl to accelerate their own
likelihood computations. The name ex_glmbayes signals that
this is an example, not a core part of the library; it lives entirely in
ex_glmbayes_*-prefixed files and within the
ex_glmbayes C++ namespace.
For each GLM family, determine which nmath functions the GPU kernel must call:
| Kernel | nmath function(s) called | Link function |
|---|---|---|
f2_f3_binomial_logit.cl |
dbinom_raw |
logit |
f2_f3_binomial_probit.cl |
dbinom_raw, pnorm5,
dnorm4 |
probit |
f2_f3_binomial_cloglog.cl |
dbinom_raw |
cloglog |
f2_f3_gamma.cl |
dgamma |
log |
f2_f3_gaussian.cl |
dnorm4 |
identity/log |
f2_f3_poisson.cl |
lgamma (OpenCL built-in) |
log |
For f2_f3_poisson, lgamma is an OpenCL
built-in (part of the OpenCL 1.1 math specification) so no nmath files
are needed at all. For the others, consult the
@all_depends_nmath tag of each kernel file to get the full
transitive dependency list.
The @all_depends_nmath tag in each kernel file gives the
complete transitive closure. Taking the union across the five
non-Poisson kernels yields 20 files (17 nmath functions +
Rmath, nmath, refactored as
infrastructure):
Rmath, nmath, refactored,
chebyshev, cospi, fmax2, gammalims, lgammacor,
gamma, lgamma,
stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0,
dpois, dgamma, dnorm, pnorm, dbinom
These files are stored in inst/cl/ex_glmbayes_nmath/.
Rather than copying them manually, use
extract_library_subset() to derive this directory
automatically from the kernel annotations:
nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
src_dir <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl")
# Load the pre-built dependency index (read once, reuse across calls)
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))
result <- extract_library_subset(
kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
dest_dir = "inst/cl/ex_glmbayes_nmath",
depends_tag = "depends_nmath",
index = idx
)extract_library_subset() copies the 20 .cl
files plus both index files
(kernel_dependency_index.rds and
kernel_dependency_index.tsv) into the destination
directory. The index files are required for the indexed loading
described in Step 5.
A package developer would create an analogous subdirectory for their own kernel set using the same function.
Each kernel file in inst/cl/ex_glmbayes_src/ starts with
the dependency metadata block followed by the kernel body:
// @library_deps: nmath
// @calls_nmath: dbinom_raw
// @depends_nmath: dbinom
// @all_depends_nmath_count: 17
// @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2,
// gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils,
// stirlerr_cycle_dependent, stirlerr, bd0, dbinom
// @calls_opencl_builtin: (none)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_printf : enable
#define MAX_L2 64 // upper bound on l2
__kernel void f2_f3_binomial_logit(
__global const double* X, // design matrix, size = l1 x l2, col-major
__global const double* B, // grid, size = m1 x l2
__global const double* mu, // prior mean, length = l2
__global const double* P, // prior precision, size = l2 x l2
__global const double* alpha, // offset, length = l1
__global const double* y, // response, length = l1
__global const double* wt, // weights, length = l1
__global double* qf, // out: neg log-posterior, length = m1
__global double* grad, // out: gradient, size = m1 x l2
const int l1,
const int l2,
const int m1
) {
int j = get_global_id(0);
if (j >= m1) return;
// prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu)
double tmp[MAX_L2];
for (int k = 0; k < l2; ++k) {
double acc = 0.0;
for (int l = 0; l < l2; ++l)
acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]);
tmp[k] = acc;
}
double res_acc = 0.0;
for (int k = 0; k < l2; ++k)
res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k];
// gradient: prior part
double g_loc[MAX_L2];
for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k];
// data likelihood: logistic link + dbinom_raw
for (int i = 0; i < l1; ++i) {
double dot = -alpha[i];
for (int k = 0; k < l2; ++k)
dot -= X[k*l1 + i] * B[j*l2 + k];
double e = exp(dot <= 0 ? dot : -dot);
double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e);
double q = 1.0 - p;
res_acc += -dbinom_raw(y[i], wt[i], p, q, 1);
double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i]
: (1-y[i])*wt[i] - q*wt[i];
for (int k = 0; k < l2; ++k)
g_loc[k] += X[k*l1 + i] * resid;
}
qf[j] = res_acc;
for (int k = 0; k < l2; ++k)
grad[k * m1 + j] = g_loc[k];
}Key design choices:
MAX_L2 64 — a compile-time upper bound on the number of
predictors, used to size local arrays (OpenCL C does not support
variable-length arrays in all implementations).X (matches R’s memory layout
for matrices passed from R to C without transposition).j; all
l1 observations are processed sequentially within that
work-item. This is appropriate when m1 (number of grid
points) is much larger than l1 (number of observations), as
is typical for envelope sampling.The kernel runner is the low-level C++ function that manages the
OpenCL lifecycle for this specific kernel. It lives in
ex_glmbayes_kernel_runners.cpp inside the
ex_glmbayes::opencl namespace:
// ex_glmbayes_kernel_runners.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {
void f2_f3_kernel_runner(
const std::string& kernel_source,
const char* kernel_name,
int l1, int l2, int m1,
const std::vector<double>& X_flat,
const std::vector<double>& B_flat,
const std::vector<double>& mu_flat,
const std::vector<double>& P_flat,
const std::vector<double>& alpha_flat,
const std::vector<double>& y_flat,
const std::vector<double>& wt_flat,
std::vector<double>& qf_flat,
std::vector<double>& grad_flat
) {
// Select platform and device
// Create context, command queue, program, kernel
// Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY)
// Set kernel arguments
// Enqueue kernel with global_size = m1
// Read back qf_flat and grad_flat
// Release all resources
}
}} // namespace ex_glmbayes::openclThe runner uses the error-handling utilities from
openclPort.h (opencl_make_context_error,
opencl_status_name) to produce informative exceptions on
failure.
The kernel wrapper is the R-facing [[Rcpp::export]]
function. It validates and converts R inputs, selects the kernel name
based on family/link, assembles the program string using the dependency
index, calls the runner, and returns results as R objects. It lives in
ex_glmbayes_kernel_wrappers.cpp:
// ex_glmbayes_kernel_wrappers.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {
// [[Rcpp::export]]
Rcpp::List f2_f3_opencl(
std::string family, std::string link,
Rcpp::NumericMatrix b, Rcpp::NumericVector y,
Rcpp::NumericMatrix x, Rcpp::NumericMatrix mu,
Rcpp::NumericMatrix P, Rcpp::NumericVector alpha,
Rcpp::NumericVector wt, int progbar)
{
int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();
// 1. Flatten R matrices into contiguous vectors
auto X_flat = openclPort::flattenMatrix(x);
auto B_flat = openclPort::flattenMatrix(b);
// ... (all inputs)
// 2. Dispatch: set kernel_name and kernel_file based on family/link
std::string kernel_name, kernel_file;
if (family == "binomial" && link == "logit") {
kernel_name = "f2_f3_binomial_logit";
kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
} // ... (all families)
// 3. Assemble program source using the pre-built dependency index.
// load_library_for_kernel() reads @depends_nmath from the kernel file,
// looks up the transitive closure in kernel_dependency_index.tsv, and
// loads only the required subset in the correct order --- no sort at runtime.
std::string nmath_src = openclPort::load_library_for_kernel(
kernel_file, // kernel .cl path (relative to inst/cl/)
"ex_glmbayes_nmath", // library subdir containing index + .cl files
"nmathopencl", // package
"depends_nmath" // annotation tag listing direct nmath entry points
);
std::string kernel_src = openclPort::load_kernel_source(
kernel_file, "nmathopencl");
std::string program_source = nmath_src + "\n" + kernel_src;
// 4. Allocate output buffers
std::vector<double> qf_flat(m1);
std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);
// 5. Run the kernel
f2_f3_kernel_runner(program_source, kernel_name.c_str(),
l1, l2, m1,
X_flat, B_flat, mu_flat, P_flat,
alpha_flat, y_flat, wt_flat,
qf_flat, grad_flat);
// 6. Wrap outputs as R objects and return
Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end());
// ... (reshape grad_flat into a matrix)
return Rcpp::List::create(Rcpp::Named("qf") = qf, ...);
}
}} // namespace ex_glmbayes::openclload_library_for_kernel() rather than
load_kernel_library()?load_kernel_library() sorts and concatenates
every file in the subdirectory on every call. For a
137-file library like nmath this requires reading all files
sequentially. load_library_for_kernel() reads the pre-built
kernel_dependency_index.tsv once, then reads only the small
subset of files that the specific kernel actually needs — 3 files for
f2_f3_gaussian, 17 for f2_f3_binomial_logit,
and so on. The savings are significant for large libraries and become
important when the kernel wrapper is called repeatedly.
The old full-library approach is still available for cases where the exact dependency set is unknown or all files are genuinely needed:
The [[Rcpp::export]] annotation on
f2_f3_opencl causes Rcpp::compileAttributes()
to generate:
.Call entry in RcppExports.cppRcppExports.RThe R-side convenience function in ex_glmbayes.R adds a
nmathopencl_has_opencl() guard and CPU fallback:
The CPU fallback uses the same mathematical logic as the kernel but executes in C++ using standard R math calls. The relevant code lives in:
ex_glmbayes_famfuncs_binomial.cpp — binomial family
functionsex_glmbayes_famfuncs_poisson.cpp — Poisson familyex_glmbayes_famfuncs_Gamma.cpp — Gamma familyex_glmbayes_famfuncs_gaussian.cpp — Gaussian
familyex_glmbayes_EnvelopeEval.cpp — orchestrates the CPU
pathex_glmbayes_EnvelopeSize.cpp — memory allocation for
the envelopeThese files are grouped under the ex_glmbayes::fam and
ex_glmbayes::env sub-namespaces. They use R’s standard
<Rmath.h> functions (the same mathematical functions
as the GPU kernels, but via the C runtime rather than OpenCL).
The complete set of files contributing to the
ex_glmbayes example:
inst/cl/)| Directory | Files |
|---|---|
ex_glmbayes_nmath/ |
20 files: the minimal nmath subset |
ex_glmbayes_src/ |
6 files: one f2_f3_*.cl kernel per GLM family |
ex_glmbayes_draft_src/ |
Draft versions (work in progress) |
src/)| File | Namespace | Role |
|---|---|---|
ex_glmbayes_kernel_runners.cpp |
ex_glmbayes::opencl |
Low-level OpenCL runner for the f2_f3 kernels |
ex_glmbayes_kernel_wrappers.cpp |
ex_glmbayes::opencl |
R-facing wrapper: input validation, program assembly, runner dispatch |
ex_glmbayes_export_wrappers.cpp |
(Rcpp export) | [[Rcpp::export]] entry points for
EnvelopeSize, EnvelopeEval,
glmb_Standardize_Model |
ex_glmbayes_famfuncs_binomial.cpp |
ex_glmbayes::fam |
CPU binomial likelihood |
ex_glmbayes_famfuncs_poisson.cpp |
ex_glmbayes::fam |
CPU Poisson likelihood |
ex_glmbayes_famfuncs_Gamma.cpp |
ex_glmbayes::fam |
CPU Gamma likelihood |
ex_glmbayes_famfuncs_gaussian.cpp |
ex_glmbayes::fam |
CPU Gaussian likelihood |
ex_glmbayes_EnvelopeEval.cpp |
ex_glmbayes::env |
CPU envelope evaluation (f2/f3 orchestrator) |
ex_glmbayes_EnvelopeSize.cpp |
ex_glmbayes::env |
Envelope memory sizing |
ex_glmbayes_glmb_Standardize_Model.cpp |
ex_glmbayes::sim |
Design matrix standardization |
src/)| File | Namespace declared |
|---|---|
ex_glmbayes_opencl.h |
ex_glmbayes::opencl |
ex_glmbayes_famfuncs.h |
ex_glmbayes::fam |
ex_glmbayes_Envelopefuncs.h |
ex_glmbayes::env |
ex_glmbayes_simfuncs.h |
ex_glmbayes::sim |
R/)| File | Role |
|---|---|
ex_glmbayes.R |
R-facing API: f2_f3_opencl_R,
f2_f3_non_opencl |
ex_glmbayes_rcpp_wrappers.R |
Thin Rcpp wrappers for the three exported C++ export functions |
To build a similar component in your own package:
Identify your nmath dependencies — write the
@depends_nmath annotation in each of your kernel files
(listing the direct nmath entry points your kernel calls). Then call
extract_library_subset() to copy the exact transitive
closure of files, plus the companion index files, into your package’s
inst/cl/<yourpkg_nmath>/:
nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))
extract_library_subset(
kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
dest_dir = "inst/cl/mypkg_nmath", # must exist
depends_tag = "depends_nmath",
index = idx
)Write your kernel(s) — add
@library_deps, @calls_nmath, and
@all_depends_nmath tags at the top of each .cl
file. Store them in inst/cl/<yourpkg_src>/.
Write a kernel runner — use
openclPort::opencl_dbl_scalar_kernel_runner directly if
your kernel fits the double-scalar layout, or write a custom runner
following the same OpenCL lifecycle pattern as
f2_f3_kernel_runner. Use
openclPort::opencl_make_context_error and
opencl_status_name for error handling.
Write a kernel wrapper — flatten your R inputs
with openclPort::flattenMatrix /
openclPort::copyVector, assemble the program string with
openclPort::load_library_for_kernel() +
openclPort::load_kernel_source(), call the runner, and
return an R-friendly result. The @depends_nmath annotation
in your kernel file drives the subset selection automatically.
Write a CPU fallback — implement the same
computation using standard R math (<Rmath.h>) or
base-R functions. Guard the GPU path with
nmathopencl_has_opencl().
Add LinkingTo: nmathopencl to your
DESCRIPTION file so that openclPort.h is found
at compile time.
Add configure scripts for CRAN safety — this is the step most commonly overlooked.
A package that references -lOpenCL or
CL/cl.h in a static src/Makevars will
fail to compile on CRAN’s build machines (which have no
GPU SDK), and no binary will be produced. The solution is a pair of
configure scripts (configure for Linux/macOS,
configure.win for Windows) that generate
src/Makevars dynamically at install time: if the
SDK is found they write -DUSE_OPENCL and the link flags; if
not they write a CPU-only Makevars. The package always compiles
cleanly.
Copy the generic templates into your package root with:
This writes configure and configure.win to
the current directory, sets configure as executable (on
Linux/macOS), and prints a checklist. The templates honor
OPENCL_HOME / OPENCL_SDK environment variables
so developers with non-standard SDK locations can point to them without
modifying system paths.
The three-entity relationship that the scripts establish:
configure / configure.win
-> detects CL/cl.h + libOpenCL at install time
-> writes -DUSE_OPENCL into Makevars (or omits it for CPU-only)
#ifdef USE_OPENCL (in your C++ source)
-> guards all GPU code; compiles cleanly either way
nmathopencl_has_opencl() (in your R code)
-> calls a compiled-in bool that mirrors the compile-time decision
-> TRUE only if -DUSE_OPENCL was set when the package was built
On Linux the configure script also runs a
runtime probe (clGetPlatformIDs) before
setting USE_OPENCL. This extra check catches the common
case where the ICD loader (libOpenCL.so) is installed but
no vendor platform has been registered in
/etc/OpenCL/vendors/. configure.win (Windows)
detects the header only; the ICD loader (OpenCL.dll) is
installed with the GPU driver and does not require a separate probe.
Add the generated files to .gitignore — they are build
artifacts, not source files:
src/Makevars
src/Makevars.win
For more detail, including the migration plan for when these
templates move to opencltools, see
system.file("configure-templates", "README.md", package = "nmathopencl").
The ex_glmbayes files are the authoritative reference
implementation for steps 1–6; step 7 applies universally to any
downstream package that exposes GPU acceleration.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.