Chapter 10: Case Study — Building Custom GLM Kernels (ex

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Chapter 10: Case Study — Building Custom GLM Kernels (ex_glmbayes)

Kjell Nygren

2026-06-11

Overview

Chapters 03-10 describe the infrastructure of nmathopencl: the shim layer, the nmath library, the kernel format, and the runner API. This chapter shows how to put all of it together by walking through a concrete downstream use-case: the ex_glmbayes example built into the package.

ex_glmbayes implements GPU-accelerated evaluation of the log-posterior and gradient (f2/f3) for Bayesian GLMs (Gaussian, Gamma, binomial with logit/probit/cloglog links, and Poisson). It is the type of component a package author would build on top of nmathopencl to accelerate their own likelihood computations. The name ex_glmbayes signals that this is an example, not a core part of the library; it lives entirely in ex_glmbayes_*-prefixed files and within the ex_glmbayes C++ namespace.

Step 1 — Identify the nmath functions your kernel needs

For each GLM family, determine which nmath functions the GPU kernel must call:

Kernel	nmath function(s) called	Link function
`f2_f3_binomial_logit.cl`	`dbinom_raw`	logit
`f2_f3_binomial_probit.cl`	`dbinom_raw`, `pnorm5`, `dnorm4`	probit
`f2_f3_binomial_cloglog.cl`	`dbinom_raw`	cloglog
`f2_f3_gamma.cl`	`dgamma`	log
`f2_f3_gaussian.cl`	`dnorm4`	identity/log
`f2_f3_poisson.cl`	`lgamma` (OpenCL built-in)	log

For f2_f3_poisson, lgamma is an OpenCL built-in (part of the OpenCL 1.1 math specification) so no nmath files are needed at all. For the others, consult the @all_depends_nmath tag of each kernel file to get the full transitive dependency list.

Step 2 — Extract the minimal nmath subset

The @all_depends_nmath tag in each kernel file gives the complete transitive closure. Taking the union across the five non-Poisson kernels yields 20 files (17 nmath functions + Rmath, nmath, refactored as infrastructure):

Rmath, nmath, refactored,
chebyshev, cospi, fmax2, gammalims, lgammacor,
gamma, lgamma,
stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0,
dpois, dgamma, dnorm, pnorm, dbinom

These files are stored in inst/cl/ex_glmbayes_nmath/. Rather than copying them manually, use extract_library_subset() to derive this directory automatically from the kernel annotations:

nmath_dir <- system.file("cl", "nmath",           package = "nmathopencl")
src_dir   <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl")

# Load the pre-built dependency index (read once, reuse across calls)
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))

result <- extract_library_subset(
    kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE),
    library_dir  = nmath_dir,
    dest_dir     = "inst/cl/ex_glmbayes_nmath",
    depends_tag  = "depends_nmath",
    index        = idx
)

extract_library_subset() copies the 20 .cl files plus both index files (kernel_dependency_index.rds and kernel_dependency_index.tsv) into the destination directory. The index files are required for the indexed loading described in Step 5.

A package developer would create an analogous subdirectory for their own kernel set using the same function.

Step 3 — Write the kernel files

Each kernel file in inst/cl/ex_glmbayes_src/ starts with the dependency metadata block followed by the kernel body:

// @library_deps: nmath
// @calls_nmath: dbinom_raw
// @depends_nmath: dbinom
// @all_depends_nmath_count: 17
// @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2,
//   gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils,
//   stirlerr_cycle_dependent, stirlerr, bd0, dbinom
// @calls_opencl_builtin: (none)

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_printf : enable

#define MAX_L2 64   // upper bound on l2

__kernel void f2_f3_binomial_logit(
    __global const double* X,      // design matrix,  size = l1 x l2, col-major
    __global const double* B,      // grid,            size = m1 x l2
    __global const double* mu,     // prior mean,      length = l2
    __global const double* P,      // prior precision, size = l2 x l2
    __global const double* alpha,  // offset,          length = l1
    __global const double* y,      // response,        length = l1
    __global const double* wt,     // weights,         length = l1
    __global double*       qf,     // out: neg log-posterior, length = m1
    __global double*       grad,   // out: gradient,   size = m1 x l2
    const int l1,
    const int l2,
    const int m1
) {
    int j = get_global_id(0);
    if (j >= m1) return;

    // prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu)
    double tmp[MAX_L2];
    for (int k = 0; k < l2; ++k) {
        double acc = 0.0;
        for (int l = 0; l < l2; ++l)
            acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]);
        tmp[k] = acc;
    }
    double res_acc = 0.0;
    for (int k = 0; k < l2; ++k)
        res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k];

    // gradient: prior part
    double g_loc[MAX_L2];
    for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k];

    // data likelihood: logistic link + dbinom_raw
    for (int i = 0; i < l1; ++i) {
        double dot = -alpha[i];
        for (int k = 0; k < l2; ++k)
            dot -= X[k*l1 + i] * B[j*l2 + k];
        double e = exp(dot <= 0 ? dot : -dot);
        double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e);
        double q = 1.0 - p;
        res_acc += -dbinom_raw(y[i], wt[i], p, q, 1);
        double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i]
                                : (1-y[i])*wt[i] - q*wt[i];
        for (int k = 0; k < l2; ++k)
            g_loc[k] += X[k*l1 + i] * resid;
    }

    qf[j] = res_acc;
    for (int k = 0; k < l2; ++k)
        grad[k * m1 + j] = g_loc[k];
}

Key design choices:

MAX_L2 64 — a compile-time upper bound on the number of predictors, used to size local arrays (OpenCL C does not support variable-length arrays in all implementations).
Column-major layout for X (matches R’s memory layout for matrices passed from R to C without transposition).
A single work-item per grid point j; all l1 observations are processed sequentially within that work-item. This is appropriate when m1 (number of grid points) is much larger than l1 (number of observations), as is typical for envelope sampling.

Step 4 — Write the kernel runner (C++)

The kernel runner is the low-level C++ function that manages the OpenCL lifecycle for this specific kernel. It lives in ex_glmbayes_kernel_runners.cpp inside the ex_glmbayes::opencl namespace:

// ex_glmbayes_kernel_runners.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

void f2_f3_kernel_runner(
    const std::string& kernel_source,
    const char*        kernel_name,
    int l1, int l2, int m1,
    const std::vector<double>& X_flat,
    const std::vector<double>& B_flat,
    const std::vector<double>& mu_flat,
    const std::vector<double>& P_flat,
    const std::vector<double>& alpha_flat,
    const std::vector<double>& y_flat,
    const std::vector<double>& wt_flat,
    std::vector<double>& qf_flat,
    std::vector<double>& grad_flat
) {
    // Select platform and device
    // Create context, command queue, program, kernel
    // Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY)
    // Set kernel arguments
    // Enqueue kernel with global_size = m1
    // Read back qf_flat and grad_flat
    // Release all resources
}

}} // namespace ex_glmbayes::opencl

The runner uses the error-handling utilities from openclPort.h (opencl_make_context_error, opencl_status_name) to produce informative exceptions on failure.

Step 5 — Write the kernel wrapper (C++)

The kernel wrapper is the R-facing [[Rcpp::export]] function. It validates and converts R inputs, selects the kernel name based on family/link, assembles the program string using the dependency index, calls the runner, and returns results as R objects. It lives in ex_glmbayes_kernel_wrappers.cpp:

// ex_glmbayes_kernel_wrappers.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

// [[Rcpp::export]]
Rcpp::List f2_f3_opencl(
    std::string family, std::string link,
    Rcpp::NumericMatrix b,  Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt, int progbar)
{
    int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();

    // 1. Flatten R matrices into contiguous vectors
    auto X_flat  = openclPort::flattenMatrix(x);
    auto B_flat  = openclPort::flattenMatrix(b);
    // ... (all inputs)

    // 2. Dispatch: set kernel_name and kernel_file based on family/link
    std::string kernel_name, kernel_file;
    if (family == "binomial" && link == "logit") {
        kernel_name = "f2_f3_binomial_logit";
        kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
    } // ... (all families)

    // 3. Assemble program source using the pre-built dependency index.
    //    load_library_for_kernel() reads @depends_nmath from the kernel file,
    //    looks up the transitive closure in kernel_dependency_index.tsv, and
    //    loads only the required subset in the correct order --- no sort at runtime.
    std::string nmath_src = openclPort::load_library_for_kernel(
        kernel_file,          // kernel .cl path (relative to inst/cl/)
        "ex_glmbayes_nmath",  // library subdir containing index + .cl files
        "nmathopencl",        // package
        "depends_nmath"       // annotation tag listing direct nmath entry points
    );
    std::string kernel_src = openclPort::load_kernel_source(
        kernel_file, "nmathopencl");
    std::string program_source = nmath_src + "\n" + kernel_src;

    // 4. Allocate output buffers
    std::vector<double> qf_flat(m1);
    std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);

    // 5. Run the kernel
    f2_f3_kernel_runner(program_source, kernel_name.c_str(),
                        l1, l2, m1,
                        X_flat, B_flat, mu_flat, P_flat,
                        alpha_flat, y_flat, wt_flat,
                        qf_flat, grad_flat);

    // 6. Wrap outputs as R objects and return
    Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end());
    // ... (reshape grad_flat into a matrix)
    return Rcpp::List::create(Rcpp::Named("qf") = qf, ...);
}

}} // namespace ex_glmbayes::opencl

Why `load_library_for_kernel()` rather than `load_kernel_library()`?

load_kernel_library() sorts and concatenates every file in the subdirectory on every call. For a 137-file library like nmath this requires reading all files sequentially. load_library_for_kernel() reads the pre-built kernel_dependency_index.tsv once, then reads only the small subset of files that the specific kernel actually needs — 3 files for f2_f3_gaussian, 17 for f2_f3_binomial_logit, and so on. The savings are significant for large libraries and become important when the kernel wrapper is called repeatedly.

The old full-library approach is still available for cases where the exact dependency set is unknown or all files are genuinely needed:

// Old approach --- loads all files in the subdirectory (kept for reference)
// std::string nmath_src = openclPort::load_kernel_library(
//     "ex_glmbayes_nmath", "nmathopencl", false);

Step 6 — Write the Rcpp export wrappers (C++ and R)

The [[Rcpp::export]] annotation on f2_f3_opencl causes Rcpp::compileAttributes() to generate:

A .Call entry in RcppExports.cpp
An R wrapper in RcppExports.R

The R-side convenience function in ex_glmbayes.R adds a nmathopencl_has_opencl() guard and CPU fallback:

f2_f3_opencl_R <- function(family, link, b, y, x, mu, P, alpha, wt,
                             use_opencl = TRUE) {
  if (use_opencl && nmathopencl_has_opencl()) {
    f2_f3_opencl(family, link, b, y, x, mu, P, alpha, wt, progbar = 0L)
  } else {
    f2_f3_non_opencl(family, link, b, y, x, mu, P, alpha, wt)
  }
}

Step 7 — Implement the CPU fallback

The CPU fallback uses the same mathematical logic as the kernel but executes in C++ using standard R math calls. The relevant code lives in:

ex_glmbayes_famfuncs_binomial.cpp — binomial family functions
ex_glmbayes_famfuncs_poisson.cpp — Poisson family
ex_glmbayes_famfuncs_Gamma.cpp — Gamma family
ex_glmbayes_famfuncs_gaussian.cpp — Gaussian family
ex_glmbayes_EnvelopeEval.cpp — orchestrates the CPU path
ex_glmbayes_EnvelopeSize.cpp — memory allocation for the envelope

These files are grouped under the ex_glmbayes::fam and ex_glmbayes::env sub-namespaces. They use R’s standard <Rmath.h> functions (the same mathematical functions as the GPU kernels, but via the C runtime rather than OpenCL).

File inventory

The complete set of files contributing to the ex_glmbayes example:

OpenCL source (`inst/cl/`)

Directory	Files
`ex_glmbayes_nmath/`	20 files: the minimal nmath subset
`ex_glmbayes_src/`	6 files: one `f2_f3_*.cl` kernel per GLM family
`ex_glmbayes_draft_src/`	Draft versions (work in progress)

C++ source (`src/`)

File	Namespace	Role
`ex_glmbayes_kernel_runners.cpp`	`ex_glmbayes::opencl`	Low-level OpenCL runner for the `f2_f3` kernels
`ex_glmbayes_kernel_wrappers.cpp`	`ex_glmbayes::opencl`	R-facing wrapper: input validation, program assembly, runner dispatch
`ex_glmbayes_export_wrappers.cpp`	(Rcpp export)	`[[Rcpp::export]]` entry points for `EnvelopeSize`, `EnvelopeEval`, `glmb_Standardize_Model`
`ex_glmbayes_famfuncs_binomial.cpp`	`ex_glmbayes::fam`	CPU binomial likelihood
`ex_glmbayes_famfuncs_poisson.cpp`	`ex_glmbayes::fam`	CPU Poisson likelihood
`ex_glmbayes_famfuncs_Gamma.cpp`	`ex_glmbayes::fam`	CPU Gamma likelihood
`ex_glmbayes_famfuncs_gaussian.cpp`	`ex_glmbayes::fam`	CPU Gaussian likelihood
`ex_glmbayes_EnvelopeEval.cpp`	`ex_glmbayes::env`	CPU envelope evaluation (f2/f3 orchestrator)
`ex_glmbayes_EnvelopeSize.cpp`	`ex_glmbayes::env`	Envelope memory sizing
`ex_glmbayes_glmb_Standardize_Model.cpp`	`ex_glmbayes::sim`	Design matrix standardization

C++ headers (`src/`)

File	Namespace declared
`ex_glmbayes_opencl.h`	`ex_glmbayes::opencl`
`ex_glmbayes_famfuncs.h`	`ex_glmbayes::fam`
`ex_glmbayes_Envelopefuncs.h`	`ex_glmbayes::env`
`ex_glmbayes_simfuncs.h`	`ex_glmbayes::sim`

R source (`R/`)

File	Role
`ex_glmbayes.R`	R-facing API: `f2_f3_opencl_R`, `f2_f3_non_opencl`
`ex_glmbayes_rcpp_wrappers.R`	Thin Rcpp wrappers for the three exported C++ export functions

Adapting this pattern for a new package

To build a similar component in your own package:

Identify your nmath dependencies — write the @depends_nmath annotation in each of your kernel files (listing the direct nmath entry points your kernel calls). Then call extract_library_subset() to copy the exact transitive closure of files, plus the companion index files, into your package’s inst/cl/<yourpkg_nmath>/:

nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))

extract_library_subset(
    kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE),
    library_dir  = nmath_dir,
    dest_dir     = "inst/cl/mypkg_nmath",  # must exist
    depends_tag  = "depends_nmath",
    index        = idx
)

Write your kernel(s) — add @library_deps, @calls_nmath, and @all_depends_nmath tags at the top of each .cl file. Store them in inst/cl/<yourpkg_src>/.
Write a kernel runner — use openclPort::opencl_dbl_scalar_kernel_runner directly if your kernel fits the double-scalar layout, or write a custom runner following the same OpenCL lifecycle pattern as f2_f3_kernel_runner. Use openclPort::opencl_make_context_error and opencl_status_name for error handling.
Write a kernel wrapper — flatten your R inputs with openclPort::flattenMatrix / openclPort::copyVector, assemble the program string with openclPort::load_library_for_kernel() + openclPort::load_kernel_source(), call the runner, and return an R-friendly result. The @depends_nmath annotation in your kernel file drives the subset selection automatically.
Write a CPU fallback — implement the same computation using standard R math (<Rmath.h>) or base-R functions. Guard the GPU path with nmathopencl_has_opencl().
Add LinkingTo: nmathopencl to your DESCRIPTION file so that openclPort.h is found at compile time.
Add configure scripts for CRAN safety — this is the step most commonly overlooked.

A package that references -lOpenCL or CL/cl.h in a static src/Makevars will fail to compile on CRAN’s build machines (which have no GPU SDK), and no binary will be produced. The solution is a pair of configure scripts (configure for Linux/macOS, configure.win for Windows) that generate src/Makevars dynamically at install time: if the SDK is found they write -DUSE_OPENCL and the link flags; if not they write a CPU-only Makevars. The package always compiles cleanly.

Copy the generic templates into your package root with:
```
use_opencl_configure()
```
This writes configure and configure.win to the current directory, sets configure as executable (on Linux/macOS), and prints a checklist. The templates honor OPENCL_HOME / OPENCL_SDK environment variables so developers with non-standard SDK locations can point to them without modifying system paths.

The three-entity relationship that the scripts establish:
```
configure / configure.win
  -> detects CL/cl.h + libOpenCL at install time
  -> writes -DUSE_OPENCL into Makevars   (or omits it for CPU-only)

#ifdef USE_OPENCL  (in your C++ source)
  -> guards all GPU code; compiles cleanly either way

nmathopencl_has_opencl()  (in your R code)
  -> calls a compiled-in bool that mirrors the compile-time decision
  -> TRUE only if -DUSE_OPENCL was set when the package was built
```
On Linux the configure script also runs a runtime probe (clGetPlatformIDs) before setting USE_OPENCL. This extra check catches the common case where the ICD loader (libOpenCL.so) is installed but no vendor platform has been registered in /etc/OpenCL/vendors/. configure.win (Windows) detects the header only; the ICD loader (OpenCL.dll) is installed with the GPU driver and does not require a separate probe.

Add the generated files to .gitignore — they are build artifacts, not source files:
```
src/Makevars
src/Makevars.win
```
For more detail, including the migration plan for when these templates move to opencltools, see system.file("configure-templates", "README.md", package = "nmathopencl").

The ex_glmbayes files are the authoritative reference implementation for steps 1–6; step 7 applies universally to any downstream package that exposes GPU acceleration.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.