ggml/ggml.c and ggml/ggml-opt.cpp. Both now
include the same #ifdef USING_R macro block that
neutralizes printf, fprintf,
fputs, fflush, stderr, and
stdout. These calls were diagnostic-only and were already
silent at runtime via the installed log callback; now the symbols never
reach the compiled object files either.
Grammar-constrained generation
(edge_grammar_completion()): Force model output to conform
to a GBNF grammar specification. Ensures valid, parseable structured
output (JSON, enums, numbers, etc.) using llama.cpp’s native grammar
sampler.
JSON schema helper
(edge_json_grammar()): Convert a simple R list schema into
a GBNF grammar string. Supports string, number, integer, boolean fields
and enum (character vector) constraints.
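To make the enum constraint concrete, here is a minimal base-R sketch of how a character vector could be turned into a one-rule GBNF grammar. The helper `enum_grammar()` is hypothetical and written only for illustration; it is not the package's implementation of edge_json_grammar().

```r
# Hypothetical helper (not part of the package): build a one-rule GBNF
# grammar whose root only admits the given literal strings.
enum_grammar <- function(choices) {
  # Escape backslashes and double quotes so each choice is a valid GBNF literal.
  escaped <- gsub('(["\\\\])', "\\\\\\1", choices)
  literals <- paste0('"', escaped, '"')
  paste0("root ::= ", paste(literals, collapse = " | "))
}

grammar <- enum_grammar(c("positive", "negative", "neutral"))
cat(grammar, "\n", sep = "")
# root ::= "positive" | "negative" | "neutral"
```

A grammar of this shape leaves the sampler no way to emit anything other than one of the listed alternatives.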
Structured data extraction
(edge_extract()): High-level function that combines prompt
construction with grammar-constrained generation to extract structured
data from text. Returns a parsed R list (requires
jsonlite).
Text classification
(edge_classify()): Classify text into predefined categories
using grammar constraints. Supports single text and batch (vectorized)
classification. Output is guaranteed to be one of the specified
categories.
Text embeddings
(edge_embeddings()): Extract dense vector embeddings from
any loaded model. Returns a numeric matrix (n_texts x n_embd) suitable
for clustering, semantic search, similarity computation, and RAG
pipelines. Supports optional L2 normalization.
Cosine similarity
(edge_similarity(), edge_similarity_matrix()):
Compute pairwise cosine similarity between embedding vectors. The matrix
version computes all-pairs similarity efficiently by multiplying the
L2-normalized embedding matrix by its transpose.
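The all-pairs computation can be sketched in base R. The matrix `emb` below is a made-up stand-in for an embedding matrix (one row per text), and `cosine_matrix()` is a hypothetical helper, not the package's own code.

```r
# Made-up stand-in for an embedding matrix: 3 texts x 4 dimensions.
emb <- rbind(
  c(1, 0, 0, 0),
  c(1, 1, 0, 0),
  c(0, 0, 2, 0)
)

# Hypothetical helper: L2-normalize each row, then a single matrix
# product yields every pairwise cosine similarity at once.
cosine_matrix <- function(x) {
  norms <- sqrt(rowSums(x^2))
  unit <- x / norms      # recycling divides row i by norms[i]
  tcrossprod(unit)       # same as unit %*% t(unit)
}

sim <- cosine_matrix(emb)
print(round(sim, 3))
```

After normalization the diagonal is 1 and orthogonal rows score 0, so the result can be read directly as a similarity table.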
Embedding dimension query
(edge_model_n_embd()): Query the embedding dimension of a
loaded model.
Batch processing (edge_map()):
Apply a prompt template over a vector of texts with progress reporting.
Supports both string templates with a {text} placeholder and
custom prompt functions. An optional grammar constraint produces
structured batch output.
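As a rough illustration of the string-template form, the sketch below fills a {text} placeholder for each input. `fill_template()` is a hypothetical helper; how edge_map() performs the substitution internally is not stated here.

```r
# Hypothetical helper: substitute each text into a "{text}" template.
fill_template <- function(template, texts) {
  vapply(
    texts,
    # fixed = TRUE treats "{text}" as a literal string, not a regex.
    function(txt) sub("{text}", txt, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

prompts <- fill_template(
  "Summarize in one sentence: {text}",
  c("First review.", "Second review.")
)
print(prompts)
```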
Batch extraction
(edge_extract_batch()): Extract structured data from
multiple texts, returning a data frame with one row per input.
RAG document indexing
(edge_index_documents()): Build a semantic embedding index
from a directory of text files or a character vector. Automatic chunking
with configurable size and overlap.
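Chunking with size and overlap can be sketched as a sliding window. The helper below is hypothetical and assumes character-based windows; whether edge_index_documents() counts characters, words, or tokens is not specified in this entry.

```r
# Hypothetical helper: split a string into windows of `size` characters,
# where consecutive windows share `overlap` characters.
chunk_text <- function(text, size = 200L, overlap = 50L) {
  stopifnot(size > 0, overlap >= 0, overlap < size)
  n <- nchar(text)
  if (n == 0) return(character(0))
  step <- size - overlap
  # Stop early enough that the last window is not pure overlap.
  starts <- seq(1, max(1, n - overlap), by = step)
  substring(text, starts, pmin(starts + size - 1, n))
}

chunks <- chunk_text("abcdefghijklmnopqrstuvwxyz", size = 10, overlap = 3)
print(chunks)
# "abcdefghij" "hijklmnopq" "opqrstuvwx" "vwxyz"
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.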
RAG semantic search
(edge_search()): Find the most relevant text chunks for a
query using cosine similarity over the embedding index.
RAG question answering
(edge_ask()): Retrieval-augmented generation that retrieves
relevant context from an index and generates a grounded answer. Supports
custom system prompts and optional context return for
debugging/transparency.
Plumber API server (edge_serve()):
Serve a model as a local OpenAI-compatible REST API. Endpoints:
/v1/completions, /v1/chat/completions,
/v1/embeddings, /v1/models,
/health. Supports optional API key authentication and CORS.
Requires plumber.
Qwen3 model family in
edge_list_models(): Added Qwen3-0.6B, 1.7B, 4B, and 8B
pre-configured entries from the unsloth GGUF repository.
Friendly names in
edge_download_model(): Now accepts model names
from edge_list_models() (e.g.,
edge_download_model("Qwen3-0.6B")) in addition to
HuggingFace repo IDs. Filename is auto-resolved from the model
registry.
httr download fallback:
.robust_download() now tries httr::GET before
R’s download.file, improving reliability on corporate
networks with custom SSL certificates or proxy configurations.
SIMD optimization warning: On package load,
warns if running without SIMD (generic mode) and suggests reinstalling
from source with EDGEMODELR_SIMD=NATIVE for faster
inference.
Fixed grammar-constrained generation failures
(issue #41): edge_grammar_completion(),
edge_extract(), and edge_extract_batch() were
unusable due to two bugs. First, edge_json_grammar()
emitted rule names like field_1 containing underscores,
which llama.cpp’s grammar parser rejects (only [a-zA-Z0-9-]
is allowed in rule identifiers). Renamed to field-1.
Second, llama_sampler_accept() throws “Unexpected empty
grammar stack” when a token fully satisfies the grammar; the binding now
catches this and terminates cleanly, same as end-of-generation
handling.
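The rule-name constraint is easy to demonstrate. The sanitizer below is a hypothetical sketch of the renaming idea (it is not the package's code): every character outside [a-zA-Z0-9-] is mapped to a hyphen.

```r
# Hypothetical sanitizer: replace any character that the grammar parser
# would reject in a rule identifier with "-".
sanitize_rule_name <- function(name) {
  gsub("[^a-zA-Z0-9-]", "-", name)
}

rule_names <- sanitize_rule_name(c("field_1", "user.name", "root"))
print(rule_names)
# "field-1" "user-name" "root"
```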
Fixed crash from silent context size override
(issue #40 item 11): Removed the auto-reduction of n_ctx
for small models that silently changed the user’s requested context
size. This caused segfaults when prompts exceeded the reduced context.
Context is now used as-is. Minimum n_ctx lowered from 512
to 128 for short-task use cases.
Fixed prompt echo in completion output (issue
#40 item 1): edge_completion() previously returned
prompt + generated_text. Now returns only the generated
text, matching user expectations.
Added prompt length validation: All completion
functions now validate that the tokenized prompt fits within the model’s
context window before calling llama_decode(). Exceeding the
context now raises a clear R error instead of crashing the
process.
Model-native chat templates (issue #40 item 7):
New edge_chat_completion() function reads the model’s chat
template from GGUF metadata (via llama_chat_apply_template)
and formats messages correctly for each model architecture (ChatML,
Llama, Gemma, etc.). build_chat_prompt() updated to accept
an optional ctx parameter for native template formatting,
with ChatML as the generic fallback (replacing the old
Human:/Assistant: format).
edge_classify(ctx, text, c("positive", "negative", "neutral"))
edge_extract(ctx, text, list(name = "string", role = "string"))
edge_install_cuda() and edge_install_cuda_toolkit() functions set up
GPU inference automatically.
edge_install_cuda() downloads the matching ggml-cuda dynamic backend
from llama.cpp releases and extracts the companion ggml-base.dll /
ggml.dll runtime libraries.
edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from the
Windows DriverStore (already on any NVIDIA-driver machine, no download
required) and fetches cublas64 / cublasLt64 from NVIDIA’s redistrib
server.
edge_reload_cuda() activates the CUDA backend in the current R session
without restarting R.
edge_cuda_info() reports whether CUDA is installed and active.
Pass n_gpu_layers = -1L to edge_load_model() for full GPU offload.
[…] std::regex to spend 40+ minutes in exponential backtracking. Added
a hand-written fast path unicode_regex_split_custom_qwen2() in
unicode.cpp, matching the logic of the existing llama-3 fast path.
Qwen3-14B now loads in 0.3 s on CPU (3.4 s on GPU including VRAM
transfer). Covers QWEN2 and QWEN3.5 variants.
Replaces abort() in ggml_abort() with raise(SIGABRT) under
#ifdef USING_R; replaces abort() token in ggml.cpp with
std::terminate().
Wraps the ggml_print_backtrace() body and fflush(stdout) /
fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove
_Exit, stdout, and stderr symbol references from ggml.o on macOS.
Added #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH,
CPU_ZERO, pthread_setaffinity_np on Linux).
CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both Makevars
and Makevars.win.
-fno-builtin-printf added to GGML_CFLAGS to suppress printf → puts
optimizations.
edge_install_cuda, edge_install_cuda_toolkit, edge_reload_cuda,
edge_cuda_info.
Flash attention support: Enabled by default in
edge_load_model() via flash_attn = TRUE.
Reduces memory usage and improves attention computation speed on
CPU.
Full hardware thread utilization: Removed the
4-thread cap for small contexts. edge_load_model() now uses
all available CPU threads by default, with n_threads_batch
set to max for prompt processing.
User-configurable threading: New
n_threads parameter in edge_load_model()
allows explicit control over CPU thread count. Pass NULL
(default) for auto-detect or an integer to limit cores.
Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
Compiler auto-vectorization: Added
-ftree-vectorize to GGML compilation flags on all
platforms, allowing GCC/Clang to generate SIMD instructions for eligible
loops beyond the hand-tuned GGML kernels.
SIMD-optimized build system: Replaced generic
scalar fallback with architecture-aware SIMD detection in both
Makevars (Unix) and Makevars.win (Windows)
User-configurable SIMD levels: Set
EDGEMODELR_SIMD environment variable before install to
select optimization level:
GENERIC: Scalar fallback (maximum compatibility)
SSE42: SSE4.2 baseline (default on x86_64)
AVX: AVX + F16C (Intel Sandy Bridge 2011+)
AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
AVX512: AVX-512 (Intel Skylake-X 2017+)
NATIVE: Uses -march=native for maximum performance on the build machine
edge_simd_info(): New function to
query compile-time SIMD status including architecture, compiler
features, and GGML optimization flags
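Selecting one of the SIMD levels above might look like the following sketch. The package name "edgemodelr" is an assumption inferred from the EDGEMODELR_SIMD variable name, and the choice of AVX2 is only an example.

```r
# Set the level before installing; the build started by install.packages()
# inherits the environment of this R session.
Sys.setenv(EDGEMODELR_SIMD = "AVX2")

# Assumed package name; type = "source" forces a local compile so that
# the SIMD setting is actually used.
install.packages("edgemodelr", type = "source")
```

The setting only matters at compile time, so a prebuilt binary would not be affected by it.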
x86 architecture-specific quantization: Enabled
optimized x86 quantization kernels (arch/x86/quants.c,
arch/x86/repack.cpp) with SIMD-accelerated dot products and
matrix operations
Fixed donttest examples: Changed
resource-intensive examples from \donttest{} to
\dontrun{} to prevent downloading multi-GB models during
CRAN checks
Fixed M1 Mac compiler warnings: Added explicit
static_cast<> for:
double to float conversions for temperature/top_p parameters
size_type to int32_t conversions for buffer size parameters
Fixed connection handling: Replaced
on.exit() with tryCatch/finally for proper
connection cleanup in loops (thanks @eddelbuettel)
edge_small_model_config() function provides optimized
settings for small models (1B-3B parameters)
edge_find_ollama_models() - Discover all locally available Ollama
models across platforms (Windows, macOS, Linux)
edge_load_ollama_model() - Load Ollama models using convenient SHA-256
hash prefixes instead of full file paths
test_ollama_model_compatibility() - Built-in compatibility testing for
Ollama models
[…] std::filesystem on macOS builds
Replaced <mach-o/dyld.h> inclusion with direct function declarations
to avoid enum conflicts
[…] (-march=native, -mtune=native, etc.) from Makevars for CRAN
compatibility
edge_clean_cache() function
edge_load_model() - Load GGUF model files for inference
edge_completion() - Generate text completions
edge_stream_completion() - Stream text generation with real-time
callbacks
edge_chat_stream() - Interactive chat session with streaming responses
edge_free_model() - Memory management and cleanup
is_valid_model() - Model context validation
edge_list_models() - List pre-configured popular models
edge_download_model() - Download models from Hugging Face Hub
edge_quick_setup() - One-line model download and setup
This release provides a complete, production-ready solution for local
large language model inference in R, enabling private, offline text
generation workflows.