edgemodelr 0.4.1

CRAN Resubmission Fixes

Stderr references in compiled objects (CRAN auto-check NOTE on Debian): the previous CRAN cleanup (commit d8870bd) added stdio suppression to 7 upstream files but missed ggml/ggml.c and ggml/ggml-opt.cpp. Both now include the same #ifdef USING_R macro block that neutralizes printf, fprintf, fputs, fflush, stderr, and stdout. These calls were diagnostic-only and were already silent at runtime via the installed log callback; now the symbols never reach the compiled object files either.

edgemodelr 0.4.0

Structured Output, Embeddings, RAG, and API Server

New Features

Grammar-constrained generation (edge_grammar_completion()): Force model output to conform to a GBNF grammar specification. Ensures valid, parseable structured output (JSON, enums, numbers, etc.) using llama.cpp’s native grammar sampler.
JSON schema helper (edge_json_grammar()): Convert a simple R list schema into a GBNF grammar string. Supports string, number, integer, boolean fields and enum (character vector) constraints.
Structured data extraction (edge_extract()): High-level function that combines prompt construction with grammar-constrained generation to extract structured data from text. Returns a parsed R list (requires jsonlite).
Text classification (edge_classify()): Classify text into predefined categories using grammar constraints. Supports single text and batch (vectorized) classification. Output is guaranteed to be one of the specified categories.
Text embeddings (edge_embeddings()): Extract dense vector embeddings from any loaded model. Returns a numeric matrix (n_texts x n_embd) suitable for clustering, semantic search, similarity computation, and RAG pipelines. Supports optional L2 normalization.
Cosine similarity (edge_similarity(), edge_similarity_matrix()): Compute pairwise cosine similarity between embedding vectors. Matrix version efficiently computes all-pairs similarity using normalized matrix multiply.
Embedding dimension query (edge_model_n_embd()): Query the embedding dimension of a loaded model.
Batch processing (edge_map()): Apply a prompt template over a vector of texts with progress reporting. Supports both string templates with {text} placeholder and custom prompt functions. Optional grammar constraint for structured batch output.
Batch extraction (edge_extract_batch()): Extract structured data from multiple texts, returning a data frame with one row per input.
RAG document indexing (edge_index_documents()): Build a semantic embedding index from a directory of text files or a character vector. Automatic chunking with configurable size and overlap.
RAG semantic search (edge_search()): Find the most relevant text chunks for a query using cosine similarity over the embedding index.
RAG question answering (edge_ask()): Retrieval-augmented generation that retrieves relevant context from an index and generates a grounded answer. Supports custom system prompts and optional context return for debugging/transparency.
Plumber API server (edge_serve()): Serve a model as a local OpenAI-compatible REST API. Endpoints: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /health. Supports optional API key authentication and CORS. Requires plumber.
Qwen3 model family in edge_list_models(): Added Qwen3-0.6B, 1.7B, 4B, and 8B pre-configured entries from the unsloth GGUF repository.
Friendly names in edge_download_model(): Now accepts model names from edge_list_models() (e.g., edge_download_model("Qwen3-0.6B")) in addition to HuggingFace repo IDs. Filename is auto-resolved from the model registry.
httr download fallback: .robust_download() now tries httr::GET before R’s download.file, improving reliability on corporate networks with custom SSL certificates or proxy configurations.
SIMD optimization warning: On package load, warns if running without SIMD (generic mode) and suggests reinstalling from source with EDGEMODELR_SIMD=NATIVE for faster inference.
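
The chunking with configurable size and overlap mentioned for edge_index_documents() can be pictured with a small base-R sketch. This is not the package's implementation: chunk_text(), its word-based splitting, and the size and overlap parameter names are assumptions made for illustration, and the real chunker may count characters or tokens instead.

```r
# Hypothetical helper: split text into overlapping word-based chunks.
# Not part of edgemodelr; shown only to illustrate size/overlap chunking.
chunk_text <- function(text, size = 5L, overlap = 2L) {
  stopifnot(size > 0L, overlap >= 0L, overlap < size)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(words)
  if (n == 0L) return(character(0))
  step <- size - overlap
  # Start a new chunk every `step` words; stop once the tail is covered.
  starts <- seq(1L, max(1L, n - overlap), by = step)
  vapply(starts, function(s) {
    paste(words[s:min(s + size - 1L, n)], collapse = " ")
  }, character(1))
}

chunks <- chunk_text("one two three four five six seven eight",
                     size = 4L, overlap = 2L)
print(chunks)
```

Each chunk shares its last two words with the start of the next one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.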

Bug Fixes

Fixed grammar-constrained generation failures (issue #41): edge_grammar_completion(), edge_extract(), and edge_extract_batch() were unusable due to two bugs. First, edge_json_grammar() emitted rule names like field_1 containing underscores, which llama.cpp’s grammar parser rejects (only [a-zA-Z0-9-] is allowed in rule identifiers). Renamed to field-1. Second, llama_sampler_accept() throws “Unexpected empty grammar stack” when a token fully satisfies the grammar; the binding now catches this and terminates cleanly, same as end-of-generation handling.
Fixed crash from silent context size override (issue #40 item 11): Removed the auto-reduction of n_ctx for small models that silently changed the user’s requested context size. This caused segfaults when prompts exceeded the reduced context. Context is now used as-is. Minimum n_ctx lowered from 512 to 128 for short-task use cases.
Fixed prompt echo in completion output (issue #40 item 1): edge_completion() previously returned prompt + generated_text. Now returns only the generated text, matching user expectations.
Added prompt length validation: All completion functions now validate that the tokenized prompt fits within the model’s context window before calling llama_decode(). Exceeding the context now raises a clear R error instead of crashing the process.
Model-native chat templates (issue #40 item 7): New edge_chat_completion() function reads the model’s chat template from GGUF metadata (via llama_chat_apply_template) and formats messages correctly for each model architecture (ChatML, Llama, Gemma, etc.). build_chat_prompt() updated to accept an optional ctx parameter for native template formatting, with ChatML as the generic fallback (replacing the old Human:/Assistant: format).
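
The rule-name constraint behind the first grammar bug above can be checked with a short base-R sketch. to_gbnf_rule_name() and is_valid_rule_name() are hypothetical helpers, not edgemodelr functions; they only illustrate mapping names onto the [a-zA-Z0-9-] character set quoted in that entry.

```r
# Hypothetical helper: make a name safe for use as a GBNF rule identifier
# by replacing every character outside [a-zA-Z0-9-] with a hyphen.
to_gbnf_rule_name <- function(x) {
  gsub("[^a-zA-Z0-9-]", "-", x)
}

# Hypothetical check: TRUE when a name uses only the allowed characters.
is_valid_rule_name <- function(x) grepl("^[a-zA-Z0-9-]+$", x)

old_names <- c("field_1", "field_2")
new_names <- to_gbnf_rule_name(old_names)

print(new_names)
print(is_valid_rule_name(old_names))
print(is_valid_rule_name(new_names))
```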

Use Cases Unlocked

Sentiment analysis: edge_classify(ctx, text, c("positive", "negative", "neutral"))
Entity extraction: edge_extract(ctx, text, list(name = "string", role = "string"))
Data labeling: Batch classify thousands of rows with guaranteed valid labels
Semantic search: Embed documents and queries, find nearest neighbors
Document clustering: Compute similarity matrices, feed to hclust/kmeans
RAG foundations: Embed corpus, retrieve relevant context for generation
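
As a rough illustration of the document clustering use case, the sketch below computes an all-pairs cosine similarity matrix with a normalized matrix multiply and feeds it to hclust. It uses only base R and made-up toy vectors; in practice the embedding matrix would come from edge_embeddings() and the similarity matrix could come from edge_similarity_matrix(), whose internals may differ from what is shown here.

```r
# Toy embeddings: 4 "documents" x 3 dimensions (made-up numbers).
emb <- rbind(
  doc_a = c(1.0, 0.0, 0.0),
  doc_b = c(0.9, 0.1, 0.0),
  doc_c = c(0.0, 1.0, 0.0),
  doc_d = c(0.0, 0.9, 0.1)
)

# L2-normalize each row; cosine similarity is then a plain matrix product.
unit <- emb / sqrt(rowSums(emb^2))
sim <- unit %*% t(unit)

# Turn similarity into a distance and cluster hierarchically.
tree <- hclust(as.dist(1 - sim), method = "average")
groups <- cutree(tree, k = 2)

print(round(sim, 3))
print(groups)
```

Using 1 - similarity as the distance is one common choice, not the only one; kmeans on the normalized rows is a reasonable alternative when the number of clusters is known.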

edgemodelr 0.3.0

CUDA GPU Support and Qwen3 Tokenizer Fix

New Features

CUDA GPU acceleration (Windows): New edge_install_cuda() and edge_install_cuda_toolkit() functions set up GPU inference automatically.
- edge_install_cuda() downloads the matching ggml-cuda dynamic backend from llama.cpp releases and extracts the companion ggml-base.dll / ggml.dll runtime libraries.
- edge_install_cuda_toolkit() copies nvcudart_hybrid64.dll from the Windows DriverStore (already on any NVIDIA-driver machine, no download required) and fetches cublas64 / cublasLt64 from NVIDIA’s redistrib server.
- edge_reload_cuda() activates the CUDA backend in the current R session without restarting R.
- edge_cuda_info() reports whether CUDA is installed and active.
- Pass n_gpu_layers = -1L to edge_load_model() for full GPU offload.
- Tested on NVIDIA RTX 5070 Ti (Blackwell sm_120, CUDA 13.1, 12 GB VRAM): Qwen3-14B loads in 3.4 s with full VRAM offload.
Updated llama.cpp to build b8179 (GGML 0.9.7): Brings all upstream model architecture updates, sampler improvements, and quantization fixes.

Bug Fixes

Qwen3 / QWEN2 tokenizer 40-minute load time (8000× speedup): The QWEN2 byte-level regex pattern caused GCC’s std::regex to spend 40+ minutes in exponential backtracking. Added a hand-written fast path unicode_regex_split_custom_qwen2() in unicode.cpp, matching the logic of the existing llama-3 fast path. Qwen3-14B now loads in 0.3 s on CPU (3.4 s on GPU including VRAM transfer). Covers QWEN2 and QWEN3.5 variants.

CRAN Compliance

Replaced abort() in ggml_abort() with raise(SIGABRT) under #ifdef USING_R, and replaced the abort() token in ggml.cpp with std::terminate().
Guarded ggml_print_backtrace() body and fflush(stdout) / fprintf(stderr, …) in ggml_abort() with #ifndef USING_R to remove _Exit, stdout, and stderr symbol references from ggml.o on macOS.
Added #define _GNU_SOURCE to ggml-cpu.c (required for SCHED_BATCH, CPU_ZERO, pthread_setaffinity_np on Linux).
CXX_STD = CXX17 replaces -std=c++17 in PKG_CXXFLAGS in both Makevars and Makevars.win.
-fno-builtin-printf added to GGML_CFLAGS to suppress printf → puts optimizations.
Man pages added for edge_install_cuda, edge_install_cuda_toolkit, edge_reload_cuda, edge_cuda_info.

edgemodelr 0.2.0

SIMD Optimizations for Faster CPU Inference

New Features

Flash attention support: Enabled by default in edge_load_model() via flash_attn = TRUE. Reduces memory usage and improves attention computation speed on CPU.
Full hardware thread utilization: Removed the 4-thread cap for small contexts. edge_load_model() now uses all available CPU threads by default, with n_threads_batch set to the maximum for prompt processing.
User-configurable threading: New n_threads parameter in edge_load_model() allows explicit control over CPU thread count. Pass NULL (default) for auto-detect or an integer to limit cores.
Apple Accelerate framework (macOS): Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
Compiler auto-vectorization: Added -ftree-vectorize to GGML compilation flags on all platforms, allowing GCC/Clang to generate SIMD instructions for eligible loops beyond the hand-tuned GGML kernels.

Existing Features

SIMD-optimized build system: Replaced generic scalar fallback with architecture-aware SIMD detection in both Makevars (Unix) and Makevars.win (Windows)
- x86_64: Enables SSE4.2 baseline by default (universal since Intel Nehalem 2008)
- aarch64/arm64: NEON support built into the ABI (no extra flags needed)
- Other architectures: Automatic generic fallback
User-configurable SIMD levels: Set the EDGEMODELR_SIMD environment variable before installation to select the optimization level:
- GENERIC: Scalar fallback (maximum compatibility)
- SSE42: SSE4.2 baseline (default on x86_64)
- AVX: AVX + F16C (Intel Sandy Bridge 2011+)
- AVX2: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
- AVX512: AVX-512 (Intel Skylake-X 2017+)
- NATIVE: Uses -march=native for maximum performance on the build machine
edge_simd_info(): New function to query compile-time SIMD status including architecture, compiler features, and GGML optimization flags
x86 architecture-specific quantization: Enabled optimized x86 quantization kernels (arch/x86/quants.c, arch/x86/repack.cpp) with SIMD-accelerated dot products and matrix operations
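
One way to select a SIMD level from within R before a source install is sketched below. The level names are the ones listed above; set_simd_level() is a hypothetical convenience wrapper, not an edgemodelr function, and the install call is left commented out because it rebuilds the package.

```r
# Levels as listed above; the helper name is hypothetical.
simd_levels <- c("GENERIC", "SSE42", "AVX", "AVX2", "AVX512", "NATIVE")

set_simd_level <- function(level) {
  level <- toupper(level)
  if (!level %in% simd_levels) {
    stop("Unknown SIMD level: ", level)
  }
  # The variable is inherited by the build started from this session.
  Sys.setenv(EDGEMODELR_SIMD = level)
  invisible(level)
}

set_simd_level("avx2")
print(Sys.getenv("EDGEMODELR_SIMD"))

# Then, in the same session, reinstall from source (not run here):
# install.packages("edgemodelr", type = "source")
```

After reinstalling, edge_simd_info() can be used to confirm which optimizations were compiled in.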

Performance

15-40% faster inference on x86_64 with SSE4.2 baseline vs generic scalar
Up to 2-3x faster with AVX2 for quantized model operations
SSSE3-accelerated integer multiply-accumulate for quantized dot products

edgemodelr 0.1.5

CRAN Policy Fixes

Bug Fixes

Fixed donttest examples: Changed resource-intensive examples from \donttest{} to \dontrun{} to prevent downloading multi-GB models during CRAN checks
Fixed M1 Mac compiler warnings: Added explicit static_cast<> for:
- double to float conversions for temperature/top_p parameters
- size_type to int32_t conversions for buffer size parameters
Fixed connection handling: Replaced on.exit() with tryCatch/finally for proper connection cleanup in loops (thanks @eddelbuettel)
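
The connection-handling pattern named in the last fix can be sketched in base R as follows. This is a generic illustration of tryCatch with a finally clause inside a loop, not the package's actual code; the temporary files exist only to make the example self-contained.

```r
# Create two small temporary files to read from.
paths <- replicate(2, tempfile(fileext = ".txt"))
for (p in paths) writeLines(c("first line", "second line"), p)

first_lines <- character(0)
for (p in paths) {
  con <- file(p, open = "r")
  # on.exit() handlers run only when the enclosing function exits, so
  # inside a loop each connection is closed explicitly via `finally`,
  # which runs whether or not readLines() raises an error.
  first_lines <- c(first_lines, tryCatch(
    readLines(con, n = 1L),
    finally = close(con)
  ))
}

print(first_lines)
```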

edgemodelr 0.1.4

Performance Optimizations for Small Language Models

New Features

Small Model Configuration Helper: New edge_small_model_config() function provides optimized settings for small models (1B-3B parameters)
- Device-specific presets: mobile, laptop, desktop, and server
- Adaptive configuration based on model size and available RAM
- Built-in performance tips and recommendations
- Automatic parameter tuning for optimal inference speed
Adaptive Batch Processing: Intelligent batch size optimization based on context length
- Small contexts (≤512): Uses up to full context for batching
- Medium contexts (512-2048): Uses 1/2 context for optimal throughput
- Large contexts (2048-4096): Uses 1/4 context to balance speed and memory
- Very large contexts (>4096): Caps at 2048 tokens for stability
Smart Thread Allocation: Context-aware CPU thread management
- Small models automatically limit threads to avoid overhead
- Reduces CPU contention on resource-constrained devices
- Improves inference speed for models with contexts ≤2048 tokens
Automatic Context Optimization: Model size-based context tuning
- Small models (<1GB): Optimized to 1024 tokens for faster inference
- Medium models (1-2GB): Set to 1536 tokens for balanced performance
- Large models (>2GB): Maintains 2048+ tokens for quality
- User override available via n_ctx parameter
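
The batch-size tiers listed under Adaptive Batch Processing can be written out as a small function. This is a sketch of the rule as described for this release, not the C++ binding's code; adaptive_batch_size() is a hypothetical name, and the tier that the boundary values 512, 2048, and 4096 fall into is an assumption, since the ranges above overlap at their edges.

```r
# Hypothetical helper mirroring the tiers listed above.
# Assumption: each boundary value belongs to the lower tier.
adaptive_batch_size <- function(n_ctx) {
  stopifnot(is.numeric(n_ctx), length(n_ctx) == 1L, n_ctx > 0)
  if (n_ctx <= 512) {
    n_ctx          # small: up to the full context
  } else if (n_ctx <= 2048) {
    n_ctx %/% 2    # medium: half the context
  } else if (n_ctx <= 4096) {
    n_ctx %/% 4    # large: a quarter of the context
  } else {
    2048           # very large: fixed cap
  }
}

sizes <- vapply(c(256, 1024, 4096, 8192), adaptive_batch_size, numeric(1))
print(sizes)
```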

Performance Improvements

Faster Small Model Inference: 15-30% speed improvement for small models through optimized batch and thread settings
Reduced Memory Footprint: Better memory efficiency for resource-constrained environments
Lower Latency: Optimized thread allocation reduces context switching overhead
Better Scalability: Adaptive configurations scale from mobile devices to servers

Examples and Documentation

Small Model Optimization Example: Comprehensive example demonstrating all optimization features
- Configuration comparison across device types
- Performance benchmarking workflow
- Best practices for different model sizes
- Manual tuning guidelines
Enhanced Testing: New test suite for small model configuration
- Tests for all device target configurations
- Validation of adaptive parameter adjustments
- Safety checks for edge cases

Technical Details

Improved C++ bindings with adaptive batch size calculations
Enhanced R API with intelligent parameter defaults
Better integration between model size detection and configuration
Comprehensive documentation for optimization features

edgemodelr 0.1.2

Major New Features

Ollama Integration

Native Ollama Support: Complete integration with Ollama models through automatic model discovery and SHA-256 hash-based loading
edge_find_ollama_models() - Discover all locally available Ollama models across platforms (Windows, macOS, Linux)
edge_load_ollama_model() - Load Ollama models using convenient SHA-256 hash prefixes instead of full file paths
test_ollama_model_compatibility() - Built-in compatibility testing for Ollama models
Cross-platform Model Detection: Robust model discovery supporting standard installations, snap packages (Linux), and various Windows configurations
Windows OneDrive Compatibility: Enhanced path detection that properly handles Windows OneDrive document folder redirections
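
The idea of loading by hash prefix can be illustrated with a base-R sketch that resolves a prefix to exactly one full digest. match_hash_prefix() is a hypothetical helper and the digests are made up and truncated for readability; how edge_load_ollama_model() handles missing or ambiguous prefixes is not stated here, so the error behavior below is an assumption.

```r
# Hypothetical helper: resolve a hash prefix to exactly one full hash.
match_hash_prefix <- function(prefix, hashes) {
  hits <- hashes[startsWith(hashes, tolower(prefix))]
  if (length(hits) == 0L) stop("No model matches prefix: ", prefix)
  if (length(hits) > 1L) stop("Prefix is ambiguous: ", prefix)
  hits
}

# Made-up, shortened digests for illustration only.
hashes <- c("b5dc5e784f2a", "b5a1c09e77d3", "6a0746a1ec1a")

print(match_hash_prefix("6a07", hashes))
```

A prefix such as "b5" matches two entries in this toy list and would stop with an error, which is why a few extra characters are usually worth typing.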

Comprehensive Examples Suite

Structured Learning Path: Complete examples directory with progressive difficulty levels (Beginner → Intermediate → Advanced)
01_basic_usage.R: Fundamental operations including model loading, text generation, parameter tuning, and error handling
02_ollama_integration.R: Complete Ollama workflow with model discovery, hash-based loading, and compatibility testing
03_streaming_generation.R: Real-time streaming text generation with interactive chat interfaces and callback processing
04_performance_optimization.R: Advanced performance tuning including GPU acceleration, benchmarking, memory management, and batch processing
examples/README.md: Comprehensive documentation with learning paths, troubleshooting guide, and customization instructions

Package Structure Improvements

Organized File Structure: Consolidated all examples into structured examples/ directory with consistent formatting
Enhanced Documentation: Improved inline documentation and example comments throughout

edgemodelr 0.1.1

Bug Fixes and Improvements

Compilation Fixes

macOS Boolean Conflicts: Completely resolved Boolean enum conflicts by avoiding problematic system headers and using direct function declarations
Filesystem Compatibility: Added comprehensive fallback implementation for disabled std::filesystem on macOS builds
Header Protection: Implemented robust cross-platform header inclusion strategy that works with R, Rcpp, and system headers
System Header Workarounds: Replaced <mach-o/dyld.h> inclusion with direct function declarations to avoid enum conflicts
Format Attribute Warnings: Suppressed unsupported printf format attribute warnings on macOS Apple Clang compiler
CRAN Compliance: Removed non-portable optimization flags (-march=native, -mtune=native, etc.) from Makevars for CRAN compatibility
Cross-platform Build: Enhanced Makevars configuration for better macOS compatibility with R package requirements

Demo and Documentation Updates

Modern UI: Updated streaming chat demo with modern bslib interface for enhanced user experience
Documentation: Improved documentation for edge_clean_cache() function
Examples: Enhanced streaming chat example with better UI components

Technical Improvements

Build System: Updated Makevars files for improved compilation on Windows and Unix systems
Core Bindings: Enhanced C++ bindings for better performance and stability

edgemodelr 0.1.0

Initial CRAN Release

New Features

Local LLM Inference: Complete R interface for running large language models locally using llama.cpp and GGUF model files
Model Management: Built-in functions for downloading and managing popular models from Hugging Face
Text Generation: Support for both blocking and streaming text completion
Interactive Chat: Real-time streaming chat interface with conversation history
Privacy-First: All processing happens locally without external API calls

Core Functions

edge_load_model() - Load GGUF model files for inference
edge_completion() - Generate text completions
edge_stream_completion() - Stream text generation with real-time callbacks
edge_chat_stream() - Interactive chat session with streaming responses
edge_free_model() - Memory management and cleanup
is_valid_model() - Model context validation

Model Management

edge_list_models() - List pre-configured popular models
edge_download_model() - Download models from Hugging Face Hub
edge_quick_setup() - One-line model download and setup

System Support

Self-contained: Includes complete llama.cpp implementation
Cross-platform: Works on Windows, macOS, and Linux
CPU optimized: Runs efficiently on standard hardware
Memory efficient: Support for quantized models

Documentation

Comprehensive getting started vignette
Complete API documentation with examples
README with extensive usage examples
Test coverage for all major functionality

Technical Implementation

C++17 integration via Rcpp
Real-time token streaming with callback support
Automatic memory management with RAII
Robust error handling and validation
Thread-safe model operations

This release provides a complete, production-ready engine for local large language model inference in R, enabling private, offline text generation workflows.
