The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Native R torch implementation of OpenAI Whisper for speech-to-text transcription.
install.packages("whisper")
torch::install_torch() # one-time: downloads the C++ backendOr install the development version from GitHub:
remotes::install_github("cornball-ai/whisper")
torch::install_torch(){whisper} uses {torch} for inference. After
installing the R package, install_torch() downloads the
compiled C++ libraries (Lantern). You only need to run it once.
littler/r2u users: If install_torch()
fails with a permissions error, torch was installed to the system
library. Use sudo for the one-time Lantern download:
sudo r -e 'torch::install_torch()'library(whisper)
# Transcribe the bundled JFK "Ask not" speech (prompts to download model on first use)
jfk <- system.file("audio", "jfk.mp3", package = "whisper")
result <- transcribe(jfk)
result$text
#> "Ask not what your country can do for you, ask what you can do for your country."On first use, you’ll be prompted to download the model:
Download 'tiny' model (~151 MB) from HuggingFace? (Yes/no/cancel)
# Download a model explicitly
download_whisper_model("tiny")
# List available models
list_whisper_models()
#> [1] "tiny" "base" "small" "medium" "large-v3"
# Check which models are downloaded
list_downloaded_models()
# Check if a specific model exists locally
model_exists("tiny")# Basic transcription
result <- transcribe("audio.wav")
print(result$text)
# Specify model size
result <- transcribe("audio.wav", model = "small")
# Force CPU (useful if CUDA has issues)
result <- transcribe("audio.wav", device = "cpu")
# Non-English audio (specify language for better accuracy)
allende <- system.file("audio", "allende.mp3", package = "whisper")
result <- transcribe(allende, language = "es")
# Translate to English (quality is model-dependent; larger models work better)
result <- transcribe(allende, task = "translate", language = "es", model = "small")# Segment-level timestamps
result <- transcribe("audio.wav", timestamps = TRUE)
result$segments
#> start end text
#> 1 0.00 7.44 Ask not what your country...
# Word-level timestamps (via cross-attention DTW alignment)
result <- transcribe("audio.wav", word_timestamps = TRUE)
result$words
#> word start end
#> 1 Ask 0.00 0.54
#> 2 not 0.54 1.16
#> 3 what 1.16 2.46
#> ...Both work with the pipeline API for repeated transcription:
pipe <- whisper_pipeline("tiny")
result <- pipe$transcribe("audio.wav", word_timestamps = TRUE)
result$wordsserve() runs an OpenAI-compatible STT server that loads
the model once and keeps it resident, so a client (or {stt.api})
can transcribe over HTTP:
whisper::serve(model = "small", port = 7809L)curl -X POST http://localhost:7809/v1/audio/transcriptions \
-F file=@speech.wav -F response_format=verbose_jsonEndpoints: GET /health,
POST /v1/audio/transcriptions,
POST /v1/audio/translations. It’s built on base R sockets
(no extra dependencies). A systemd unit ships in
system.file("whisper.service", package = "whisper").
jit = TRUE, the default
on CUDA) instead of dozens of per-op R calls, several times faster than
the eager path and equivalent token-for-token. Covers greedy and
word-timestamp runs; beam search and CPU use the eager decoder.!).
whisper_dtype() auto-falls back to float32 on those cards;
pass dtype = "float16" to override.whisper_tune_gc() is an opt-in helper
that tunes torch’s CUDA allocator GC. It is largely inert for whisper
(whisper is dispatch-bound, not GC-bound — JIT is the lever); it’s kept
as cheap insurance.| Model | Parameters | Disk (fp32) | English WER | Peak VRAM (CUDA fp16) | Speed* |
|---|---|---|---|---|---|
| tiny | 39M | 151 MB | ~9% | 564 MiB | 1.1s |
| base | 74M | 290 MB | ~7% | 734 MiB | 1.1s |
| small | 244M | 967 MB | ~5% | 1,454 MiB | 1.2s |
| medium | 769M | 3.0 GB | ~4% | 3,580 MiB | 1.3s |
| large-v3 | 1550M | 6.2 GB | ~3% | 3,892 MiB | 2.7s |
*Speed is a warm transcribe of a 17s clip on an RTX 5060 Ti with
word_timestamps = TRUE (the heavier path; plain greedy is
several times faster, e.g. large-v3 ~1.4s), excluding one-time model
load and JIT compilation. Peak VRAM includes ~364 MiB torch CUDA context
overhead.
Models are downloaded from HuggingFace and cached in
~/.cache/huggingface/ unless otherwise specified.
MIT
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.