The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
serve(): a single-process, OpenAI-compatible HTTP
STT server (POST /v1/audio/transcriptions and
/translations, GET /health) built on base R
sockets, with no new dependencies. It loads the model once and keeps it
resident, so it drops in for the OpenAI API or a Whisper container;
point stt.api at it with set_stt_base().
Returns text, json, or
verbose_json (segment timestamps, plus per-word timestamps
when the request includes timestamp_granularities[]=word).
An example systemd unit ships in
system.file("whisper.service", package = "whisper").jit_compile’d TorchScript call instead of dozens of
dispatched R->torch calls, several times faster end-to-end and
token-for-token equivalent to the eager path. Covers both greedy and
word-timestamp decoding. On by default via the new jit
argument to transcribe()/whisper_pipeline();
pass jit = FALSE for the eager decoder. No effect on CPU or
beam search.openai-whisper: decoding suppresses non-speech tokens
(brackets, music notes, speaker tags) and control tokens at every step,
so output no longer contains
[BLANK_AUDIO]/[MUSIC PLAYING]-style
annotations; the seek loop decodes only the real audio
(content_frames), not the fixed 30s of mel padding, so a 7s
clip no longer trails off into hallucinated text up to 30s; and a
no-speech-probability gate skips windows that read as silence. The
special-token table gains sot_lm and
sot_prev.sample_len) rather than the full context, and the default
temperatures enable the existing compression-ratio
fallback, which re-decodes too-repetitive output at a higher
temperature.tokenizer_encode() crashing for models whose
vocab.json omits the <|endoftext|> key
(large-v3): the end-of-text id now comes from the special-token table
(as in the Python reference, which keeps special tokens out of the BPE
vocab), and the lookup can no longer return a list. A regression test
covers a vocab without the key, and encode_special()
resolves the core special tokens from the table too.whisper_dtype() now falls back to float32 on the GTX
16-series (TU116/TU117: GTX 1630/1650/1660 and Ti/Super variants), which
compute fp16 incorrectly and return NaN (seen as repeated “!” tokens).
Detection is by GPU name, CUDA-gated and tryCatch-guarded (dormant on
non-CUDA/CRAN machines); pass dtype = "float16" to
override.whisper_tune_gc(): opt-in helper that tunes torch’s
CUDA allocator GC rates for inference. No-op off CUDA, and only sets
options that are unset.torch::torch_scaled_dot_product_attention() instead of
reaching into torch’s namespace; the torch dependency is floored at
0.17.0, where it is exported.transcribe() now defaults to
language = NULL, which detects the spoken language from the
audio before decoding. New exported function
detect_language() for standalone language identification.
Breaking: previous default was
language = "en". Code relying on the default now
auto-detects instead of assuming English. Pass
language = "en" explicitly to restore old behavior.whisper_pipeline() for cached model reuse across
multiple transcriptionsadded_tokens.json download)transcribe_chunk()These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.