Whisper.cpp Rust Bindings

Rust bindings for whisper.cpp with stdin/PCM streaming support

Overview

Safe Rust bindings for the OperatorKit whisper.cpp PCM fork. The crate wraps whisper.cpp inference, adds ergonomic streaming types, and exposes raw PCM streaming for agents or services that already have normalized audio bytes.

Highlights

  • PCM streaming - WhisperStreamPcm reads raw PCM from any Rust Read source.
  • Thread-safe context - share WhisperContext with workers while each transcription owns its state.
  • VAD options - use built-in fixed-step processing or Silero-backed segmentation.
  • Enhanced transcription - optional VAD aggregation and temperature fallback for difficult audio.

Best fit

Use this when your Rust app already handles capture, transport, or decoding and needs local Whisper inference without shelling out. For command-line piping and direct stdin processing, use the underlying PCM fork binary.

use whisper_cpp_plus::WhisperContext;

let ctx = WhisperContext::new("models/ggml-base.en.bin")?;
let audio: Vec<f32> = load_audio("audio.wav");
let text = ctx.transcribe(&audio)?;
println!("{}", text);

Quick Start

use whisper_cpp_plus::WhisperContext;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = WhisperContext::new("models/ggml-base.en.bin")?;

    // Audio must be 16 kHz mono f32.
    // load_audio stands in for your own loader (for example, one built on hound).
    let audio: Vec<f32> = load_audio("audio.wav");
    let text = ctx.transcribe(&audio)?;
    println!("{}", text);

    Ok(())
}

Features

  • Thread-safe - WhisperContext is Send + Sync, share via Arc
  • Streaming - real-time transcription via WhisperStream and WhisperStreamPcm
  • VAD - simple energy VAD by default plus Silero Voice Activity Detection integration
  • Enhanced VAD - segment aggregation for optimal transcription chunks
  • Temperature fallback - quality-based retry with multiple temperatures
  • Async - tokio::task::spawn_blocking wrappers (feature = async)
  • Cross-platform - Windows (MSVC), Linux, macOS (Intel & Apple Silicon)
  • Quantization - model compression via WhisperQuantize (feature = quantization)
  • Hardware acceleration - SIMD auto-detected, GPU via feature flags

Installation

[dependencies]
whisper-cpp-plus = "0.1.5"

# Optional
hound = "3.5"  # WAV file loading

System Requirements

  • Rust 1.70.0+
  • CMake 3.14+
  • C++ compiler (MSVC on Windows, GCC/Clang on Linux/macOS)

Feature Flags

whisper-cpp-plus = { version = "0.1.5", features = ["quantization"] }  # Model quantization
whisper-cpp-plus = { version = "0.1.5", features = ["async"] }         # Async API
whisper-cpp-plus = { version = "0.1.5", features = ["cuda"] }          # NVIDIA GPU
whisper-cpp-plus = { version = "0.1.5", features = ["metal"] }         # macOS GPU

CUDA GPU Acceleration

Install the CUDA Toolkit and build:

cargo build --features cuda

The build script uses CMake to compile whisper.cpp with CUDA support automatically. The CUDA toolkit is located by checking CUDA_PATH, then CUDA_HOME, then the standard install paths.
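
The lookup order described above can be sketched as a small function. This is an illustration only and not the crate's build script: the name find_cuda_root is hypothetical, and the fallback directories are supplied by the caller because this document does not list which install paths are probed.

```rust
use std::path::PathBuf;

/// Hypothetical helper: resolve a CUDA root by trying CUDA_PATH, then
/// CUDA_HOME, then caller-supplied fallback directories, in that order.
/// `env` and `exists` are injected so the ordering can be tested without
/// touching the real environment or filesystem.
fn find_cuda_root(
    env: impl Fn(&str) -> Option<String>,
    fallbacks: &[&str],
    exists: impl Fn(&str) -> bool,
) -> Option<PathBuf> {
    for var in ["CUDA_PATH", "CUDA_HOME"] {
        if let Some(value) = env(var) {
            // Treat an empty variable the same as an unset one.
            if !value.is_empty() {
                return Some(PathBuf::from(value));
            }
        }
    }
    // No variable set: take the first fallback directory that exists.
    fallbacks
        .iter()
        .copied()
        .find(|p| exists(*p))
        .map(PathBuf::from)
}
```

In a real build script the closures would typically be `|name| std::env::var(name).ok()` and `|p| std::path::Path::new(p).is_dir()`.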

Advanced: prebuilt libraries — for CI or to skip recompilation, set WHISPER_PREBUILT_PATH to a directory containing pre-compiled static libs. See docs/CACHING_GUIDE.md.

API Overview

Core Types

Type                        | Description                             | whisper.cpp equivalent
----------------------------|-----------------------------------------|-----------------------
WhisperContext              | Model context (Send + Sync)             | whisper_context*
WhisperState                | Transcription state (Send only)         | whisper_state*
FullParams                  | Transcription parameters                | whisper_full_params
TranscriptionResult         | Text + timestamped segments             | -
WhisperStream               | Chunked real-time streaming             | -
WhisperStreamPcm            | Streaming from raw PCM input            | stream-pcm.cpp
WhisperVadProcessor         | Silero voice activity detection         | whisper_vad_*
EnhancedWhisperVadProcessor | VAD + segment aggregation               | -
EnhancedWhisperState        | Transcription with temperature fallback | -
WhisperQuantize             | Model quantization (feature)            | quantize.cpp

Examples

Transcription with parameters:

use whisper_cpp_plus::{WhisperContext, TranscriptionParams};

let ctx = WhisperContext::new("model.bin")?;
let params = TranscriptionParams::builder()
    .language("en")
    .temperature(0.0)
    .enable_timestamps()
    .n_threads(4)
    .build();

let result = ctx.transcribe_with_params(&audio, params)?;
for segment in &result.segments {
    println!("[{:.2}s - {:.2}s] {}",
        segment.start_seconds(), segment.end_seconds(), segment.text);
}

Concurrent transcription:

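use std::sync::Arc;
use whisper_cpp_plus::WhisperContext;

let ctx = Arc::new(WhisperContext::new("model.bin")?);

// `files` is an owned list of paths, so each thread can take its own entry.
// Each thread gets its own WhisperState internally
let handles: Vec<_> = files.into_iter().map(|file| {
    let ctx = Arc::clone(&ctx);
    std::thread::spawn(move || ctx.transcribe(&load_audio(&file)))
}).collect();

// Wait for every worker and print its transcript
for handle in handles {
    println!("{}", handle.join().expect("worker thread panicked")?);
}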

Streaming:

use whisper_cpp_plus::{FullParams, SamplingStrategy, WhisperContext, WhisperStream};

let ctx = WhisperContext::new("model.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
let mut stream = WhisperStream::new(&ctx, params)?;

loop {
    let chunk = get_audio_chunk(); // your audio source
    stream.feed_audio(&chunk);
    let segments = stream.process_pending()?;
    for seg in &segments {
        println!("{}", seg.text);
    }
}

PCM streaming (WhisperStreamPcm):

use whisper_cpp_plus::{
    FullParams, PcmFormat, PcmReader, PcmReaderConfig, SamplingStrategy, WhisperContext,
    WhisperStreamPcm, WhisperStreamPcmConfig, WhisperVadProcessor,
};

let ctx = WhisperContext::new("model.bin")?;
let vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });

let config = WhisperStreamPcmConfig {
    use_vad: true,
    vad_thold: 0.6,
    vad_silence_ms: 800,
    vad_pre_roll_ms: 300,
    length_ms: 10000,
    ..Default::default()
};

// The source must already yield raw PCM bytes matching PcmReaderConfig.
// For this config that means 16 kHz mono f32 little-endian PCM.
let source = std::fs::File::open("audio_f32_16khz_mono.pcm")?;
let reader = PcmReader::new(
    Box::new(source),
    PcmReaderConfig {
        buffer_len_ms: 10000,
        sample_rate: 16000,
        format: PcmFormat::F32,
    },
);

let mut stream = WhisperStreamPcm::with_vad(&ctx, params, config, reader, vad)?;

stream.run(|segments, _start_ms, _end_ms| {
    for seg in segments {
        println!("{}", seg.text.trim());
    }
})?;

Notes:

  • PcmReader does not decode WAV/MP3, resample audio, or convert stereo to mono. Your Read source must already be normalized to the format described by PcmReaderConfig.
  • WhisperStreamPcm::new(...) uses fixed-step mode or simple built-in VAD depending on use_vad.
  • WhisperStreamPcm::with_vad(...) uses an explicit WhisperVadProcessor (Silero VAD) and is the recommended path when you want Silero-based segmentation.
  • In VAD mode, no_context is forced internally to match stream-pcm.cpp.
  • In VAD mode, run() emits the next completed speech chunk in chronological order, and callers can usually append those segments directly.
  • In fixed-step mode, callbacks are produced from overlapping windows, so callers that build a cumulative transcript may need to reconcile repeated text across callbacks.
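
For the fixed-step case, one simple way to reconcile repeated text is to drop the leading words of each new callback that already appear at the end of the transcript so far. The sketch below is an assumption about how a caller might do this and is not part of the crate; append_deduped is a hypothetical helper. Exact word matching will miss overlaps where the model re-punctuates or rewords the repeated region, so treat it as a starting point.

```rust
/// Hypothetical helper: append `next` to `transcript`, skipping the longest
/// prefix of `next` (in whole words) that already ends `transcript`.
fn append_deduped(transcript: &mut Vec<String>, next: &str) {
    let words: Vec<&str> = next.split_whitespace().collect();
    let max_overlap = transcript.len().min(words.len());

    // Try the longest candidate overlap first and shrink until one matches.
    let mut overlap = 0;
    for k in (1..=max_overlap).rev() {
        let tail = &transcript[transcript.len() - k..];
        let matches = tail
            .iter()
            .zip(&words[..k])
            .all(|(a, b)| a.eq_ignore_ascii_case(*b));
        if matches {
            overlap = k;
            break;
        }
    }

    // Keep only the words that were not already present.
    transcript.extend(words[overlap..].iter().map(|w| w.to_string()));
}
```

Inside a run() callback, a caller would pass each segment's trimmed text to append_deduped and join the accumulated words with spaces when the transcript is needed.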

VAD preprocessing:

use whisper_cpp_plus::{WhisperVadProcessor, VadParams};

let mut vad = WhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = VadParams::default();
let segments = vad.segments_from_samples(&audio, &params)?;

for (start, end) in segments.get_all_segments() {
    let start_sample = (start * 16000.0) as usize;
    let end_sample = (end * 16000.0) as usize;
    let text = ctx.transcribe(&audio[start_sample..end_sample])?;
    println!("[{:.1}s-{:.1}s] {}", start, end, text);
}
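
The seconds-to-samples conversion in the loop above can produce an index past the end of the buffer if a segment boundary rounds slightly beyond the audio length. A minimal sketch of a clamped conversion follows; segment_range is a hypothetical helper, not a crate API, and it assumes segment times are f32 seconds as in the example.

```rust
use std::ops::Range;

/// Hypothetical helper: map a [start, end] interval in seconds to a sample
/// range that is always valid for a buffer of `len` samples.
fn segment_range(start_s: f32, end_s: f32, sample_rate: u32, len: usize) -> Range<usize> {
    let to_index = |seconds: f32| {
        // max(0.0) turns negative or NaN inputs into 0; min(len) caps the top.
        ((seconds.max(0.0) * sample_rate as f32) as usize).min(len)
    };
    let start = to_index(start_s);
    // Never let the end fall before the start, so slicing cannot panic.
    let end = to_index(end_s).max(start);
    start..end
}
```

With this helper the slice becomes `&audio[segment_range(start, end, 16000, audio.len())]`.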

Enhanced VAD with segment aggregation:

use whisper_cpp_plus::enhanced::{EnhancedWhisperVadProcessor, EnhancedVadParams};

let mut vad = EnhancedWhisperVadProcessor::new("models/ggml-silero-vad.bin")?;
let params = EnhancedVadParams::default();
let chunks = vad.process_with_aggregation(&audio, &params)?;

for chunk in &chunks {
    let text = ctx.transcribe(&chunk.audio)?;
    println!("[{:.1}s, {:.1}s long] {}", chunk.offset_seconds, chunk.duration_seconds, text);
}

Temperature fallback for difficult audio:

let params = TranscriptionParams::builder()
    .language("en")
    .build();
let result = ctx.transcribe_with_params_enhanced(&audio, params)?;
// Automatically retries with higher temperatures if quality thresholds aren't met

More examples in whisper-cpp-plus/examples/.

Enhanced Features

Beyond standard whisper.cpp bindings, this crate provides optimizations inspired by faster-whisper:

Intelligent VAD Preprocessing

EnhancedWhisperVadProcessor aggregates Silero VAD speech segments into optimal-sized chunks for transcription. Instead of transcribing hundreds of tiny segments, it merges adjacent speech into configurable windows — 2-3x faster on audio with significant silence.
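
The idea behind aggregation can be sketched as a greedy merge over sorted speech spans. This is a simplified illustration under stated assumptions, not the crate's implementation: the Span type, the aggregate function, and the max_gap and max_len parameters are all hypothetical names.

```rust
/// A speech span in seconds, as a VAD pass might report it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: f32,
    end: f32,
}

/// Hypothetical aggregation: merge consecutive spans while the silence
/// between them is at most `max_gap` seconds and the merged chunk stays
/// within `max_len` seconds. `spans` must be sorted by start time.
fn aggregate(spans: &[Span], max_gap: f32, max_len: f32) -> Vec<Span> {
    let mut chunks: Vec<Span> = Vec::new();
    for &span in spans {
        if let Some(last) = chunks.last_mut() {
            let gap_ok = span.start - last.end <= max_gap;
            let len_ok = span.end - last.start <= max_len;
            if gap_ok && len_ok {
                last.end = span.end; // extend the current chunk
                continue;
            }
        }
        chunks.push(span); // start a new chunk
    }
    chunks
}
```

Fewer, longer chunks mean fewer inference calls, which is where the speedup on silence-heavy audio would come from.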

Temperature Fallback

EnhancedWhisperState automatically retries transcription at higher temperatures when quality thresholds aren’t met (compression ratio, log probability, no-speech probability). Handles noisy/difficult audio without manual intervention.
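
The retry logic can be sketched independently of the crate. Everything named below (Attempt, Thresholds, decode_with_fallback) is hypothetical, and the decoder is passed in as a closure so the control flow stands on its own. The no-speech probability check is left out for brevity.

```rust
/// Quality signals for one decoding attempt (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
struct Attempt {
    text: String,
    compression_ratio: f32,
    avg_logprob: f32,
}

/// Limits an attempt must satisfy to be accepted.
struct Thresholds {
    max_compression_ratio: f32,
    min_avg_logprob: f32,
}

/// Hypothetical fallback loop: call `decode` at each temperature in order
/// and return the first attempt that passes both checks. If none pass, the
/// last attempt is returned; the result is None only for an empty list.
fn decode_with_fallback(
    temperatures: &[f32],
    thresholds: &Thresholds,
    mut decode: impl FnMut(f32) -> Attempt,
) -> Option<(f32, Attempt)> {
    let mut last = None;
    for &t in temperatures {
        let attempt = decode(t);
        let ok = attempt.compression_ratio <= thresholds.max_compression_ratio
            && attempt.avg_logprob >= thresholds.min_avg_logprob;
        if ok {
            return Some((t, attempt));
        }
        last = Some((t, attempt));
    }
    last
}
```

OpenAI's reference Whisper implementation defaults to a compression ratio limit of 2.4 and an average log probability floor of -1.0; whether this crate uses the same values is not stated here, so check its documentation before relying on them.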

Both features are orthogonal — use one, both, or neither. See docs/ARCHITECTURE.md for design details.

Models

Downloading

The easiest way to get test models:

cargo xtask test-setup

This downloads ggml-tiny.en.bin and the Silero VAD model into whisper-cpp-plus-sys/whisper.cpp/models/ using whisper.cpp’s own download scripts.

For production models, download from Hugging Face:

curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

Available Models

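Model    | Size   | English-only | Multilingual
---------|--------|--------------|-------------
tiny     | 39 MB  | tiny.en      | tiny
base     | 142 MB | base.en      | base
small    | 466 MB | small.en     | small
medium   | 1.5 GB | medium.en    | medium
large-v3 | 3.1 GB | -            | large-v3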

Safety

Thread Safety

  • WhisperContext: Send + Sync — share via Arc
  • WhisperState: Send only — one per thread
  • FullParams: not Send/Sync — create per transcription

Memory Safety

All unsafe FFI operations are encapsulated behind safe wrappers with null pointer checks, lifetime enforcement, and RAII cleanup.

Troubleshooting

“Failed to load model” — check file path, permissions, available memory

“Invalid audio format” — must be 16kHz mono f32, normalized to [-1, 1]

Linking errors on Windows — install Visual Studio Build Tools 2022, ensure x64 MSVC toolchain. See docs/TECHNICAL_REFERENCE.md.
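
For the "Invalid audio format" case above, a common cause is passing 16-bit integer or stereo samples where mono f32 is expected. The sketch below shows one way to normalize such input using only the standard library. Both function names are hypothetical, and resampling to 16 kHz is not covered, so the input must already be at the target sample rate.

```rust
/// Convert interleaved 16-bit PCM to mono f32 samples in [-1.0, 1.0).
/// `channels` is the number of interleaved channels (1 = mono, 2 = stereo).
/// Trailing samples that do not form a full frame are dropped.
fn i16_to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channels must be at least 1");
    interleaved
        .chunks_exact(channels)
        .map(|frame| {
            // Average the channels, then scale so i16::MIN maps to -1.0.
            let sum: f32 = frame.iter().map(|&s| s as f32).sum();
            sum / channels as f32 / 32768.0
        })
        .collect()
}

/// Serialize f32 samples as little-endian bytes, the layout of a raw
/// f32 little-endian PCM stream.
fn f32_to_le_bytes(samples: &[f32]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}
```

The output of i16_to_mono_f32 has the shape the transcription examples expect, and f32_to_le_bytes produces bytes that could back a Read source for the PCM streaming example, assuming the audio is already at 16 kHz.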