> whisper
Transcribe audio to text with OpenAI Whisper. Use when a user asks to transcribe audio files, generate subtitles (SRT/VTT), transcribe podcasts, convert speech to text, translate audio to English, build transcription pipelines, do speaker diarization, transcribe meetings, process voice memos, create searchable audio archives, or integrate speech-to-text into applications. Covers OpenAI Whisper (local), Whisper API, faster-whisper, whisper.cpp, and production deployment patterns.
Whisper
Overview
Transcribe audio with OpenAI's Whisper — the state-of-the-art speech recognition model. This skill covers local Whisper (Python), faster-whisper (CTranslate2, 4x faster), whisper.cpp (CPU-optimized C++), and the OpenAI Whisper API. Includes subtitle generation (SRT/VTT/JSON), multi-language transcription, translation to English, speaker diarization, word-level timestamps, and production pipeline patterns for podcasts, meetings, and video subtitles.
Instructions
Step 1: Choose Your Runtime
Option A — OpenAI Whisper (original Python):
pip install openai-whisper
# Models: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.5G)
Option B — faster-whisper (recommended for local, 4x faster):
pip install faster-whisper
# Uses CTranslate2 — INT8 quantization, runs well on CPU
Option C — whisper.cpp (best for CPU, minimal dependencies):
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download model
bash models/download-ggml-model.sh base.en
Option D — OpenAI API (no local GPU needed):
pip install openai
export OPENAI_API_KEY="sk-..."
Step 2: Basic Transcription
faster-whisper (recommended):
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
# GPU: model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("episode.mp3", beam_size=5)
print(f"Language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
OpenAI Whisper (original):
import whisper
model = whisper.load_model("base") # tiny, base, small, medium, large-v3
result = model.transcribe("episode.mp3")
print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
whisper.cpp (CLI):
./main -m models/ggml-base.en.bin -f episode.wav -otxt -osrt -ovtt
# Outputs: episode.txt, episode.srt, episode.vtt
# Note: recent whisper.cpp releases build this binary as whisper-cli instead of main
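whisper.cpp's CLI expects 16-bit, 16 kHz mono WAV input, so compressed formats need converting first. A minimal sketch with ffmpeg, assuming an input file named `episode.mp3`:

```shell
# Hypothetical input name; adjust to your file
in="episode.mp3"
out="${in%.*}.wav"   # derives episode.wav from episode.mp3
# 16 kHz sample rate, mono, 16-bit PCM: the format whisper.cpp expects
ffmpeg -i "$in" -ar 16000 -ac 1 -c:a pcm_s16le "$out"
```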
OpenAI API:
from openai import OpenAI
client = OpenAI()
with open("episode.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
    )
print(transcript.text)
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
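One practical constraint on the API path: the transcription endpoint rejects uploads larger than 25 MB. A small pre-flight check can catch this before the request; the helper name `fits_api_limit` is illustrative, not part of the SDK:

```python
import os

API_LIMIT_BYTES = 25 * 1024 * 1024  # documented upload limit for the audio endpoints

def fits_api_limit(path: str, limit: int = API_LIMIT_BYTES) -> bool:
    """Return True if the file can be uploaded in a single request."""
    return os.path.getsize(path) <= limit
```

Files over the limit need to be split (for example with ffmpeg) or re-encoded at a lower bitrate before uploading.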
Step 3: Subtitle Generation (SRT/VTT)
Generate SRT with faster-whisper:
from faster_whisper import WhisperModel
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("episode.mp3", beam_size=5)
segments = list(segments)  # materialize: transcribe() returns a one-shot generator
def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("episode.srt", "w") as f:
    for i, seg in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n")
        f.write(f"{seg.text.strip()}\n\n")
print("Generated episode.srt")
Generate VTT (for HTML5 video):
with open("episode.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for seg in segments:
        start = format_timestamp(seg.start).replace(",", ".")
        end = format_timestamp(seg.end).replace(",", ".")
        f.write(f"{start} --> {end}\n")
        f.write(f"{seg.text.strip()}\n\n")
Word-level timestamps (for karaoke-style subtitles):
segments, info = model.transcribe("episode.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"  [{word.start:.2f}s → {word.end:.2f}s] {word.word}")
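Word-level timestamps can also drive one-cue-per-word subtitles. A sketch that formats word triples as SRT cues; the mock list below is an assumption standing in for faster-whisper's `word.start` / `word.end` / `word.word` attributes:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((seconds % 1) * 1000))
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words) -> str:
    """Emit one SRT cue per (start, end, text) triple."""
    cues = []
    for i, (start, end, text) in enumerate(words, 1):
        cues.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n")
    return "\n".join(cues)

# Mock triples in place of real faster-whisper word objects
mock = [(0.0, 0.42, " Hello"), (0.42, 0.8, " world")]
print(words_to_srt(mock))
```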
Step 4: Language Detection & Translation
# Auto-detect language
segments, info = model.transcribe("foreign_audio.mp3")
print(f"Detected: {info.language} ({info.language_probability:.0%})")
# Force specific language
segments, info = model.transcribe("german.mp3", language="de")
# Translate to English (any language → English)
segments, info = model.transcribe("german.mp3", task="translate")
for seg in segments:
    print(seg.text)  # English translation
Supported languages: 99 languages including en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, nl, ar, sv, it, hi, and many more.
Step 5: Speaker Diarization
Combine Whisper with pyannote-audio for "who said what":
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
import torch
# Transcribe
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.wav", beam_size=5)
whisper_segments = list(segments)
# Diarize (requires HuggingFace token with pyannote access)
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_YOUR_TOKEN",
)
diarization_result = diarization("meeting.wav")
# Merge: assign speaker to each whisper segment
def get_speaker(start, end, diarization_result):
    """Find the dominant speaker during a time range."""
    speakers = {}
    for turn, _, speaker in diarization_result.itertracks(yield_label=True):
        overlap_start = max(start, turn.start)
        overlap_end = min(end, turn.end)
        if overlap_start < overlap_end:
            duration = overlap_end - overlap_start
            speakers[speaker] = speakers.get(speaker, 0) + duration
    return max(speakers, key=speakers.get) if speakers else "Unknown"
for seg in whisper_segments:
    speaker = get_speaker(seg.start, seg.end, diarization_result)
    print(f"[{speaker}] [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")
Step 6: Batch Processing & Pipelines
Transcribe all episodes in a directory:
from pathlib import Path
from faster_whisper import WhisperModel
model = WhisperModel("small", device="cpu", compute_type="int8")
episodes_dir = Path("episodes")
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)
for audio_file in sorted(episodes_dir.glob("*.mp3")):
    print(f"Transcribing: {audio_file.name}")
    segments, info = model.transcribe(str(audio_file), beam_size=5)
    segments = list(segments)  # materialize once; avoids transcribing twice for .txt and .srt
    # Plain text
    txt_path = output_dir / f"{audio_file.stem}.txt"
    with open(txt_path, "w") as f:
        for seg in segments:
            f.write(seg.text.strip() + "\n")
    # SRT
    srt_path = output_dir / f"{audio_file.stem}.srt"
    with open(srt_path, "w") as f:
        for i, seg in enumerate(segments, 1):
            h, rem = divmod(int(seg.start), 3600)
            m, s = divmod(rem, 60)
            ms = int((seg.start % 1) * 1000)
            start_ts = f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
            h2, rem2 = divmod(int(seg.end), 3600)
            m2, s2 = divmod(rem2, 60)
            ms2 = int((seg.end % 1) * 1000)
            end_ts = f"{h2:02d}:{m2:02d}:{s2:02d},{ms2:03d}"
            f.write(f"{i}\n{start_ts} --> {end_ts}\n{seg.text.strip()}\n\n")
    print(f"  → {txt_path}, {srt_path}")
Step 7: Model Selection Guide
| Model | Size | VRAM | Speed (CPU) | Accuracy | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | Very fast | Low | Quick drafts, real-time |
| base | 74M | ~1GB | Fast | Medium | Good balance for CPU |
| small | 244M | ~2GB | Moderate | Good | Podcasts, clear audio |
| medium | 769M | ~5GB | Slow | Very good | Noisy audio, accents |
| large-v3 | 1.5G | ~10GB | Very slow | Best | Production quality |
Recommendations:
- CPU-only, speed matters: `tiny` or `base` with faster-whisper
- CPU-only, accuracy matters: `small` with faster-whisper
- GPU available: `large-v3` with faster-whisper (float16)
- No local compute: OpenAI API (`whisper-1`)
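The recommendations above can be collapsed into a tiny chooser. This is an illustrative helper, not part of any Whisper API:

```python
def pick_model(has_gpu: bool, prioritize_accuracy: bool = True) -> str:
    """Map the hardware guidance above to a Whisper model name."""
    if has_gpu:
        return "large-v3"  # run with float16 on GPU for production quality
    # CPU-only: trade accuracy for speed
    return "small" if prioritize_accuracy else "base"
```

For example, `pick_model(has_gpu=False)` selects `small`, matching the CPU-accuracy row of the table.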
Examples
Example 1: Transcribe a podcast season and generate SRT subtitles
User prompt: "Transcribe all 20 MP3 episodes in ./episodes/ and generate both plain text transcripts and SRT subtitle files. I have a GPU with 10GB VRAM."
The agent will:
- Install faster-whisper via `pip install faster-whisper`.
- Load the `large-v3` model with `device="cuda"` and `compute_type="float16"` to leverage the GPU.
- Create a `./transcripts/` output directory.
- Loop over all `.mp3` files in `./episodes/`, transcribing each with `beam_size=5`.
- Write both a `.txt` file (plain text) and a `.srt` file (with properly formatted timestamps) for each episode.
- Report the detected language and total processing time per episode.
Example 2: Translate a German interview to English with speaker labels
User prompt: "I have a 45-minute German interview recording at meeting.wav with two speakers. Transcribe it in English and label who said what."
The agent will:
- Install faster-whisper and pyannote-audio (`pip install faster-whisper pyannote.audio`).
- Transcribe `meeting.wav` with `task="translate"` to get English text from the German audio.
- Run pyannote speaker diarization to identify speaker segments (requires a HuggingFace token with pyannote model access).
- Merge Whisper segments with diarization results by matching time ranges to assign speaker labels.
- Output a formatted transcript with `[Speaker 1]` and `[Speaker 2]` labels before each segment.
Guidelines
- Use faster-whisper over the original OpenAI Whisper for local transcription; it is 4x faster and uses less memory through INT8 quantization via CTranslate2.
- Select model size based on your hardware: `tiny`/`base` for CPU speed, `small` for CPU accuracy, `large-v3` for GPU production quality.
- Always convert audio to WAV or ensure ffmpeg is installed when working with MP3/M4A inputs; Whisper relies on ffmpeg for non-WAV format decoding.
- For speaker diarization, the pyannote pipeline requires a HuggingFace access token with accepted model terms; set this up before attempting multi-speaker transcription.
- When generating SRT files, use `beam_size=5` for more accurate segment boundaries; the default greedy decoding can produce poorly timed subtitle breaks.