How to Auto-Generate Captions with Whisper and Upload an SRT

Transcribe your video locally with Whisper, produce a clean SRT, then attach it to the video through the captions endpoint.

11 min · Intermediate

YouTube auto-captions are hit or miss, especially with names and jargon. This guide transcribes your audio with OpenAI Whisper to get a clean SRT, then uploads that caption track to your video so viewers get accurate subtitles.

What you need

ffmpeg installed (Whisper uses it to read media)
Python 3.9+ and the openai-whisper package
Your client_secret.json and the video id to caption

Step 1: Install Whisper and ffmpeg

zsh — captions

$ brew install ffmpeg
$ pip install openai-whisper
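
Before moving on, it can help to confirm that both pieces are visible from Python. The following is a minimal sketch using only the standard library; the function name check_prerequisites is made up for this example and is not part of Whisper.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether ffmpeg is on PATH and the whisper module is importable."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the top-level module is not installed.
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for tool, ok in check_prerequisites().items():
        print(f"{tool}: {'found' if ok else 'MISSING'}")
```

If either entry reports MISSING, rerun the matching install command in the same shell and Python environment you plan to use for transcription.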

Step 2: Transcribe to SRT

By default the Whisper command line writes several output formats. Passing --output_format srt limits it to the SRT file, which is a format YouTube accepts. The small model is a good balance of speed and accuracy for spoken English.

zsh — captions

$ whisper final.mp4 --model small --output_format srt

[00:00.000 --> 00:04.120] Welcome back to the channel.

Writing SRT: final.srt

final.srt — caption file

1
00:00:00,000 --> 00:00:04,120
Welcome back to the channel.

2
00:00:04,120 --> 00:00:08,400
Today we are fixing a wobbly chair.

Whisper produces numbered cues with start and end times.
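
To make the cue layout concrete, here is a rough sketch of how timed segments map onto SRT text. It assumes segment dictionaries with "start", "end" (in seconds) and "text" keys, which is the shape of the segments list in the result of Whisper's Python transcribe() call. The helper names srt_timestamp and segments_to_srt are illustrative, not Whisper functions.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with "start", "end" and "text" keys."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        # Each cue is: index line, timing line, text line.
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 4.12, "text": " Welcome back to the channel."},
        {"start": 4.12, "end": 8.4, "text": " Today we are fixing a wobbly chair."},
    ]
    print(segments_to_srt(demo), end="")
```

Run on the two demo segments, this prints the same two cues as final.srt above. The point to notice is that SRT timing lines use a comma before the milliseconds, unlike the dotted timestamps Whisper prints in the terminal.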

Step 3: Upload the caption track

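Use the captions.insert endpoint of the YouTube Data API, passing the video id in the request body and the SRT file as the media body. Reuse the service() helper from the upload guide for authentication. captions.insert requires the https://www.googleapis.com/auth/youtube.force-ssl scope, so confirm the helper requests it.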

add_captions.py

from googleapiclient.http import MediaFileUpload
from upload import service  # reuse the auth helper from the upload guide

def add_caption(video_id, srt_path, language="en"):
    """Upload srt_path as a published caption track on video_id."""
    body = {
        "snippet": {
            "videoId": video_id,
            "language": language,
            "name": "English (Whisper)",  # label shown in the CC menu
            "isDraft": False,  # not a draft, so the track is visible to viewers
        }
    }
    # The SRT file travels as the media body of the insert request.
    media = MediaFileUpload(srt_path, mimetype="application/octet-stream")
    res = service().captions().insert(part="snippet", body=body, media_body=media).execute()
    print("Caption track id:", res["id"])

if __name__ == "__main__":
    add_caption("dQw4w9WgXcQ", "final.srt")
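
To confirm the track landed, one option is to read the video's caption tracks back with captions.list. This is a sketch, not part of the original script: the function name list_caption_tracks is made up here, and the youtube argument is assumed to be the authorized client that service() returns.

```python
def list_caption_tracks(youtube, video_id):
    """Return (id, language, name) for each caption track on the video."""
    response = youtube.captions().list(part="snippet", videoId=video_id).execute()
    return [
        (item["id"], item["snippet"]["language"], item["snippet"]["name"])
        for item in response.get("items", [])
    ]
```

Calling list_caption_tracks(service(), "dQw4w9WgXcQ") after the upload should include an entry whose name is English (Whisper). Passing the client in as an argument, rather than calling service() inside, keeps the function easy to exercise with a stub.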

Proofread the proper nouns

Whisper is strong but still mangles brand names and people's names. Skim the SRT and fix them before uploading, since captions also feed search indexing.
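
If the same names keep coming out wrong, a small find-and-replace pass can handle the repeat offenders before the manual skim. The sketch below is one possible approach: the corrections mapping is hypothetical, and the function only touches caption text, leaving cue numbers and timing lines alone.

```python
import re

# Matches SRT timing lines such as "00:00:00,000 --> 00:00:04,120".
TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_names(srt_text, corrections):
    """Apply case-insensitive whole-word replacements to caption text only.

    corrections maps a wrong spelling to the right one. Lines made up solely
    of digits are treated as cue numbers and skipped.
    """
    fixed = []
    for line in srt_text.splitlines():
        is_structure = line.strip().isdigit() or TIMING_LINE.match(line)
        if not is_structure:
            for wrong, right in corrections.items():
                pattern = r"\b" + re.escape(wrong) + r"\b"
                # A function replacement keeps backslashes in `right` literal.
                line = re.sub(pattern, lambda _m, r=right: r, line, flags=re.IGNORECASE)
        fixed.append(line)
    return "\n".join(fixed) + "\n"
```

A typical use is to read final.srt, call fix_names(text, {"you tube": "YouTube"}) with your own list of misheard names, and write the result back. It does not replace the skim, because it only catches the misspellings you already know about.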

Result

The video now shows an English (Whisper) caption track in the CC menu. To add more languages, translate the SRT and run the same flow once per language, changing the language code and the track name.
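
As a sketch of the per-language flow, the request body can be built once per translated file. Everything named here is an assumption for illustration: the TRACK_NAMES labels, the caption_body and plan_uploads helpers, and file names such as final.es.srt.

```python
# Hypothetical display names per language code; extend as needed.
TRACK_NAMES = {
    "en": "English (Whisper)",
    "es": "Spanish (translated)",
    "de": "German (translated)",
}


def caption_body(video_id, language, name):
    """Build the captions.insert request body for one language track."""
    return {
        "snippet": {
            "videoId": video_id,
            "language": language,
            "name": name,
            "isDraft": False,
        }
    }


def plan_uploads(video_id, srt_files):
    """Pair each SRT path with its request body.

    srt_files maps a language code to a file path, e.g. {"es": "final.es.srt"}.
    Unknown language codes fall back to the code itself as the track name.
    """
    return [
        (path, caption_body(video_id, lang, TRACK_NAMES.get(lang, lang)))
        for lang, path in srt_files.items()
    ]
```

Each (path, body) pair can then be sent with the same captions().insert call used in add_captions.py, passing the path to MediaFileUpload and the body as the request body.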