
How to Analyze a Video File with the Gemini API

Upload an MP4 to the Gemini API with the File API and ask the model to describe scenes, find moments, and pull timestamps.

11 min · Intermediate

Gemini models can take a video file as input and reason about what happens in it, including objects, actions, and on-screen text, with timestamps. This guide uploads a local MP4 using the File API and asks the model questions about it from Python. It is the building block for clip search, highlight reels, and automated tagging.

What you need

A Gemini API key from aistudio.google.com/apikey
Python 3.9 or newer
A short video file (start under ~50 MB to keep uploads quick)
The google-genai SDK installed

Step 1: Install the SDK and set your key

Install the official SDK and export your API key as an environment variable so it is not hard-coded in the script.

zsh - project

$ pip install google-genai

Successfully installed google-genai

$ export GEMINI_API_KEY="your_key_here"
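
If the variable is missing, os.environ["GEMINI_API_KEY"] raises a bare KeyError. One option is to fail early with a readable message instead. The following is a minimal sketch, not part of the SDK: load_api_key is a hypothetical helper name, and the env parameter exists only so the function can be exercised without touching the real environment.

```python
import os


def load_api_key(env=None, var_name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    `load_api_key` is a hypothetical helper for this guide, not an SDK call.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()  # tolerate stray whitespace from copy/paste
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script"
        )
    return key
```

The result can then be passed as api_key when creating the client in the next step.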

Step 2: Upload the video with the File API

Videos go through the File API, which stores the file and returns a handle you pass to the model. After uploading, poll until the file's state is ACTIVE, because Gemini has to process the video before the model can use it. If the state ends up as FAILED instead, processing did not succeed and the handle cannot be used.

analyze_video.py

import os
import time
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

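# Upload the local file; the returned object is the handle passed to the model
print("uploading clip.mp4 ...", end=" ", flush=True)
video = client.files.upload(file="clip.mp4")
print("done")

# Wait until processing finishes, re-checking the state every 5 seconds
while video.state.name == "PROCESSING":
    print("state: PROCESSING")
    time.sleep(5)
    video = client.files.get(name=video.name)
print(f"state: {video.state.name}")

# Stop here if the service could not process the video
if video.state.name == "FAILED":
    raise RuntimeError("Video processing failed")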

Terminal - upload progress

$ python analyze_video.py

uploading clip.mp4 ... done

state: PROCESSING

state: ACTIVE

Poll the file state until it flips from PROCESSING to ACTIVE.
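
The loop above waits indefinitely if a file never leaves PROCESSING. A bounded variant might look like the sketch below. It makes some assumptions: wait_until_active is a hypothetical helper, the 300-second timeout is an arbitrary default, and the lookup function is passed in (for example client.files.get) so the logic can be tried without a network call.

```python
import time


def wait_until_active(get_file, name, poll_seconds=5.0, timeout_seconds=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_file(name=...) until the file leaves PROCESSING.

    Hypothetical helper: raises TimeoutError if the deadline passes first,
    and RuntimeError if processing ends in any state other than ACTIVE.
    """
    deadline = clock() + timeout_seconds
    file = get_file(name=name)
    while file.state.name == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(
                f"{name} still PROCESSING after {timeout_seconds} s"
            )
        sleep(poll_seconds)
        file = get_file(name=name)
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"Video processing ended in state {file.state.name}")
    return file
```

With the client from this step, a call would read video = wait_until_active(client.files.get, video.name).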

Step 3: Ask a question about the video

Pass the uploaded file handle and a text prompt to generate_content. Ask for timestamps explicitly if you want them, because the model can reference points in the video by time.

analyze_video.py

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        video,
        "List the key scenes in this video with a timestamp "
        "(mm:ss) and a one-line description for each.",
    ],
)
print(response.text)
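
Because the uploaded handle can be reused, several questions can be asked about the same video without uploading it again. The sketch below wraps the generate_content call shown above in two hypothetical helpers, ask_about_video and ask_many. The client is passed in as a parameter, and the default model name is simply the one used in this guide.

```python
def ask_about_video(client, video, question, model="gemini-2.5-flash"):
    """Send one question about an already-uploaded video and return the reply text.

    Hypothetical wrapper around the generate_content call used in this guide.
    """
    response = client.models.generate_content(
        model=model,
        contents=[video, question],
    )
    return response.text


def ask_many(client, video, questions, model="gemini-2.5-flash"):
    """Ask each question against the same file handle; returns {question: answer}."""
    return {q: ask_about_video(client, video, q, model=model) for q in questions}
```

Each question is a separate request, so the cost of analyzing the video is paid once per question, not once per upload.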

Sampling rate matters

Gemini samples video at roughly one frame per second by default, so on-screen text that is visible for less than a second, or a rapid series of cuts, can fall between sampled frames and be missed. For dense footage, slice the video into shorter segments and analyze each one.
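
To plan the segments, it can help to compute the cut points first and then hand them to whatever tool does the actual slicing (the cutting itself is not shown here). This is a rough sketch under stated assumptions: segment_windows is a hypothetical helper, and the 60-second segment length and 2-second overlap are arbitrary defaults, with the overlap meant to avoid splitting an event that lands exactly on a cut.

```python
def segment_windows(duration_s, segment_s=60.0, overlap_s=2.0):
    """Return (start, end) pairs in seconds that together cover [0, duration_s].

    Hypothetical planning helper; consecutive windows share `overlap_s` seconds.
    """
    if duration_s <= 0:
        return []
    if segment_s <= 0 or not 0 <= overlap_s < segment_s:
        raise ValueError("need segment_s > 0 and 0 <= overlap_s < segment_s")
    windows = []
    start = 0.0
    step = segment_s - overlap_s  # how far each new window advances
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the video
        start += step
    return windows
```

Each window can then be exported as its own clip, uploaded, and analyzed with the same prompt. Timestamps in the answers are relative to the clip, so add the window's start to map them back to the full video.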

Heads up

Files uploaded via the File API are deleted automatically after 48 hours. Re-upload if you come back to a script later, and do not rely on the file handle persisting.
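
If a script stores the upload time alongside the file name, it can decide locally whether a re-upload is due. The sketch below is one way to do that, assuming the 48-hour window stated above. needs_reupload and the 30-minute safety margin are illustrative choices, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone

FILE_TTL = timedelta(hours=48)  # retention window stated in this guide


def needs_reupload(uploaded_at, now=None, safety_margin=timedelta(minutes=30)):
    """Return True if the file is expired or about to expire.

    `uploaded_at` and `now` are timezone-aware datetimes. The margin is an
    assumption so a long-running request does not start just before expiry.
    """
    now = datetime.now(timezone.utc) if now is None else now
    return now >= uploaded_at + FILE_TTL - safety_margin
```

When it returns True, upload the file again and wait for ACTIVE as in Step 2 before asking questions.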

Result

The script prints a scene-by-scene breakdown with timestamps, for example "[00:12] presenter opens the laptop, [00:34] slide shows the pricing table." From here you can swap the prompt to extract on-screen text, count occurrences of an object, or find the exact moment a logo appears.
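
For clip search or tagging, the printed text usually needs to become data. The sketch below parses lines in the "[mm:ss] description" shape shown in the example above into (seconds, description) pairs. It assumes the model followed that format, which is not guaranteed, so treat an empty result as a signal to inspect the raw text. parse_scenes is a hypothetical helper.

```python
import re

# Matches "[mm:ss]" followed by text up to the next "[" (or the end of input)
_SCENE = re.compile(r"\[(\d{1,2}):(\d{2})\]\s*([^\[]*)")


def parse_scenes(text):
    """Extract (seconds, description) pairs from '[mm:ss] description' text."""
    scenes = []
    for minutes, seconds, description in _SCENE.findall(text):
        # Drop separators left over between entries, such as a trailing comma
        cleaned = description.strip().rstrip(",.").strip()
        scenes.append((int(minutes) * 60 + int(seconds), cleaned))
    return scenes


example = "[00:12] presenter opens the laptop, [00:34] slide shows the pricing table."
print(parse_scenes(example))
# [(12, 'presenter opens the laptop'), (34, 'slide shows the pricing table')]
```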