How to Analyze a Video File with the Gemini API
Upload an MP4 to the Gemini API with the File API and ask the model to describe scenes, find moments, and pull timestamps.
Gemini models can take a video file as input and reason about what happens in it, including objects, actions, and on-screen text, with timestamps. This guide uploads a local MP4 using the File API and asks the model questions about it from Python. It is the building block for clip search, highlight reels, and automated tagging.
What you need
- A Gemini API key from aistudio.google.com/apikey
- Python 3.9 or newer
- A short video file (start under ~50 MB to keep uploads quick)
- The google-genai SDK installed
Step 1: Install the SDK and set your key
Install the official SDK with pip install google-genai, then export your API key as the GEMINI_API_KEY environment variable so it is not hard-coded in the script.
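If the variable is missing, reading it with os.environ["GEMINI_API_KEY"] fails with a bare KeyError, so it can help to check for the key up front. The sketch below is one way to do that. require_api_key is a hypothetical helper name, not part of the SDK, and it assumes the key is stored in GEMINI_API_KEY as in the rest of this guide.

```python
import os


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment or raise a readable error.

    Hypothetical helper for illustration; not part of the google-genai SDK.
    """
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return key
```

Calling require_api_key() at the top of the script and passing the result to genai.Client(api_key=...) keeps the failure message readable when the export step was skipped.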
Step 2: Upload the video with the File API
Videos go through the File API, which stores the file and returns a handle you pass to the model. After uploading you must poll until the file's state is ACTIVE, because Gemini processes the video before it can be used.
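A poll like this can wait forever if processing stalls. As a hedged sketch, the helper below adds an upper bound to the wait. wait_until_active and its parameters are hypothetical names, not part of the SDK; it accepts any fetch function that returns an object with a state.name attribute, so it can be tried without network access.

```python
import time


def wait_until_active(fetch, name, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll fetch(name) until the file leaves PROCESSING or the timeout passes.

    Hypothetical helper: `fetch` is any callable that returns an object with
    a `.state.name` attribute, for example
    `lambda n: client.files.get(name=n)`.
    """
    deadline = time.monotonic() + timeout
    current = fetch(name)
    while current.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name} still processing after {timeout} s")
        sleep(interval)
        current = fetch(name)
    if current.state.name == "FAILED":
        raise RuntimeError("Video processing failed")
    return current
```

The defaults of 300 seconds and a 5 second interval are illustrative assumptions, not documented limits. The script in this guide uses the simpler unbounded form of the same loop: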
import os
import time

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
video = client.files.upload(file="clip.mp4")

# Wait until processing finishes
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

if video.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

Step 3: Ask a question about the video
Pass the uploaded file handle and a text prompt to generate_content. Ask for timestamps explicitly if you want them, because the model can reference points in the video by time.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        video,
        "List the key scenes in this video with a timestamp "
        "(mm:ss) and a one-line description for each.",
    ],
)
print(response.text)

Result
The script prints a scene-by-scene breakdown with timestamps, for example "[00:12] presenter opens the laptop, [00:34] slide shows the pricing table." From here you can swap the prompt to extract on-screen text, count occurrences of an object, or find the exact moment a logo appears.
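For clip search or tagging, the printed text usually needs to become structured data. A minimal sketch follows, assuming the model keeps to the bracketed [mm:ss] format shown in the example above. Model output is free text and may not always follow that format. parse_scenes is a hypothetical helper name, not part of the SDK.

```python
import re

# Matches "[mm:ss] description", capturing text up to the next "[" or the end.
SCENE = re.compile(r"\[(\d{1,3}):([0-5]\d)\]\s*([^\[]*)")


def parse_scenes(text):
    """Return a list of (offset_in_seconds, description) tuples.

    Hypothetical helper; assumes bracketed mm:ss timestamps in the text.
    """
    scenes = []
    for minutes, seconds, description in SCENE.findall(text):
        offset = int(minutes) * 60 + int(seconds)
        # Trim separators the model may leave between entries.
        scenes.append((offset, description.strip(" \t\r\n,.;")))
    return scenes


sample = "[00:12] presenter opens the laptop, [00:34] slide shows the pricing table."
print(parse_scenes(sample))
# → [(12, 'presenter opens the laptop'), (34, 'slide shows the pricing table')]
```

Text without any bracketed timestamp yields an empty list, which is a convenient signal to re-prompt with a stricter format request.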