Gemini · Intermediate

How to Extract Data from Images and Screenshots with Gemini

Hand Gemini a photo of a receipt, whiteboard, or table and get clean structured data back as JSON or a spreadsheet.

8 min

Gemini does more than transcribe the text in an image: it can also recognize the structure of a receipt, a handwritten note, or a screenshot of a table. This guide shows how to turn a photo into structured data you can drop into a database or sheet, both in the chat app and via the API.

What you need

  • A photo or screenshot containing the data (receipt, table, form)
  • A Gemini account, or an API key for the programmatic route
  • A target format in mind: JSON, CSV, or a table

Step 1: Attach the image and ask for structure

In the chat app, attach the image with the plus icon, then ask for the data in a specific shape. The key is naming the fields you want, so Gemini does not guess at the schema.

Gemini - receipt to JSON
You
[receipt.jpg] Extract merchant, date, total, and each line item as JSON.
Agent
{ "merchant": "Cafe Roma", "date": "2026-06-18", "total": 18.40, "items": [...] }
Attach the photo and request named fields.
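
If you send the same kind of request often, it can help to generate the prompt from a list of field names so the wording stays consistent. The helper below is a minimal sketch: `build_extraction_prompt` and its wording are illustrative assumptions, not part of Gemini or its SDK.

```python
def build_extraction_prompt(fields, output_format="JSON"):
    """Build a prompt that names every field the reply must contain."""
    if not fields:
        raise ValueError("Name at least one field to extract.")
    field_list = ", ".join(fields)
    return (
        f"Extract these fields from the attached image as {output_format}: "
        f"{field_list}. Use exactly these key names, and use null for any "
        "field you cannot read."
    )


# Hypothetical usage: the same fields as the receipt example above.
prompt = build_extraction_prompt(["merchant", "date", "total", "items"])
print(prompt)
```

Paste the printed text into the chat alongside the image, or adapt the field list for forms and tables.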

Step 2: Enforce a strict JSON schema (API)

For automation you want JSON every time, not prose. The API supports a response schema that constrains the model to return JSON in the structure you define, so your code can parse the reply directly instead of stripping out surrounding text.

extract.py
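import os

from google import genai
from google.genai import types

# Read the API key from the environment rather than hard-coding it.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Upload the image through the Files API so the request can reference it.
img = client.files.upload(file="receipt.jpg")

# Fields the response must contain. The date comes back as a plain string.
schema = {
    "type": "object",
    "properties": {
        "merchant": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["merchant", "date", "total"],
}

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[img, "Extract the receipt details."],
    config=types.GenerateContentConfig(
        # Ask for JSON and constrain it to the schema above.
        response_mime_type="application/json",
        response_schema=schema,
    ),
)

# resp.text holds the JSON document as a string.
print(resp.text)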
Terminal - clean JSON out
$ python extract.py
{
  "merchant": "Cafe Roma",
  "date": "2026-06-18",
  "total": 18.4
}
With a schema set, the response is JSON in the declared shape. The values themselves can still be misread.
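
The schema fixes the shape of the output, but it cannot say that a date is a real calendar date or that a total is plausible. A minimal sketch of a post-parse check follows, assuming the model returns ISO dates as in the sample output; `parse_receipt` is a hypothetical helper, and `raw` stands in for `resp.text`.

```python
import json
from datetime import date

REQUIRED = {"merchant", "date", "total"}


def parse_receipt(raw):
    """Parse the JSON string and check values the schema cannot express."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The schema only says "string", so confirm the date is ISO formatted.
    # Assumption: dates arrive as YYYY-MM-DD, as in the sample output.
    data["date"] = date.fromisoformat(data["date"])
    total = data["total"]
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        raise ValueError(f"implausible total: {total!r}")
    return data


# Stand-in for resp.text from extract.py.
raw = '{"merchant": "Cafe Roma", "date": "2026-06-18", "total": 18.4}'
receipt = parse_receipt(raw)
print(receipt["merchant"], receipt["date"].isoformat(), receipt["total"])
```

Both checks raise `ValueError`, so one `except` clause can route a bad extraction to manual review.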

Step 3: Handle multi-row tables

For a screenshot of a table, ask for an array of row objects and name every column. Tell Gemini to leave a field empty rather than invent a value when a cell is blurry or cut off.

prompt.txt
Read this table screenshot. Return an array of objects with keys:
name, role, email. If a cell is unreadable, use an empty string,
never guess.
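
On the API route, the same request can be expressed as an array schema, and the returned rows can be written out as CSV. This is a sketch under assumptions: `table_schema` is meant to be passed as `response_schema` the same way as in extract.py, the helper names are hypothetical, and `sample_rows` is made-up data standing in for a parsed response.

```python
import csv
import io

COLUMNS = ["name", "role", "email"]

# Assumed to be passed as response_schema, as in extract.py.
table_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {col: {"type": "string"} for col in COLUMNS},
        "required": COLUMNS,
    },
}


def rows_to_csv(rows):
    """Serialize row dicts to CSV text, keeping only the named columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
    return buf.getvalue()


def unreadable_cells(rows):
    """List (row index, column) pairs that came back empty."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in COLUMNS
        if not row.get(col)
    ]


# Made-up rows standing in for json.loads(resp.text).
sample_rows = [
    {"name": "Ana Ruiz", "role": "Engineer", "email": "ana@example.com"},
    {"name": "Ben Ito", "role": "", "email": "ben@example.com"},
]
print(rows_to_csv(sample_rows))
print(unreadable_cells(sample_rows))
```

Listing the empty cells gives you a short checklist of spots to read off the original screenshot by hand.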
Always verify numbers
Vision models can misread similar digits, like a 3 versus an 8, especially on low-resolution or angled photos. For financial data, spot-check totals against the original image before trusting them.
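
One cheap automated check is to compare the extracted total with the sum of the line items. The sketch below assumes each item carries `price` and optional `quantity` keys, which are hypothetical names since the guide does not define an item schema; the sample items are invented for illustration.

```python
from decimal import Decimal


def total_matches(items, total, tolerance="0.01"):
    """Return True if the line items add up to the stated total."""
    line_sum = sum(
        (Decimal(str(item["price"])) * item.get("quantity", 1) for item in items),
        Decimal("0"),
    )
    return abs(line_sum - Decimal(str(total))) <= Decimal(tolerance)


# Invented line items: 2 x 3.50 + 1 x 11.40 = 18.40
items = [
    {"price": 3.50, "quantity": 2},
    {"price": 11.40},
]
print(total_matches(items, 18.40))
print(total_matches(items, 13.40))  # an 8 misread as a 3
```

Receipts that add tax, tips, or discounts will not pass this check as written, so treat a mismatch as a prompt to look at the image rather than as proof of an error.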
Better photos, better results
Crop to just the relevant area, shoot flat-on with good light, and avoid shadows across the text. A clean image cuts extraction errors more than any prompt tweak.

Result

You get structured, parseable data from a plain photo. With response_schema in the API, the output has a predictable shape that you can load into a spreadsheet, an invoice tool, or a database insert without reformatting. Still spot-check the values that matter, such as totals, against the original image.
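
As one illustration of the database route, the parsed object can be passed directly as named parameters to an insert. This is a sketch using Python's built-in sqlite3 module with an in-memory database; the `receipts` table and its columns are assumptions chosen to mirror the schema fields.

```python
import json
import sqlite3

# Stand-in for resp.text from extract.py.
raw = '{"merchant": "Cafe Roma", "date": "2026-06-18", "total": 18.4}'
receipt = json.loads(raw)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts ("
    "merchant TEXT NOT NULL, date TEXT NOT NULL, total REAL NOT NULL)"
)
# Named placeholders map onto the JSON keys, so no manual unpacking is needed.
conn.execute(
    "INSERT INTO receipts (merchant, date, total) "
    "VALUES (:merchant, :date, :total)",
    receipt,
)
conn.commit()

stored = conn.execute("SELECT merchant, date, total FROM receipts").fetchall()
print(stored)
```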

Tags
#vision #ocr #json #documents