How to Extract Structured Data from PDFs with AI in n8n
Read a PDF, pass its text to an LLM with an output schema, and turn unstructured invoices or forms into clean rows in a spreadsheet.
Invoices, receipts, and forms arrive as PDFs that humans have to retype. With n8n you can read the text, ask an LLM to pull the fields you care about as structured JSON, and append the result to a sheet. The Information Extractor node makes the schema part painless.
What you need
- A running n8n instance with an OpenAI credential
- A sample PDF such as an invoice (text-based, not a scanned image)
- A Google Sheets or database destination
Step 1: Get the PDF into the workflow
Use a trigger that brings in the file: a Gmail Trigger with attachments, a webhook with a binary upload, or a Read Files from Disk node for local testing. The PDF should land as binary data on the item.
Step 2: Convert the PDF to text
Add the Extract from File node set to Extract from PDF. Point it at the binary property (usually data) and it returns the document's text, which is what the model will read.
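Because this step only works on text-based PDFs, it can help to check that some text actually came back before calling the model. The sketch below is one way to express that check. The function name and the 50-character threshold are illustrative assumptions, not n8n settings; inside n8n the same idea could live in an IF node or a Code node.

```python
def looks_text_based(text: str, min_chars: int = 50) -> bool:
    """Return True when the extracted text has enough visible characters
    to be worth sending to the model.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    # Drop all whitespace so a page of blank lines does not count as text.
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

A scanned invoice typically yields an empty or near-empty string here, which is a signal to route the file to OCR instead of the extractor.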
Step 3: Pull fields with the Information Extractor
Add an Information Extractor node and connect an OpenAI Chat Model to it. Define the attributes you want. The node enforces the schema so each run returns the same fields in the same shape.
- invoice_number (string): The invoice or document number
- vendor_name (string): The company that issued the invoice
- total_amount (number): The grand total including tax
- due_date (string): ISO date the payment is due
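The attribute list above amounts to a small schema. As a rough sketch of what "same fields in the same shape" means, the following check compares one extracted record against those four attributes. The function name, the error messages, and the sample values in the usage note are hypothetical and not part of n8n.

```python
from datetime import date

# Field names and types taken from the attribute list in this step.
EXPECTED_FIELDS = {
    "invoice_number": str,
    "vendor_name": str,
    "total_amount": (int, float),
    "due_date": str,
}


def validate_invoice(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in fields:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(fields[name], bool) or not isinstance(fields[name], expected_type):
            problems.append(f"wrong type for {name}")

    # due_date is described as an ISO date, so try to parse it as one.
    due = fields.get("due_date")
    if isinstance(due, str):
        try:
            date.fromisoformat(due)
        except ValueError:
            problems.append("due_date is not an ISO date")
    return problems
```

For example, a made-up record such as `{"invoice_number": "INV-001", "vendor_name": "Acme Ltd", "total_amount": 1234.5, "due_date": "2024-01-31"}` passes with no problems, while a total returned as the string "1,234.50" is flagged as a wrong type.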
Text to analyze: {{ $json.text }}
Step 4: Write the rows to a sheet
Add a Google Sheets node in Append mode and map each extracted attribute to a column. Run the workflow on your sample, then check that the values landed in the right cells.
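The column mapping can be pictured as flattening each extracted record into one row of cells in a fixed order. The sketch below assumes the extracted attributes sit under an `output` key on each item, which is worth confirming in the node's output panel, and it falls back to the item itself if that key is absent. The column order and the function name are illustrative choices.

```python
# Column order for the sheet; an assumption, adjust to match your header row.
COLUMNS = ["invoice_number", "vendor_name", "total_amount", "due_date"]


def to_row(item: dict) -> list:
    """Flatten one extractor item into cell values in column order."""
    # Assumed shape: {"output": {...fields...}}; fall back to a flat item.
    fields = item.get("output", item)
    # Missing attributes become empty cells so columns stay aligned.
    return [fields.get(column, "") for column in COLUMNS]
```

Leaving a missing attribute as an empty cell rather than dropping it keeps every row the same width, so a partial extraction never shifts values into the wrong column.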
Result
PDFs now flow in and structured rows come out: invoice number, vendor, total, and due date, ready for accounting. Point the trigger at a real inbox and the retyping disappears.