How to Extract Structured Data from PDFs with AI in n8n

Read a PDF, pass its text to an LLM with an output schema, and turn unstructured invoices or forms into clean rows in a spreadsheet.

9 min · Intermediate

Invoices, receipts, and forms arrive as PDFs that someone has to retype by hand. With n8n you can read the text, ask an LLM to return the fields you care about as structured JSON, and append the result to a sheet. The Information Extractor node handles the schema: you list the fields once, and every run returns them in the same shape.

What you need

  • A running n8n instance with an OpenAI credential
  • A sample PDF such as an invoice (text-based, not a scanned image)
  • A Google Sheets or database destination

Step 1: Get the PDF into the workflow

Use a trigger that brings in the file: a Gmail Trigger with attachments, a webhook with a binary upload, or a Read Files from Disk node for local testing. The PDF should land as binary data on the item.
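Attachments and uploads are not always what their filename claims, so it can help to confirm that the binary really is a PDF before extracting text. The following is a minimal standalone Python sketch, not n8n node code. The function name looks_like_pdf and the 1024-byte search window are assumptions of this example; it relies only on the fact that PDF files begin with the marker %PDF-.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes appear to start with a PDF header."""
    # PDF files begin with "%PDF-" followed by a version such as 1.7.
    # Some producers prepend a few stray bytes, so this sketch searches
    # a small leading window (an assumption) instead of only offset 0.
    return b"%PDF-" in data[:1024]
```

A check like this only catches mislabeled files. It says nothing about whether the PDF contains embedded text.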

Step 2: Convert the PDF to text

Add an Extract from File node and set its operation to Extract from PDF. Point it at the binary property that holds the file (usually data). The node returns the document's text in a text field, which is what the model will read.

n8n - Extract from File

  • Operation: Extract from PDF
  • Input: Binary data
  • Output: text
  • Execute step: 1,842 characters extracted

Figure: Converting the PDF binary into plain text for the model.

Step 3: Pull fields with the Information Extractor

Add an Information Extractor node and connect an OpenAI Chat Model to it. Define the attributes you want. The node enforces the schema so each run returns the same fields in the same shape.

Information Extractor - attributes

Name             Type     Description
invoice_number   string   The invoice or document number
vendor_name      string   The company that issued the invoice
total_amount     number   The grand total including tax
due_date         string   ISO date the payment is due

Text to analyze: {{ $json.text }}
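The same four attributes can also be written as a JSON Schema, which is handy for documenting the expected shape or for validating results outside n8n. This is a sketch in Python under stated assumptions: the name INVOICE_SCHEMA, the helper schema_json, and the decision to mark every field as required are choices made for this example, not something the node prescribes.

```python
import json

# Hypothetical schema mirroring the four attributes listed above.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice or document number"},
        "vendor_name": {"type": "string", "description": "The company that issued the invoice"},
        "total_amount": {"type": "number", "description": "The grand total including tax"},
        "due_date": {"type": "string", "description": "ISO date the payment is due"},
    },
    # Assumption: all four fields are required. Relax this if some
    # documents legitimately lack a field such as due_date.
    "required": ["invoice_number", "vendor_name", "total_amount", "due_date"],
}


def schema_json() -> str:
    """Serialize the schema as indented JSON text."""
    return json.dumps(INVOICE_SCHEMA, indent=2)
```

Calling print(schema_json()) produces JSON text that can be kept alongside the workflow as a record of the expected output.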

Note: Scanned PDFs need OCR first
Extract from File only reads embedded text. If your PDF is a scanned image, run it through an OCR step or an OCR-capable API before the Information Extractor. Otherwise the extracted text is an empty string and the model has nothing to read.
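One way to catch scanned PDFs early is to check how much text the extraction step actually produced. The sketch below is standalone Python rather than n8n node code; the function name needs_ocr and the 20-character threshold are assumptions chosen for illustration, so tune the threshold against your own documents.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Guess whether extraction produced too little text to be useful."""
    # Count only non-whitespace characters so that a page of blank
    # lines is treated the same as an empty string.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

Inside n8n, a comparable check could probably be done with an If node on the length of the text field, routing short results to an OCR branch.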

Step 4: Write the rows to a sheet

Add a Google Sheets node in Append mode and map each extracted attribute to a column. Run the workflow on your sample, then check that the values landed in the right cells.
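Before trusting the appended rows, it can be worth validating each extracted item and fixing the column order explicitly. The following Python sketch illustrates the idea outside n8n. The names COLUMNS and to_row are hypothetical, and the column order is an assumption that should match your sheet's header row.

```python
from datetime import date

# Assumed column order; match it to the header row of your sheet.
COLUMNS = ["invoice_number", "vendor_name", "total_amount", "due_date"]


def to_row(extracted: dict) -> list:
    """Validate one extracted item and return its values in column order."""
    missing = [name for name in COLUMNS if extracted.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    total = extracted["total_amount"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total_amount must be a number")
    # date.fromisoformat raises ValueError for a malformed date string.
    date.fromisoformat(extracted["due_date"])
    return [extracted[name] for name in COLUMNS]
```

An item that fails validation raises ValueError, which makes it easy to route problem documents to manual review instead of writing a bad row.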

Result

PDFs now flow in and structured rows come out: invoice number, vendor, total, and due date, ready for accounting. Point the trigger at a real inbox and the retyping disappears.

Tags
#n8n #pdf #extraction #structured-output