How to Build an AI Agent for Invoice Processing

A complete, technical build for an accounts payable agent that reads every invoice, matches it to the purchase order and goods receipt, applies your rules, and posts it to your accounting system, escalating to a human only on exceptions. This covers the architecture, real code for each stage, three-way matching, exception handling, security, the numbers, and how to roll it out without it breaking.

Why manual and rule-based AP breaks

Accounts payable looks simple and is not. Every vendor formats invoices differently, totals and taxes need checking, each invoice must be matched to a purchase order and to what actually arrived, approvals get chased over email, and only then is it keyed into the accounting system. The result is slow and error-prone. Industry benchmarks put manual processing at roughly $15 to $40 per invoice (Ardent Partners), while top-quartile automated teams run around $10 or below (APQC), and manual AP teams capture only 20 to 30 percent of available early-payment discounts (Ardent Partners).

Template and RPA tools help until a vendor changes their layout, then they break. An agent is different: it reads any layout, checks the numbers against your records, decides under rules you set, and asks a human only when something genuinely does not add up.

Agent vs fixed pipeline: when each is the honest choice

An agent follows an observe, decide, act loop: it reads the invoice, checks it against your data, decides what to do under your policy, and takes the action (post, or escalate). A fixed pipeline runs the same steps every time with no branching. Think of it as eyes (OCR), a brain (the model plus your rules), and hands (the ERP and email tools it can call).

Be honest about which you need. If your invoices are uniform and your rules are simple, a straight pipeline is cheaper and easier to trust. The agent earns its extra complexity when invoices vary a lot, exceptions are common, and the right next step genuinely depends on what the document says. Most real AP sits in that second case, which is why this guide builds an agent with a deterministic matching core.

The architecture

Every invoice flows through one loop driven by a state machine: ingested, extracted, validated, matched, then either auto-approved and posted, or sent to a human as an exception. Corrections feed back into the agent's memory, and guardrails wrap the whole thing.

Diagram: the agent loop

  1 to 2. Ingest and classify: inbox watch, dedup, sort.
  3. Extract: OCR + LLM to schema.
  4. Validate and match: three-way (invoice, PO, receipt).
  5. Decision policy: within rules? Yes: auto-approve, go to 7. No: go to 6.
  6. Human review: exceptions only.
  7. Post to ERP: via API, with audit trail.
  8. Learn and monitor: retrain on corrections (feedback loop).

Guardrails around every step: dedup, fraud checks, audit log, accuracy checks.
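
The loop above can be sketched as a small state machine with an explicit table of allowed moves. This is a minimal illustration, not the guide's own implementation: the state names follow the ones listed in this section, while the `TRANSITIONS` table, the `advance` helper, and the choice to let an exception return to extraction for a retry are assumptions made for the example.

```python
from enum import Enum


class State(str, Enum):
    INGESTED = "ingested"
    EXTRACTED = "extracted"
    VALIDATED = "validated"
    MATCHED = "matched"
    APPROVED = "approved"
    EXCEPTION = "exception"
    POSTED = "posted"


# Allowed moves (assumed for this sketch); anything else should fail loudly.
TRANSITIONS = {
    State.INGESTED: {State.EXTRACTED},
    State.EXTRACTED: {State.VALIDATED, State.EXCEPTION},
    State.VALIDATED: {State.MATCHED, State.EXCEPTION},
    State.MATCHED: {State.APPROVED, State.EXCEPTION},
    # A human clears the exception, or the agent re-extracts first.
    State.EXCEPTION: {State.APPROVED, State.EXTRACTED},
    State.APPROVED: {State.POSTED},
    State.POSTED: set(),
}


def advance(current: State, target: State) -> State:
    """Return target if the move is allowed, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the transitions in one table means an invoice can never jump from ingested straight to posted, and the table itself doubles as documentation for auditors.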

Before you build: prerequisites

Inputs

Digital and scanned PDFs, images, multi-page invoices, and credit and debit notes. Know your messiest formats up front, because they decide your OCR choice.

Systems to connect

Your accounting or ERP (Tally, Zoho, QuickBooks, NetSuite, SAP), the PO and goods-receipt source, and the vendor master.

Governance

Approval rules and limits, an audit log, data retention and PII handling, and separation of duties between who approves and who pays.

The stack, and why each piece

Layer | Pick | Why
Intake trigger | Gmail / Outlook push API, or a poller | Fire on every new invoice email. Push is real-time; polling every few minutes is simpler to run.
OCR / parse | Google Document AI, AWS Textract, or LlamaParse | Turns scans and PDFs into text plus layout. Tesseract is fine for clean digital files only.
Extraction | An LLM with structured output (GPT, Claude, Gemini) | Reads any layout and returns typed fields. This is what survives vendor format drift that breaks templates.
Orchestration | LangGraph or a small state machine | Drives the invoice through states (ingested, extracted, matched, approved, posted) and the exception branch.
Matching + state | Your ERP API (Tally, Zoho, SAP) + Postgres | Pulls POs and receipts to compare against, and stores runs, decisions, and corrections.
Review UI | A web app, a sheet, or Slack approval buttons | Where a human clears exceptions. Start with Slack buttons; graduate to a console as volume grows.

Build it, step by step

1. Ingest reliably

Subscribe to the accounts-payable inbox and fire the pipeline on every new email that has an attachment. The Gmail and Outlook push APIs give you real-time events; a poller that checks every few minutes is the simpler fallback. Normalise everything to a job the moment it arrives so the handler returns fast and the heavy work happens in the queue. Uploads and vendor-portal exports feed the exact same queue, so there is one path to maintain. Capture the raw file and a content hash up front: the hash is what later stops the same invoice being processed twice.

# Watch the AP inbox; enqueue each attachment, return fast.
# `gmail` and `jobs` stand in for your mail-push client and job queue.
from hashlib import sha256

@gmail.on_message(label="ap", has_attachment=True)
def on_invoice(msg):
    for att in msg.attachments:           # pdf, scan, or image
        raw = att.download()              # raw bytes of the attachment
        jobs.enqueue("process_invoice", {
            "file": raw,
            "hash": sha256(raw).hexdigest(),   # hex digest, used for dedup later
            "source": msg.sender,
        })
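
To make the "normalise to a job and capture a content hash" step concrete, here is a small self-contained sketch using only the standard library. `make_job` and `IntakeQueue` are hypothetical names, and the in-memory queue is a stand-in for whatever queue you actually run. Skipping byte-identical files at intake is one option assumed here; the guide's own duplicate check runs later, before posting.

```python
import hashlib
from collections import deque


def make_job(raw: bytes, source: str) -> dict:
    """Normalise one attachment into a queue job with a content hash."""
    return {
        "file": raw,
        "hash": hashlib.sha256(raw).hexdigest(),
        "source": source,
    }


class IntakeQueue:
    """In-memory stand-in for a real job queue (hypothetical)."""

    def __init__(self) -> None:
        self.jobs: deque[dict] = deque()
        self._seen: set[str] = set()

    def enqueue(self, job: dict) -> bool:
        """Add the job unless the same bytes were already queued."""
        if job["hash"] in self._seen:
            return False
        self._seen.add(job["hash"])
        self.jobs.append(job)
        return True
```

Because the hash is computed over the raw bytes, the same PDF forwarded from two different senders produces the same job hash. A re-scan or re-export of the same invoice will not, which is why the later vendor plus invoice-number check is still needed.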

2. Classify the document

Not everything in an AP inbox is an invoice. Before extraction, a quick classification step sorts invoices from credit notes, statements, reminders, and spam, and routes non-invoices out. This one cheap step keeps your extraction accurate and your costs down, because you are not running a full extraction on a marketing email. Tag the vendor here too, so later steps can apply vendor-specific rules and memory.

kind = llm.classify(ocr_preview, labels=[
    "invoice", "credit_note", "statement", "reminder", "other",
])
if kind != "invoice":
    route_elsewhere(kind)                 # not our job, stop here

3. Extract into a strict schema

A scanned PDF or a photo is just an image, so OCR first converts it to text and layout (Document AI, Textract, or LlamaParse). Then an LLM maps that text into a typed schema, not free prose. Forcing structured output against a schema is what makes the result usable downstream and is the single biggest quality lever. Every field carries a confidence; anything low-confidence is flagged rather than trusted. Define the schema once and reuse it everywhere.

from pydantic import BaseModel, Field
from datetime import date

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float

class Invoice(BaseModel):
    vendor: str
    invoice_no: str
    invoice_date: date
    po_number: str | None = Field(None, description="PO number if present")
    currency: str = "INR"
    tax: float = 0
    total: float
    line_items: list[LineItem]

# Force the model to fill exactly this shape
def extract(ocr_text: str) -> Invoice:
    return llm.parse(ocr_text, response_format=Invoice)
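
The schema above does not show where the per-field confidence lives, so here is one way it might be carried alongside the values. This is a sketch under assumptions: `ExtractedField`, `low_confidence`, and the 0.9 default threshold are illustrative and not part of any OCR or LLM library; where the confidence number comes from depends on your OCR or model layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: object
    confidence: float  # 0.0 to 1.0, as reported by the OCR or model layer


def low_confidence(fields: list[ExtractedField], threshold: float = 0.9) -> list[str]:
    """Names of fields whose confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]
```

Any non-empty result would send the invoice to review with those field names attached, rather than letting a doubtful total flow into matching.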

4. Validate and three-way match

Now check what you read against your own records. Re-add the line items and compare to the stated total to catch arithmetic and OCR errors. Look up the PO and the goods receipt by PO number through your ERP API (or a nightly export) and compare totals and quantities within a tolerance you set. Verify the vendor against your master, and confirm this invoice number has not already been paid. When invoice, PO, and receipt agree, that is the three-way match, and it is deterministic comparison code, so it is testable and auditable, unlike asking the model to decide.

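def three_way_match(inv: Invoice, po, grn, tolerance=100) -> dict:
    # Assumes line amounts are pre-tax, so lines + tax should equal the total.
    lines_total = sum(li.amount for li in inv.line_items)
    return {
        "math":     abs(lines_total + inv.tax - inv.total) < 1,
        "po_total": abs(inv.total - po.total) <= tolerance,
        # .get(..., 0): an item missing from the receipt fails the check
        # instead of raising KeyError
        "receipt":  all(grn.qty.get(li.description, 0) >= li.quantity
                        for li in inv.line_items),
        "vendor":   vendor_master.is_approved(inv.vendor),
        "not_dup":  not ledger.exists(inv.vendor, inv.invoice_no),
    }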
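
A line-level variant of the same comparison can be sketched with the standard library alone. The assumptions here are loud ones: lines are keyed by a SKU rather than a free-text description (descriptions often differ between invoice and receipt), money uses `Decimal` to avoid float rounding, and `Line`, `match_lines`, and the tolerance of 1.00 are hypothetical names and values for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Line:
    sku: str
    quantity: Decimal
    amount: Decimal


def match_lines(invoice: list[Line], po: list[Line],
                received: dict[str, Decimal],
                tolerance: Decimal = Decimal("1.00")) -> list[str]:
    """Return human-readable mismatches; an empty list is a clean match."""
    po_by_sku = {line.sku: line for line in po}
    problems = []
    for line in invoice:
        ordered = po_by_sku.get(line.sku)
        if ordered is None:
            problems.append(f"{line.sku}: not on PO")
            continue
        if abs(line.amount - ordered.amount) > tolerance:
            problems.append(f"{line.sku}: amount differs from PO")
        if received.get(line.sku, Decimal("0")) < line.quantity:
            problems.append(f"{line.sku}: billed more than received")
    return problems
```

Returning the list of problems, rather than a single boolean, gives the reviewer the exact failing line in the exception queue.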

5. Apply the decision policy

Turn business rules into explicit conditions you own. Clean, in-policy invoices auto-approve; everything else routes to a human with the exact failing check named, so the reviewer knows why at a glance. The model never decides on its own here; it applies your policy, and every decision is logged with the rule that fired. Keep the thresholds in config, not code, so finance can change them without a deploy.

def decide(inv: Invoice, checks: dict) -> tuple[str, list]:
    failed = [name for name, ok in checks.items() if not ok]
    if not failed and inv.total < policy.auto_approve_limit:
        return "auto_approve", []
    return "exception", failed            # e.g. ["po_total"] -> PO mismatch
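
One way to keep thresholds in config rather than code is a small JSON document that is parsed and sanity-checked before the agent uses it. This is a sketch, not a prescribed format: the `Policy` fields, the key names `auto_approve_limit` and `po_tolerance`, and the `load_policy` helper are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    auto_approve_limit: float
    po_tolerance: float


def load_policy(text: str) -> Policy:
    """Parse and sanity-check policy JSON before the agent uses it."""
    raw = json.loads(text)
    policy = Policy(
        auto_approve_limit=float(raw["auto_approve_limit"]),
        po_tolerance=float(raw["po_tolerance"]),
    )
    if policy.auto_approve_limit < 0 or policy.po_tolerance < 0:
        raise ValueError("policy thresholds must be non-negative")
    return policy
```

Finance can then edit a file such as `{"auto_approve_limit": 50000, "po_tolerance": 100}` (illustrative values) and the change takes effect on the next load. Rejecting bad values at load time keeps a typo from silently auto-approving everything.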

6. Handle exceptions with a human in the loop

Exceptions are a first-class feature, not an afterthought, because they are where money is saved or lost. A flagged invoice lands on a review surface (a web console, a sheet, or a Slack message with buttons) showing the invoice, the extracted fields, and the specific failing check. The reviewer approves or corrects in seconds. Two things make this strong: a retry budget that re-extracts with a stricter prompt before bothering a human, and storing every correction keyed to the vendor so the agent stops repeating that mistake. Low-confidence fields and unseen vendors should default to review until they have a track record.

def on_exception(inv, failed):
    if inv.retries < 2 and "math" in failed:
        inv.retries += 1                       # spend the retry budget
        return reextract(inv, stricter=True)   # try again before a human
    review_queue.add(inv, reason=failed,
                     fields=inv.dict(), confidence=inv.confidence)

# Learn from the fix so it does not recur for this vendor.
# The callback receives the invoice, so it does not depend on a global `inv`.
on_correction(lambda inv, fix: memory.save(inv.vendor, fix))
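
The vendor-keyed memory can be as simple as a lookup of "this value was wrong, this is what the reviewer changed it to". The sketch below is an in-memory stand-in for the Postgres table mentioned in the stack; `CorrectionMemory` and its `save` and `apply` methods are hypothetical, and a real system would more likely feed corrections back into the extraction prompt than rewrite values blindly.

```python
from collections import defaultdict


class CorrectionMemory:
    """Remembers reviewer fixes per vendor (in-memory stand-in for a database)."""

    def __init__(self) -> None:
        # vendor -> field name -> wrong value -> corrected value
        self._fixes: dict[str, dict[str, dict[str, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def save(self, vendor: str, field: str, wrong: str, right: str) -> None:
        """Record that `wrong` in `field` should have been `right` for this vendor."""
        self._fixes[vendor][field][wrong] = right

    def apply(self, vendor: str, fields: dict[str, str]) -> dict[str, str]:
        """Return a copy of `fields` with known fixes for this vendor applied."""
        known = self._fixes.get(vendor, {})
        return {
            name: known.get(name, {}).get(value, value)
            for name, value in fields.items()
        }
```

Keying by vendor matters: a fix learned from one supplier's layout should not be applied to another supplier's invoices.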

7. Post to the ERP and reconcile

On approval, write the bill through your accounting tool's API and attach the source PDF and the decision log, so finance can defend every entry. Wrap the agent's actions as typed tools (here, a LangChain StructuredTool) so the orchestrator can call them safely with validated arguments. Where a tool has no API, emit a validated import file instead of re-keying. After posting, reconcile the payment back against the invoice so the loop closes.

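from langchain_core.tools import StructuredTool

# With args_schema=Invoice the tool is invoked with the schema's fields as
# keyword arguments, so accept them and rebuild the model before posting.
def _post_bill(**fields):
    inv = Invoice(**fields)
    return erp.create_bill(                      # Tally / Zoho / QuickBooks
        vendor=inv.vendor, total=inv.total,
        line_items=inv.line_items,
        # add a source_file field to the schema to carry the PDF through
        attachment=fields.get("source_file"),
    )

post_bill = StructuredTool.from_function(
    name="post_bill_to_erp",
    description="Create a bill in the ERP after approval; returns bill id.",
    args_schema=Invoice,
    func=_post_bill,
)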
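
For an accounting tool without an API, the "validated import file" could be a CSV built with the standard library. This is a minimal sketch: the column list and the `to_import_csv` name are assumptions, and the real column names and order must come from your accounting tool's import specification.

```python
import csv
import io

# Hypothetical column set; match it to your tool's import format.
COLUMNS = ["vendor", "invoice_no", "invoice_date", "currency", "total"]


def to_import_csv(bills: list[dict]) -> str:
    """Render approved bills as CSV, refusing rows with missing fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for bill in bills:
        missing = [c for c in COLUMNS if bill.get(c) in (None, "")]
        if missing:
            raise ValueError(f"bill {bill.get('invoice_no')!r} is missing {missing}")
        writer.writerow({c: bill[c] for c in COLUMNS})
    return buffer.getvalue()
```

Failing the whole file on one incomplete row is deliberate in this sketch: a partial import is harder to reconcile than a rejected one.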

8. Add memory, monitoring, and guardrails

Production is the hard part. Block duplicates on vendor plus invoice number plus the content hash before anything posts. Keep an immutable, append-only audit log of every step, who or what did it, and when, so finance and auditors can trust it. Track extraction accuracy, auto-approval rate, and exception rate on a dashboard, and alert when accuracy slips. Watch for fraud signals such as a vendor's bank details changing between invoices. Feed corrections back so accuracy climbs over time. This is the difference between a weekend demo and something a finance team will sign off on.

def guard(inv, run):
    audit.append(run)                         # append-only: who/what/when, logged even when held
    if ledger.exists(inv.vendor, inv.invoice_no) or ledger.seen(inv.hash):
        return flag("duplicate")              # never pay the same bill twice
    if vendor_master.bank_changed(inv.vendor, inv.bank_account):
        return flag("bank_detail_change")     # classic fraud vector -> hold
    return None                               # clear to post

# Tracked across runs, outside the per-invoice path
metrics.track(extraction_accuracy, auto_approve_rate, exception_rate)
alert_if(extraction_accuracy < 0.95)
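
One common way to make an append-only log tamper-evident is to chain entries by hash, so each entry commits to the one before it. The sketch below is an illustration of that technique, not the guide's own audit store: `AuditLog` and its fields are hypothetical, and a production log would also need write-once storage and an externally stored copy of the latest hash, since a chain alone cannot detect entries trimmed from the end.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Stable SHA-256 over an entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, actor: str, action: str, invoice_no: str, at: str) -> dict:
        """Add an entry recording who did what to which invoice, and when."""
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {"actor": actor, "action": action,
                "invoice_no": invoice_no, "at": at, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; False means an entry was altered or the chain broken."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True
```

Editing any past entry changes its hash and breaks every link after it, which is what lets an auditor check the log without trusting the system that wrote it.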

The review interface

The human only sees exceptions. The console shows the invoice, the extracted fields, the match result, and one decision to make.

Invoice review queue (mockup)

Exceptions (3):

  • Acme Corp: PO mismatch
  • Globex: amount over limit
  • Initech: no PO found

Selected invoice, extracted fields:

  • Vendor: Acme Corp
  • Invoice #: INV-4471
  • Amount: Rs 1,24,500
  • PO #: PO-2289
  • Match: PO total differs by Rs 4,500

Actions: Approve, Flag

Build, buy, or no-code

Approach | Best for | Effort | Cost | Control
Custom agent (this guide) | You need control, your own rules, and tight ERP fit | High | Build cost + LLM usage | Full
IDP platform (Nanonets, Rossum) | Standard AP, fast start, less engineering | Low | Per-page or seat subscription | Medium
No-code (n8n, Make) | Low volume, a quick pilot, simple rules | Low | Cheap, but brittle at scale | Low

Security, compliance, and fraud

AP touches money, so trust is the product. Run the pipeline in your own cloud or on-premise if your SOC 2 or GDPR obligations require it, with the OCR and model self-hosted where needed. Enforce separation of duties so the system that approves is not the one that pays, and keep an immutable audit log of every action for auditors.

Two fraud vectors matter most. Duplicate payments: blocked by the vendor, invoice-number, and content-hash check before posting. Changed bank details: any account that differs from the vendor master is held for human confirmation, since redirecting payments to a new account is the classic invoice fraud.

The numbers

Benchmarks vary by source and should be treated as ranges, not promises. These are from neutral industry bodies, not vendor marketing.

Metric | Manual | Automated / agent
Cost per invoice | ~$15 to $40 (Ardent Partners) | Top-quartile teams run ~$10 or below (APQC)
Invoices per person / year | ~4,200 (IOFM average) | ~6,900 at top performers (IOFM)
Early-payment discounts captured | 20 to 30% (Ardent Partners) | Higher, because nothing is paid late by accident
Touchless rate | Low; most invoices are keyed by hand | Climbs as the auto-approve threshold proves out

Sources: Ardent Partners AP Metrics That Matter; APQC AP benchmarking; IOFM. Figures are industry ranges.
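
To turn the cost-per-invoice row into a rough range for your own volume, the arithmetic is just volume times the per-invoice difference. The sketch below uses the benchmark figures quoted above as defaults; the function name and the idea of plugging in your monthly volume are assumptions for illustration. The two benchmarks come from different sources and populations, and the result ignores build and running costs, so treat it as an order-of-magnitude check.

```python
def monthly_saving_range(invoices_per_month: int,
                         manual_cost: tuple[float, float] = (15.0, 40.0),
                         automated_cost: float = 10.0) -> tuple[float, float]:
    """Low and high monthly saving, in the same currency as the inputs."""
    low, high = manual_cost
    return (invoices_per_month * (low - automated_cost),
            invoices_per_month * (high - automated_cost))
```

For a hypothetical 1,000 invoices a month, `monthly_saving_range(1000)` returns `(5000.0, 30000.0)`, that is, roughly $5,000 to $30,000 a month before build and running costs.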

What makes it fail

  • Trusting low-confidence extractions instead of routing them to review.
  • No immutable audit trail, so finance cannot defend a posting.
  • Template parsing that breaks when a vendor changes layout.
  • No dedup or bank-change check, so it pays twice or pays a fraudster.
  • A brittle ERP integration that silently drops postings under load.

A safe rollout

  1. Pilot on one high-volume vendor format, with a human approving every invoice.
  2. Turn on auto-approve for clean, in-policy invoices once accuracy proves out; keep exceptions manual.
  3. Widen format by format, raise the auto-approve limit, and add monitoring and alerts.
  4. Scale to all vendors and entities on one engine, each with its own rules.
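
The gate between steps 1 and 2 ("once accuracy proves out") can be made explicit per vendor. This is a sketch under stated assumptions: `VendorRecord`, the minimum of 20 reviewed invoices, and the 5 percent correction ceiling are illustrative defaults, not figures from this guide, and belong in the same policy config as your other thresholds.

```python
from dataclasses import dataclass


@dataclass
class VendorRecord:
    reviewed: int = 0    # invoices a human has checked
    corrected: int = 0   # of those, how many needed a fix

    def record(self, needed_fix: bool) -> None:
        """Log one human-reviewed invoice for this vendor."""
        self.reviewed += 1
        self.corrected += int(needed_fix)

    def may_auto_approve(self, min_reviewed: int = 20,
                         max_fix_rate: float = 0.05) -> bool:
        """True once enough invoices were reviewed with few corrections."""
        if self.reviewed < min_reviewed:
            return False
        return self.corrected / self.reviewed <= max_fix_rate
```

A vendor that starts producing corrections again drops back below the gate on its own, so widening the rollout does not depend on someone remembering to switch auto-approve off.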

FAQs

  • What accuracy can I expect, and how does scan quality affect it?
    Extraction is highly accurate on clean digital PDFs and drops on poor scans and photos, which is exactly why every field carries a confidence score and low-confidence fields route to a human instead of posting. Accuracy also climbs over time because the agent learns from each correction. Treat any single accuracy percentage you see in vendor marketing as best-case for clean inputs.
  • Do I need three-way matching for every invoice, or is two-way enough?
    Two-way (invoice against PO) is enough for services and where you do not track goods receipts. Three-way (invoice, PO, and goods receipt) is the standard for physical goods because it confirms you are paying for what actually arrived. The agent supports both; you set which applies per category, with a tolerance for small differences.
  • How does the agent handle an invoice from a brand-new vendor or a new format?
    Because extraction is model-based rather than template-based, a new layout still gets read; it does not break the way a template tool would. New vendors default to human review until they have a short track record, after which clean ones can auto-approve under your policy.
  • How does it detect and prevent duplicate payments?
    Every file is hashed on intake, and before posting the agent checks vendor plus invoice number plus that hash against the ledger. If any match, it flags a duplicate and holds it. This catches both the same PDF sent twice and the same invoice resubmitted in a different format.
  • How do you prevent fraud such as changed bank details or fake invoices?
    The agent compares the bank details on each invoice against the vendor master and holds anything that changed for explicit human confirmation, since bank-detail swaps are a common fraud. Combined with vendor validation, approval thresholds, and an immutable audit log, it closes the usual gaps.
  • Can this run on-premise for data security and compliance?
    Yes. The pipeline can run in your cloud or on-premise, with the OCR and the model self-hosted where required for SOC 2 or GDPR. The audit log, access controls, and data retention are part of the build, not an add-on.
  • How does it integrate with QuickBooks, Xero, NetSuite, SAP, or Tally?
    Through each tool's API, which all of these expose. The posting step is built against your specific system and sends fields in the format it expects, with the PDF and audit trail attached. Where a system has no API, the agent emits a validated import file.
  • What is the difference between this and traditional OCR or RPA?
    OCR only turns an image into text. RPA replays fixed clicks and breaks when a screen or layout changes. An agent reads any layout, validates against your records, decides under your rules, and escalates what it cannot resolve, which is the part that actually removes manual work.
  • LLM vs a dedicated IDP platform for extraction?
    An LLM with structured output is flexible and handles format drift well, and is ideal when you want control and custom rules. A dedicated IDP platform can be faster to start for standard invoices. Many teams use the LLM for extraction and reasoning and keep a deterministic matching layer around it, which is the approach in this guide.
  • What does it cost to run, and what is the realistic ROI timeline?
    Running cost is mostly LLM usage per invoice plus hosting, which is small next to the labour it removes. Industry benchmarks put manual processing at roughly $15 to $40 per invoice (Ardent Partners) versus around $10 or below for top-quartile automated teams (APQC). Payback usually shows within a few months once the high-volume vendor formats are live.
  • Should I build a custom agent or buy an off-the-shelf AP tool?
    Buy if your AP is standard and you want speed with less engineering. Build (or have it built) when you need your own rules, tight ERP integration, and control of the data and IP. See the comparison table above; the honest answer depends on volume and how custom your process is.
  • How are low-confidence extractions and exceptions escalated?
    Low-confidence fields and failed checks route to a review queue with the invoice, the extracted fields, and the exact reason flagged. The agent first spends a small retry budget (re-extracting with a stricter prompt) before involving a human, and every human correction is stored so the same issue does not recur.
