OCR & Auto-categorization

SOPHIOS uses a three-stage AI pipeline to extract, validate, and score invoice data automatically. Upload an invoice as a PDF, JPEG, or PNG and the pipeline handles the rest, with no manual data entry required.

The 3-Stage Pipeline

Stage 1: Extract

The AI reads your invoice document and extracts structured data:

  • Vendor name and contact information
  • Invoice number and date
  • Total amount, subtotal, and tax
  • Line items with descriptions, quantities, and unit prices
  • Expense category (operations, maintenance, fuel, catering, etc.)
  • Currency
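
To make the list above concrete, here is a minimal sketch of what a Stage 1 result could look like as a TypeScript type. The interface and field names (`ExtractedInvoice`, `invoiceNumber`, and so on) are assumptions chosen for illustration, not the actual SOPHIOS schema, and the sample values are invented.

```typescript
// Hypothetical shape of a Stage 1 extraction result.
// All names below are illustrative assumptions, not the SOPHIOS schema.
interface ExtractedLineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface ExtractedInvoice {
  vendor: { name: string; contact?: string };
  invoiceNumber: string;
  invoiceDate: string; // assumed to be an ISO 8601 date string
  subtotal: number;
  tax: number;
  total: number;
  lineItems: ExtractedLineItem[];
  category: string; // e.g. "operations", "maintenance", "fuel", "catering"
  currency: string; // assumed to be an ISO 4217 code such as "EUR"
}

// Invented sample data, for illustration only.
const exampleInvoice: ExtractedInvoice = {
  vendor: { name: "Harbor Fuel Supply", contact: "billing@example.com" },
  invoiceNumber: "INV-1042",
  invoiceDate: "2024-03-15",
  subtotal: 1000,
  tax: 200,
  total: 1200,
  lineItems: [{ description: "Marine diesel", quantity: 500, unitPrice: 2 }],
  category: "fuel",
  currency: "EUR",
};
```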

Stage 2: Validate

Extracted data passes through automated validation checks:

  • Zod schema validation — ensures all required fields are present and correctly typed
  • Mathematical verification — confirms subtotal + tax = total, line items sum correctly
  • Anomaly detection — flags unusual amounts, new vendors, or missing required fields
  • Duplicate detection — checks for matching vendor + invoice number, or similar amounts and dates
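
As a rough illustration of two of these checks, the sketch below implements a mathematical verification and an exact-match duplicate test. It rests on several assumptions: the type and function names are hypothetical, amounts are compared with a one-cent tolerance, "line items sum correctly" is taken to mean that they sum to the subtotal, and the fuzzy "similar amounts and dates" duplicate check is left out.

```typescript
// Hypothetical types and helpers; not the SOPHIOS implementation.
interface CheckedLineItem {
  quantity: number;
  unitPrice: number;
}

interface CheckedInvoice {
  vendor: string;
  invoiceNumber: string;
  subtotal: number;
  tax: number;
  total: number;
  lineItems: CheckedLineItem[];
}

// Assumed tolerance of one cent to absorb rounding differences.
const TOLERANCE = 0.01;

function nearlyEqual(a: number, b: number): boolean {
  return Math.abs(a - b) <= TOLERANCE;
}

// Returns a list of human-readable issues; an empty list means the math checks out.
function checkMath(inv: CheckedInvoice): string[] {
  const issues: string[] = [];
  if (!nearlyEqual(inv.subtotal + inv.tax, inv.total)) {
    issues.push("subtotal + tax does not equal total");
  }
  const lineSum = inv.lineItems.reduce(
    (sum, item) => sum + item.quantity * item.unitPrice,
    0
  );
  // Assumption: line items are expected to sum to the subtotal.
  if (inv.lineItems.length > 0 && !nearlyEqual(lineSum, inv.subtotal)) {
    issues.push("line items do not sum to subtotal");
  }
  return issues;
}

// Exact duplicate: same vendor (case-insensitive) and same invoice number.
function isExactDuplicate(a: CheckedInvoice, b: CheckedInvoice): boolean {
  return (
    a.vendor.trim().toLowerCase() === b.vendor.trim().toLowerCase() &&
    a.invoiceNumber.trim() === b.invoiceNumber.trim()
  );
}
```

Returning a list of issues rather than a single boolean lets a reviewer see every failed check at once.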

Stage 3: Score

Each extraction receives a confidence score and quality assessment:

  • Confidence score (0-100%) — how certain the AI is about the extracted data
  • Quality assessment — HIGH, MEDIUM, or LOW based on document clarity and extraction accuracy

Confidence Levels

After processing, each invoice is assigned a confidence level:

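| Level  | Score     | Indicator | What It Means                                                                 |
|--------|-----------|-----------|-------------------------------------------------------------------------------|
| HIGH   | 90-100%   | Green     | Extraction is reliable. Review and approve.                                   |
| MEDIUM | 70-89%    | Yellow    | Most fields extracted correctly. Some may need manual verification.           |
| LOW    | Below 70% | Red       | Manual review recommended. Document may be unclear or in an unusual format.   |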
⚠️ Low-confidence invoices are automatically set to NEEDS_REVIEW status. Always verify the extracted data before approving these invoices.
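
The thresholds in the confidence table can be expressed as a small lookup. This is a sketch only: the function and constant names are hypothetical, and treating fractional scores such as 89.5 as MEDIUM (anything below 90) is an assumption, since the table lists whole-number ranges.

```typescript
type ConfidenceLevel = "HIGH" | "MEDIUM" | "LOW";

// Indicator colors as listed in the confidence table.
const INDICATOR: { [K in ConfidenceLevel]: string } = {
  HIGH: "Green",
  MEDIUM: "Yellow",
  LOW: "Red",
};

// Maps a confidence score (0-100) to its level.
// Assumption: boundaries are inclusive at 90 and 70.
function confidenceLevel(scorePercent: number): ConfidenceLevel {
  if (isNaN(scorePercent) || scorePercent < 0 || scorePercent > 100) {
    throw new RangeError("score must be between 0 and 100");
  }
  if (scorePercent >= 90) return "HIGH";
  if (scorePercent >= 70) return "MEDIUM";
  return "LOW";
}
```

For example, `INDICATOR[confidenceLevel(82)]` evaluates to `"Yellow"`.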


Supported Formats

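| Format | Supported | Notes                                                  |
|--------|-----------|--------------------------------------------------------|
| PDF    | Yes       | Recommended format. Best results with text-based PDFs. |
| JPEG   | Yes       | Good for scanned invoices and photos.                  |
| PNG    | Yes       | Good for screenshots and scanned documents.            |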

File size: Maximum 10MB. Recommended under 5MB for faster processing.

Processing time: Typically 10-30 seconds depending on document complexity and file size.
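
If you upload invoices from your own tooling, a client-side pre-check along these lines can catch unsupported files before they are sent. It is a sketch under assumptions: the function name is hypothetical, formats are identified by their standard MIME types, and "10MB" is interpreted as 10 × 1024 × 1024 bytes.

```typescript
// Standard MIME types for the three supported formats.
const SUPPORTED_TYPES = ["application/pdf", "image/jpeg", "image/png"];

// Assumption: 1MB is treated as 1024 * 1024 bytes.
const MAX_BYTES = 10 * 1024 * 1024;
const RECOMMENDED_BYTES = 5 * 1024 * 1024;

interface UploadCheck {
  ok: boolean;
  warnings: string[];
  error?: string;
}

// Hypothetical pre-upload check mirroring the documented limits.
function checkUpload(mimeType: string, sizeBytes: number): UploadCheck {
  if (SUPPORTED_TYPES.indexOf(mimeType) === -1) {
    return { ok: false, warnings: [], error: "Unsupported format: " + mimeType };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, warnings: [], error: "File exceeds the 10MB limit" };
  }
  const warnings: string[] = [];
  if (sizeBytes > RECOMMENDED_BYTES) {
    warnings.push("Files under 5MB process faster");
  }
  return { ok: true, warnings: warnings };
}
```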


Auto-categorization

The AI automatically categorizes invoices based on:

  • Vendor name — known vendors are matched to their usual expense categories
  • Description keywords — line item descriptions are analyzed for category signals
  • Historical patterns — previous invoices from the same vendor inform categorization

Expense Type Classification

Each invoice is classified into one of three expense types:

  • OPEX (Operational Expenditure) — day-to-day running costs like fuel, provisions, marina fees. This is the default classification.
  • CAPEX (Capital Expenditure) — significant investments like equipment purchases, refits, upgrades.
  • MIXED — invoices containing both operational and capital items.
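
The three-way classification can be sketched as follows, assuming each line item has already been tagged as operational or capital by an earlier step. That tagging step, and all names here, are hypothetical; the source does not describe how individual items are labelled.

```typescript
type ExpenseType = "OPEX" | "CAPEX" | "MIXED";
type ItemKind = "operational" | "capital";

// Derives the invoice-level expense type from per-item tags.
function classifyExpenseType(itemKinds: ItemKind[]): ExpenseType {
  const hasCapital = itemKinds.indexOf("capital") !== -1;
  const hasOperational = itemKinds.indexOf("operational") !== -1;
  if (hasCapital && hasOperational) return "MIXED";
  if (hasCapital) return "CAPEX";
  // OPEX is the documented default, used here also when no items are tagged.
  return "OPEX";
}
```

An invoice containing a fuel delivery and a new generator would come out as MIXED under this sketch.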

Vendor Matching

When an invoice is processed, SOPHIOS automatically:

  1. Normalizes the vendor name (handles variations in spelling, abbreviations)
  2. Searches existing vendor records for a match
  3. Links to the existing vendor if found, or creates a new vendor record
  4. Updates vendor statistics (invoice count, total spend)
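
The four steps above might look roughly like this in code. It is a simplified sketch, not the SOPHIOS matcher: the names are hypothetical, normalization is limited to lowercasing, stripping punctuation, and dropping a short assumed list of company suffixes, and matching is by exact equality of the normalized name. Non-ASCII letters are stripped by this version, so real vendor data would need something more careful.

```typescript
interface Vendor {
  id: number;
  name: string;
  normalizedName: string;
  invoiceCount: number;
  totalSpend: number;
}

// Assumed list of legal suffixes to ignore when comparing names.
const COMPANY_SUFFIXES = ["ltd", "llc", "inc", "gmbh", "co", "corp"];

// Step 1: normalize the vendor name.
function normalizeVendorName(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/&/g, " and ")
    .replace(/[^a-z0-9]+/g, " ")
    .trim()
    .split(" ")
    .filter((word) => word.length > 0 && COMPANY_SUFFIXES.indexOf(word) === -1)
    .join(" ");
}

// Steps 2-4: find a match, link or create, then update statistics.
function recordInvoice(vendors: Vendor[], rawName: string, amount: number): Vendor {
  const key = normalizeVendorName(rawName);
  let vendor: Vendor | undefined = vendors.filter((v) => v.normalizedName === key)[0];
  if (!vendor) {
    vendor = {
      id: vendors.length + 1,
      name: rawName.trim(),
      normalizedName: key,
      invoiceCount: 0,
      totalSpend: 0,
    };
    vendors.push(vendor);
  }
  vendor.invoiceCount += 1;
  vendor.totalSpend += amount;
  return vendor;
}
```

With this sketch, "Harbor Fuel Supply Ltd." and "HARBOR FUEL SUPPLY, LTD" both normalize to `harbor fuel supply` and resolve to the same vendor record.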

Invoice Status After OCR

An invoice shows one of the following statuses during and after OCR. Once extraction completes, the confidence score determines which of the last two applies:

  • PROCESSING — extraction is in progress
  • VERIFIED — high confidence extraction, ready for approval
  • NEEDS_REVIEW — low or medium confidence, requires manual verification

From there, invoices follow the standard approval workflow: VERIFIED > APPROVED > EXECUTED.
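
The routing and the approval chain described above can be modelled as a small state machine. This sketch uses hypothetical names and assumes the HIGH threshold of 90% decides between VERIFIED and NEEDS_REVIEW. How an invoice moves from NEEDS_REVIEW to VERIFIED after manual review is not described here, so that transition is deliberately left out.

```typescript
type InvoiceStatus =
  | "PROCESSING"
  | "VERIFIED"
  | "NEEDS_REVIEW"
  | "APPROVED"
  | "EXECUTED";

// Status assigned once extraction finishes.
// Assumption: scores of 90 and above count as high confidence.
function statusAfterOcr(scorePercent: number): InvoiceStatus {
  return scorePercent >= 90 ? "VERIFIED" : "NEEDS_REVIEW";
}

// Forward transitions of the standard approval workflow.
const NEXT_STATUS: { [K in InvoiceStatus]?: InvoiceStatus } = {
  VERIFIED: "APPROVED",
  APPROVED: "EXECUTED",
};

// Moves an invoice one step along VERIFIED > APPROVED > EXECUTED.
function advanceStatus(current: InvoiceStatus): InvoiceStatus {
  const next = NEXT_STATUS[current];
  if (next === undefined) {
    throw new Error("No forward transition from " + current);
  }
  return next;
}
```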


Best Practices for Upload Quality

  • Use 300 DPI resolution or higher for scanned documents
  • Upload in PDF format when possible
  • Ensure text is clear and legible
  • Keep documents upright (not rotated or skewed)
  • Use good lighting for photos of paper invoices

Privacy

All OCR processing runs on SOPHIOS private infrastructure. Invoice images and extracted data are never sent to external AI providers. See Private AI Infrastructure for details.

