OCR & Auto-categorization
SOPHIOS uses a 3-stage AI pipeline to extract, validate, and score invoice data automatically. Upload a PDF or image of an invoice and the system does the rest — no manual data entry required.
The 3-Stage Pipeline
Stage 1: Extract
The AI reads your invoice document and extracts structured data:
- Vendor name and contact information
- Invoice number and date
- Total amount, subtotal, and tax
- Line items with descriptions, quantities, and unit prices
- Expense category (operations, maintenance, fuel, catering, etc.)
- Currency
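To make the list above concrete, here is a minimal sketch of what the extracted data could look like as a TypeScript type. The type names, field names, and sample values are all illustrative assumptions, not the actual SOPHIOS schema.

```typescript
// Hypothetical shape of the Stage 1 output. Every name here is an assumption.
interface ExtractedLineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface ExtractedInvoice {
  vendorName: string;
  vendorContact?: string; // contact information may be absent on some invoices
  invoiceNumber: string;
  invoiceDate: string; // assumed ISO 8601 date, e.g. "2024-03-15"
  subtotal: number;
  tax: number;
  total: number;
  currency: string; // assumed ISO 4217 code, e.g. "EUR"
  category: string; // e.g. "operations", "maintenance", "fuel", "catering"
  lineItems: ExtractedLineItem[];
}

// Made-up sample data, for illustration only.
const sampleInvoice: ExtractedInvoice = {
  vendorName: "Harbor Fuel Co.",
  invoiceNumber: "INV-1001",
  invoiceDate: "2024-03-15",
  subtotal: 350,
  tax: 70,
  total: 420,
  currency: "EUR",
  category: "fuel",
  lineItems: [
    { description: "Marine diesel", quantity: 200, unitPrice: 1.5 },
    { description: "Delivery fee", quantity: 1, unitPrice: 50 },
  ],
};
```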
Stage 2: Validate
Extracted data passes through automated validation checks:
- Zod schema validation — ensures all required fields are present and correctly typed
- Mathematical verification — confirms subtotal + tax = total, line items sum correctly
- Anomaly detection — flags unusual amounts, new vendors, or missing required fields
- Duplicate detection — checks for matching vendor + invoice number, or similar amounts and dates
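The mathematical and duplicate checks above can be sketched roughly as follows. This is a simplified illustration, not the SOPHIOS implementation: the function names, the 0.01 rounding tolerance, the assumption that line items sum to the subtotal, and the "within 1% and 3 days, same vendor" rule for similar invoices are all assumptions.

```typescript
interface MathCheckInvoice {
  subtotal: number;
  tax: number;
  total: number;
  lineItems: { quantity: number; unitPrice: number }[];
}

// Returns a list of human-readable issues. An empty list means the math checks out.
function verifyInvoiceMath(inv: MathCheckInvoice, tolerance = 0.01): string[] {
  const issues: string[] = [];
  if (Math.abs(inv.subtotal + inv.tax - inv.total) > tolerance) {
    issues.push("subtotal + tax does not equal total");
  }
  const lineSum = inv.lineItems.reduce(
    (sum, item) => sum + item.quantity * item.unitPrice,
    0
  );
  // Assumption: line items are expected to add up to the subtotal.
  if (inv.lineItems.length > 0 && Math.abs(lineSum - inv.subtotal) > tolerance) {
    issues.push("line items do not sum to subtotal");
  }
  return issues;
}

interface DupCandidate {
  vendorName: string;
  invoiceNumber: string;
  total: number;
  invoiceDate: string; // ISO 8601 date
}

// True when two invoices share vendor + invoice number, or come from the same
// vendor with similar amounts and dates (thresholds below are assumed).
function isLikelyDuplicate(a: DupCandidate, b: DupCandidate): boolean {
  const sameVendor =
    a.vendorName.trim().toLowerCase() === b.vendorName.trim().toLowerCase();
  if (sameVendor && a.invoiceNumber === b.invoiceNumber) return true;

  const dayMs = 24 * 60 * 60 * 1000;
  const daysApart =
    Math.abs(Date.parse(a.invoiceDate) - Date.parse(b.invoiceDate)) / dayMs;
  const amountClose =
    Math.abs(a.total - b.total) <= 0.01 * Math.max(a.total, b.total);
  return sameVendor && amountClose && daysApart <= 3;
}
```

Returning a list of issues rather than a single boolean lets a reviewer see every failed check at once.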
Stage 3: Score
Each extraction receives a confidence score and quality assessment:
- Confidence score (0-100%) — how certain the AI is about the extracted data
- Quality assessment — HIGH, MEDIUM, or LOW based on document clarity and extraction accuracy
Confidence Levels
After processing, each invoice is assigned a confidence level:
| Level | Score | Indicator | What It Means |
|---|---|---|---|
| HIGH | 90-100% | Green | Extraction is reliable. Review and approve. |
| MEDIUM | 70-89% | Yellow | Most fields extracted correctly. Some may need manual verification. |
| LOW | Below 70% | Red | Manual review recommended. Document may be unclear or in an unusual format. |
Low-confidence invoices are automatically set to NEEDS_REVIEW status. Always verify the extracted data before approving these invoices.
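As an illustration, the score ranges in the table above map to levels as in the sketch below. The function name is hypothetical, and treating fractional scores such as 89.5 as MEDIUM (anything below 90) is an assumption, since the table only lists whole percentages.

```typescript
type ConfidenceLevel = "HIGH" | "MEDIUM" | "LOW";

// Maps a confidence score (0-100) to the level shown in the table.
function confidenceLevelFor(scorePercent: number): ConfidenceLevel {
  // The negated range check also rejects NaN.
  if (!(scorePercent >= 0 && scorePercent <= 100)) {
    throw new RangeError("score must be between 0 and 100");
  }
  if (scorePercent >= 90) return "HIGH";
  if (scorePercent >= 70) return "MEDIUM";
  return "LOW";
}
```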
Supported Formats
| Format | Supported | Notes |
|---|---|---|
| PDF | Yes | Recommended format. Best results with text-based PDFs. |
| JPEG | Yes | Good for scanned invoices and photos. |
| PNG | Yes | Good for screenshots and scanned documents. |
File size: Maximum 10 MB. Recommended under 5 MB for faster processing.
Processing time: Typically 10-30 seconds depending on document complexity and file size.
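If you upload invoices from your own tooling, a pre-upload check along these lines can catch unsupported files early. This is a client-side sketch only: the function and constant names are hypothetical, and reading 10 MB as 10 × 1024 × 1024 bytes is an assumption.

```typescript
// Standard MIME types for the three supported formats.
const SUPPORTED_MIME_TYPES = ["application/pdf", "image/jpeg", "image/png"];
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // assumed meaning of "10 MB"
const RECOMMENDED_UPLOAD_BYTES = 5 * 1024 * 1024; // assumed meaning of "5 MB"

interface UploadCheckResult {
  ok: boolean;
  errors: string[];
  warnings: string[];
}

function checkUpload(mimeType: string, sizeBytes: number): UploadCheckResult {
  const errors: string[] = [];
  const warnings: string[] = [];

  if (SUPPORTED_MIME_TYPES.indexOf(mimeType) === -1) {
    errors.push("unsupported format: " + mimeType);
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    errors.push("file exceeds the 10 MB limit");
  } else if (sizeBytes > RECOMMENDED_UPLOAD_BYTES) {
    warnings.push("file is over 5 MB and may process more slowly");
  }
  return { ok: errors.length === 0, errors, warnings };
}
```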
Auto-categorization
The AI automatically categorizes invoices based on:
- Vendor name — known vendors are matched to their usual expense categories
- Description keywords — line item descriptions are analyzed for category signals
- Historical patterns — previous invoices from the same vendor inform categorization
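A rule-based approximation of these three signals is sketched below. It is not how the SOPHIOS AI works internally. The lookup tables, keywords, function name, and the order in which signals are consulted (known vendor, then keywords, then history) are all assumptions made for illustration.

```typescript
// Hypothetical lookup data. Real mappings would come from your own records.
const KNOWN_VENDOR_CATEGORIES: { [vendor: string]: string } = {
  "harbor fuel": "fuel",
};
const CATEGORY_KEYWORDS: { category: string; keywords: string[] }[] = [
  { category: "fuel", keywords: ["diesel", "fuel", "bunkering"] },
  { category: "catering", keywords: ["catering", "provisions"] },
  { category: "maintenance", keywords: ["repair", "service", "maintenance"] },
];

function suggestCategory(
  vendorName: string,
  descriptions: string[],
  vendorHistory: string[] // categories of previous invoices from this vendor
): string | null {
  // 1. Known vendor: use its usual category.
  const known = KNOWN_VENDOR_CATEGORIES[vendorName.trim().toLowerCase()];
  if (known) return known;

  // 2. Description keywords: first category with a matching keyword wins.
  const text = descriptions.join(" ").toLowerCase();
  for (const entry of CATEGORY_KEYWORDS) {
    for (const keyword of entry.keywords) {
      if (text.indexOf(keyword) !== -1) return entry.category;
    }
  }

  // 3. Historical pattern: most frequent past category for this vendor.
  const counts: { [category: string]: number } = {};
  let best: string | null = null;
  let bestCount = 0;
  for (const past of vendorHistory) {
    const n = (counts[past] || 0) + 1;
    counts[past] = n;
    if (n > bestCount) {
      best = past;
      bestCount = n;
    }
  }
  return best; // null when no signal is available
}
```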
Expense Type Classification
Each invoice is classified into one of three expense types:
- OPEX (Operational Expenditure) — day-to-day running costs like fuel, provisions, marina fees. This is the default classification.
- CAPEX (Capital Expenditure) — significant investments like equipment purchases, refits, upgrades.
- MIXED — invoices containing both operational and capital items.
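The relationship between the three expense types can be expressed as a small rule, sketched below. It assumes each line item has already been flagged as capital or operational (the `isCapital` field and the function name are hypothetical). How SOPHIOS makes that per-item decision is not covered here.

```typescript
type ExpenseType = "OPEX" | "CAPEX" | "MIXED";

interface ClassifiedItem {
  description: string;
  isCapital: boolean; // hypothetical per-item flag
}

function expenseTypeFor(items: ClassifiedItem[]): ExpenseType {
  const capitalCount = items.filter((item) => item.isCapital).length;
  // OPEX is the default, which also covers invoices with no line items.
  if (capitalCount === 0) return "OPEX";
  if (capitalCount === items.length) return "CAPEX";
  return "MIXED";
}
```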
Vendor Matching
When an invoice is processed, SOPHIOS automatically:
- Normalizes the vendor name (handles variations in spelling, abbreviations)
- Searches existing vendor records for a match
- Links to the existing vendor if found, or creates a new vendor record
- Updates vendor statistics (invoice count, total spend)
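The four steps above can be sketched as follows. This is a simplified illustration with hypothetical names. The real normalization likely handles more spelling variations than the punctuation and company-suffix stripping assumed here.

```typescript
interface VendorRecord {
  id: number;
  name: string;
  invoiceCount: number;
  totalSpend: number;
}

// Lowercases, strips punctuation and common company suffixes, collapses spaces.
function normalizeVendorName(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[.,]/g, "")
    .replace(/\b(ltd|limited|inc|incorporated|llc|co|company|corp|corporation|gmbh)\b/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Finds or creates the vendor for an invoice, then updates its statistics.
function recordInvoiceForVendor(
  vendors: VendorRecord[],
  rawName: string,
  invoiceTotal: number
): VendorRecord {
  const key = normalizeVendorName(rawName);
  let match: VendorRecord | undefined;
  for (const vendor of vendors) {
    if (normalizeVendorName(vendor.name) === key) {
      match = vendor;
      break;
    }
  }
  if (!match) {
    // No existing record: create one (sequential ids are an assumption).
    match = { id: vendors.length + 1, name: rawName.trim(), invoiceCount: 0, totalSpend: 0 };
    vendors.push(match);
  }
  match.invoiceCount += 1;
  match.totalSpend += invoiceTotal;
  return match;
}
```

With this sketch, "Harbor Fuel Co." and "HARBOR FUEL COMPANY" both normalize to "harbor fuel" and resolve to the same vendor record.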
Invoice Status After OCR
Depending on the confidence score, invoices are routed to different statuses:
- PROCESSING — extraction is in progress
- VERIFIED — high confidence extraction, ready for approval
- NEEDS_REVIEW — low or medium confidence, requires manual verification
From there, invoices follow the standard approval workflow: VERIFIED > APPROVED > EXECUTED.
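The routing and workflow described above can be summarized in a short sketch. The function names are hypothetical. The step from NEEDS_REVIEW back into the approval workflow is left out because it depends on manual verification.

```typescript
type InvoiceStatus =
  | "PROCESSING"
  | "VERIFIED"
  | "NEEDS_REVIEW"
  | "APPROVED"
  | "EXECUTED";

// Status assigned once extraction finishes, based on the confidence level.
function statusAfterOcr(level: "HIGH" | "MEDIUM" | "LOW"): InvoiceStatus {
  return level === "HIGH" ? "VERIFIED" : "NEEDS_REVIEW";
}

// Next step in the standard approval workflow: VERIFIED > APPROVED > EXECUTED.
// Returns null for statuses that have no automatic next step.
function nextApprovalStatus(current: InvoiceStatus): InvoiceStatus | null {
  if (current === "VERIFIED") return "APPROVED";
  if (current === "APPROVED") return "EXECUTED";
  return null;
}
```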
Best Practices for Upload Quality
- Use 300 DPI resolution or higher for scanned documents
- Upload in PDF format when possible
- Ensure text is clear and legible
- Keep documents upright (not rotated or skewed)
- Use good lighting for photos of paper invoices
Privacy
All OCR processing runs on SOPHIOS private infrastructure. Invoice images and extracted data are never sent to external AI providers. See Private AI Infrastructure for details.
Related Pages:
- Invoice Management — Full invoice workflow guide
- AI Chat Assistant — Ask questions about your processed invoices
- Private AI Infrastructure — How your data stays secure