OCR & Document Processing

OCR (Optical Character Recognition) is an advanced feature of Heptora that allows you to automatically extract information from physical or digital documents. It converts printed, handwritten, or digital text into structured data that can be processed, validated, and used in your automations.

Digital Document Transformation

Heptora’s OCR system eliminates the need for manual data entry, reducing errors and dramatically accelerating document processing in your workflows.

Advantages of Integrated OCR

📄 Multiple Formats: Processes PDF, JPG/PNG/TIFF images, and scanned documents
🤖 Integrated AI: Automatic document type classification
🎯 Intelligent Extraction: Automatically identifies key fields without prior configuration
📊 Complex Structures: Recognizes tables, grids, and complex layouts
✓ Automatic Validation: Verifies formats of tax IDs, IBANs, dates, and other data
🌍 Multilingual: Support for multiple languages and special characters
📈 High Accuracy: Handles documents of varying quality

Extraction Capabilities

Supported Formats

Heptora’s OCR can process a wide variety of input formats:

Digital Documents

Native PDFs: Digitally created PDF documents
Scanned PDFs: Physical documents converted to PDF
Hybrid documents: PDFs with both digital and scanned content

Images

JPG/JPEG: Document photographs
PNG: Screenshots and digital documents
TIFF: High-quality scanned documents
BMP: Bitmap images

Adaptive Quality

The system automatically adapts to different conditions:

Documents with variable resolution (from 150 DPI)
Images with uneven lighting
Documents with slight rotation or tilt
Texts with different font sizes
Documents with watermarks or stamps

Structured Extraction

Heptora’s OCR goes beyond simple text extraction, identifying document structure:

Text Fields

Headers: Main titles and sections
Paragraphs: Text blocks with semantic structure
Lists: Enumerated or bulleted items
Footnotes: References and annotations
Form fields: Data in predefined templates

Tables and Grids

The system recognizes and preserves tabular structure:

{
  "table_1": {
    "headers": ["Item", "Quantity", "Price", "Total"],
    "rows": [
      ["Product A", "10", "$25.00", "$250.00"],
      ["Product B", "5", "$40.00", "$200.00"]
    ],
    "total_rows": 2
  }
}
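
The table structure above maps naturally onto one record per row. The following is a minimal Python sketch that assumes the JSON shape shown here; `table_to_records` is a hypothetical helper written for illustration, not a Heptora function.

```python
def table_to_records(table: dict) -> list:
    """Pair each row's cells with the column headers."""
    headers = table["headers"]
    records = []
    for row in table["rows"]:
        # A row with a different cell count means the structure was not preserved.
        if len(row) != len(headers):
            raise ValueError(f"Row has {len(row)} cells, expected {len(headers)}")
        records.append(dict(zip(headers, row)))
    return records


sample = {
    "table_1": {
        "headers": ["Item", "Quantity", "Price", "Total"],
        "rows": [
            ["Product A", "10", "$25.00", "$250.00"],
            ["Product B", "5", "$40.00", "$200.00"],
        ],
        "total_rows": 2,
    }
}

records = table_to_records(sample["table_1"])
print(records[0]["Item"], records[0]["Total"])  # Product A $250.00
```

Cell values stay as strings here, exactly as they appear in the extracted table; converting quantities and amounts to numbers is a separate step.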

Graphic Elements

Identification of relevant non-textual elements:

Logos: Extraction and position of corporate images
Signatures: Detection of signed areas
Barcodes: Reading 1D and 2D codes
QR Codes: Extraction of encoded information
Stamps: Identification of official marks

Coordinates and Positioning

Each extracted element includes its exact location in the document:

{
  "field": "tax_id",
  "value": "12-3456789",
  "confidence": 0.98,
  "coordinates": {
    "x": 120,
    "y": 350,
    "width": 100,
    "height": 20,
    "page": 1
  }
}

This allows:

Verifying the expected position of critical fields
Detecting misplaced or missing fields
Creating visualizations of the extraction process
Validating document structure
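
Verifying the expected position of a field can be sketched as a bounding-box containment test. The example below assumes that all coordinates share one unit and origin (for instance, pixels measured from the top-left corner), which is not specified above; `is_within_region` and `expected_region` are hypothetical names.

```python
def is_within_region(coords: dict, region: dict, tolerance: int = 0) -> bool:
    """Return True if the field's box lies fully inside the region on the same page."""
    if coords.get("page") != region.get("page"):
        return False
    return (
        coords["x"] >= region["x"] - tolerance
        and coords["y"] >= region["y"] - tolerance
        and coords["x"] + coords["width"] <= region["x"] + region["width"] + tolerance
        and coords["y"] + coords["height"] <= region["y"] + region["height"] + tolerance
    )


field = {
    "field": "tax_id",
    "value": "12-3456789",
    "confidence": 0.98,
    "coordinates": {"x": 120, "y": 350, "width": 100, "height": 20, "page": 1},
}

# Hypothetical area where the tax ID is expected on this layout.
expected_region = {"x": 100, "y": 330, "width": 200, "height": 60, "page": 1}

in_place = is_within_region(field["coordinates"], expected_region)
print(in_place)  # True
```

A field that fails this test may be misplaced, or the document may follow a different layout than expected.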

Supported Document Types

Invoices

Complete extraction of commercial invoice information:

Issuer Data

Business name and trade name
Issuer tax ID
Complete fiscal address
Contact details (phone, email, website)

Recipient Data

Customer name or business name
Recipient tax ID
Billing address
Shipping address (if different)

Invoice Information

Invoice number
Invoice series
Issue date
Due date
Billing period

Items and Totals

Product/service descriptions
Quantities and units
Unit prices
Applied discounts
Tax base by tax rate
Itemized tax amounts
Withholdings (income tax, etc.)
Total invoice amount

Additional Information

Payment method
Bank details (IBAN)
Purchase order reference
Notes and observations

Contracts

Intelligent analysis of contractual documents:

Party Identification

Names of contracting parties
Legal representatives
Powers and authorities
Registered addresses

Main Clauses

Contract subject
Duration and validity
Automatic renewals
Termination conditions
Penalties

Economic Information

Price or consideration
Payment method and terms
Price reviews
Guarantees and bonds

Relevant Dates

Signature date
Effective start date
End date
Important milestones

Signatures and Annexes

Detection of signature areas
Identification of signatories
List of mentioned annexes
References to external documents

Forms

Automated processing of structured forms:

Field Types

Free text: Names, addresses, comments
Checkboxes: Checked/unchecked options
Radio buttons: Single selection among options
Dropdown lists: Selected values
Dates: In various formats (mm/dd/yyyy, etc.)
Signatures: Handwritten or digital

Field Validation

Required fields completed
Correct data format
Consistency between related fields
Detection of blank fields

Use Cases

Job applications
Registration forms
Surveys and questionnaires
Medical forms
Administrative declarations

Certificates

Data extraction from certified documents:

Academic Certificates

Issuing institution
Degree obtained
Grades
Issue date
Registration number

Professional Certificates

Certifying organization
Certification type
Level or category
Issue and expiration dates
Verification code

Official Certificates

Issuing entity
Certification subject
Beneficiary data
Validity
Official seals and signatures

Identity Documents

Secure extraction of personal identification data:

ID Cards

Document number
Full name
Date of birth
Nationality
Issue and expiration dates
Support number

Passports

Passport number
Document type
Issuing country
Personal data
MRZ (Machine Readable Zone)
Issue and expiration dates
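
The MRZ includes check digits defined by ICAO Doc 9303, which can be used to cross-check OCR output for fields such as the passport number and dates. The sketch below implements that check digit calculation; whether Heptora applies it internally is not stated here, and the function names are illustrative.

```python
def mrz_char_value(char: str) -> int:
    """Map an MRZ character to its numeric value (ICAO Doc 9303)."""
    if char in "0123456789":
        return int(char)
    if "A" <= char <= "Z":
        return ord(char) - ord("A") + 10
    if char == "<":  # filler character
        return 0
    raise ValueError(f"Invalid MRZ character: {char!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


# Values from the specimen passport in ICAO Doc 9303.
print(mrz_check_digit("L898902C3"))  # 6
print(mrz_check_digit("740812"))     # 2
```

If the computed digit differs from the one read in the MRZ, at least one character in that field was probably misread.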

Driver’s Licenses

License number
Authorized categories
Issue date
Expiration date
Restrictions

Receipts and Tickets

Processing of payment vouchers:

Purchase Tickets

Issuing merchant
Merchant tax ID
Purchase date and time
List of products/services
Individual prices
Applied discounts
Total paid
Payment method

Payment Receipts

Payment description
Issuer and recipient
Amount
Payment date
Payment method
Receipt reference

Use Cases

Company expense management
Parking ticket control
Utility receipt processing
Payment reconciliation

Validation and Enrichment

Format Validation

The system includes specific validators for structured data:

Tax IDs / Business Numbers

Validation of check digit algorithm
Verification of correct format
Detection of impossible numbers
Identification of type (individual/business)

IBAN

Validation of country code
Verification of check digits
Format according to international standard
Correct length by country
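
The IBAN checks listed above correspond to the standard mod-97 validation from ISO 13616. The sketch below is illustrative rather than Heptora's implementation: `IBAN_LENGTHS` covers only a few countries as an example, and the function names are hypothetical.

```python
import re

# Illustrative subset; the full list is published in the IBAN registry.
IBAN_LENGTHS = {"DE": 22, "ES": 24, "FR": 27, "GB": 22}


def is_valid_iban(raw: str) -> bool:
    """Check country code, length, and mod-97 check digits."""
    iban = raw.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]+", iban):
        return False
    if IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    # Move the first four characters to the end, then map letters to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def format_iban(raw: str) -> str:
    """Print format: groups of four characters separated by spaces."""
    iban = raw.replace(" ", "").upper()
    return " ".join(iban[i:i + 4] for i in range(0, len(iban), 4))


print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))  # True
print(is_valid_iban("GB83 WEST 1234 5698 7654 32"))  # False
print(format_iban("DE89370400440532013000"))         # DE89 3704 0044 0532 0130 00
```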

Dates

Recognized formats: mm/dd/yyyy, dd-mm-yy, yyyy-mm-dd, etc.
Validation of impossible dates (February 31st, etc.)
Normalization to standard format
Detection of temporal inconsistencies
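
Date normalization and the rejection of impossible dates can be sketched with Python's standard library. The format list mirrors the examples above; `normalize_date` is a hypothetical helper, and the order of `CANDIDATE_FORMATS` decides how an ambiguous string is read, so it should match the documents you expect.

```python
from datetime import datetime
from typing import Optional

# Tried in order: mm/dd/yyyy, dd-mm-yy, yyyy-mm-dd.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d-%m-%y", "%Y-%m-%d")


def normalize_date(text: str) -> Optional[str]:
    """Return the date in ISO format, or None if no format yields a real date."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            # Wrong format or an impossible date such as February 31st.
            continue
    return None


print(normalize_date("03/15/2024"))  # 2024-03-15
print(normalize_date("15-03-24"))    # 2024-03-15
print(normalize_date("02/31/2024"))  # None
```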

Amounts

Recognition of decimal separators (. or ,)
Detection of currency symbols ($, €, etc.)
Normalization to numeric format
Validation of expected ranges
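
Amount normalization has to cope with both decimal conventions. The sketch below uses a simple heuristic: the rightmost separator is treated as the decimal mark, unless it is the only kind of separator present and is followed by exactly three digits, in which case it is read as a thousands separator. This heuristic and the name `parse_amount` are assumptions for illustration, not Heptora's documented behavior.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a formatted amount such as '$1,250.00' or '1.250,00 €' to Decimal."""
    cleaned = re.sub(r"[^\d.,-]", "", text)  # drop currency symbols and spaces
    sep_pos = max(cleaned.rfind("."), cleaned.rfind(","))
    if sep_pos == -1:
        return Decimal(cleaned)
    integer_part, fraction = cleaned[:sep_pos], cleaned[sep_pos + 1:]
    separators = {c for c in cleaned if c in ".,"}
    if len(fraction) == 3 and len(separators) == 1:
        # '1,250' or '1.250': read the separator as a thousands mark.
        return Decimal(re.sub(r"[.,]", "", cleaned))
    integer_digits = re.sub(r"[.,]", "", integer_part)
    return Decimal(f"{integer_digits}.{fraction}")


print(parse_amount("$1,250.00"))   # 1250.00
print(parse_amount("1.250,00 €"))  # 1250.00
print(parse_amount("1,250"))       # 1250
```

A string with no digits raises `decimal.InvalidOperation`, which a caller can treat as a failed extraction.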

Emails and URLs

Email format validation
URL structure verification
Domain detection

Inconsistency Detection

The system automatically identifies anomalies:

Mathematical Inconsistencies

{
  "error": "calculation_mismatch",
  "field": "total_invoice",
  "extracted_value": "$1,250.00",
  "calculated_value": "$1,235.50",
  "difference": "$14.50",
  "severity": "high"
}
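
A mismatch report like the one above can be produced by recomputing the total from its parts. The sketch below is one possible check, using `Decimal` to avoid floating-point rounding; `check_invoice_total`, its parameters, and the tolerance are assumptions, and the `severity` field is omitted because the rule for assigning it is not described here.

```python
from decimal import Decimal
from typing import Optional


def check_invoice_total(
    line_totals: list,
    tax: Decimal,
    extracted_total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> Optional[dict]:
    """Return a mismatch report, or None when the totals agree within tolerance."""
    calculated = sum(line_totals, Decimal("0")) + tax
    difference = abs(extracted_total - calculated)
    if difference <= tolerance:
        return None
    return {
        "error": "calculation_mismatch",
        "field": "total_invoice",
        "extracted_value": f"${extracted_total:,.2f}",
        "calculated_value": f"${calculated:,.2f}",
        "difference": f"${difference:,.2f}",
    }


# Hypothetical line totals and tax that reproduce the figures above.
mismatch = check_invoice_total(
    [Decimal("250.00"), Decimal("200.00"), Decimal("700.00")],
    tax=Decimal("85.50"),
    extracted_total=Decimal("1250.00"),
)
print(mismatch["difference"])  # $14.50
```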

Missing Data

Empty required fields
Incomplete sections
Missing pages (in multi-page documents)

Outliers

Amounts outside expected range
Future dates in historical documents
Duplicate data
Inconsistent formats

AI Enrichment

Artificial intelligence complements extraction with additional analysis:

Automatic Classification

The system identifies document type without prior configuration:

{
  "document_type": "invoice",
  "confidence": 0.95,
  "sub_type": "service_invoice",
  "detected_features": [
    "invoice_number",
    "tax_breakdown",
    "line_items",
    "company_header"
  ]
}

Semantic Extraction

Understands content meaning, not just text:

Named entities: People, organizations, locations
Relationships: Who invoices whom, who signs what
Intentions: Request, notification, certification
Sentiment: Document tone (for contracts and communications)

Categorization

Automatic document organization:

By document type
By supplier or customer
By responsible department
By date or period
By amount or relevance

Confidence Score per Field

Each extracted value includes a confidence score:

{
  "invoice_number": {
    "value": "INV-2024-00123",
    "confidence": 0.99,
    "status": "verified"
  },
  "invoice_date": {
    "value": "2024-03-15",
    "confidence": 0.95,
    "status": "verified"
  },
  "total_amount": {
    "value": "$1,250.00",
    "confidence": 0.72,
    "status": "review_required",
    "reason": "low_image_quality"
  }
}

Confidence Thresholds

0.95 - 1.00: Automatically verified
0.80 - below 0.95: Accepted with validation
0.60 - below 0.80: Review recommended
< 0.60: Review required
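
The bands above can be expressed as a small lookup. In this sketch the lower bound of each band is inclusive, which is an assumption, and the returned labels are hypothetical names rather than status values defined by Heptora.

```python
def review_status(confidence: float) -> str:
    """Map a confidence score between 0 and 1 to a review band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence}")
    if confidence >= 0.95:
        return "auto_verified"
    if confidence >= 0.80:
        return "accepted_with_validation"
    if confidence >= 0.60:
        return "review_recommended"
    return "review_required"


print(review_status(0.99))  # auto_verified
print(review_status(0.72))  # review_recommended
```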

Assisted Review

Specialized interface for human validation of low-confidence data:

Original Document View

Visualization of source document
Highlighting of extracted fields
Zoom on problematic areas
Page navigation

Validation Panel

List of fields to review
Confidence indicator per field
Alternative suggestions
History of similar extractions

Quick Correction

Direct value editing
Selection among suggested options
Marking fields as correct
Indication of OCR errors

Workflow

1. The system marks fields with confidence < 0.80
2. Marked fields are sent to the human review queue
3. A user validates or corrects the values
4. The system learns from the corrections
5. Validated data is integrated into the process
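
The first step of this workflow can be sketched as a filter over the per-field confidence scores. The data shape follows the per-field example earlier in this section; `fields_needing_review` is a hypothetical helper, and fields without a `confidence` key are skipped.

```python
REVIEW_THRESHOLD = 0.80


def fields_needing_review(extracted_data: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return the names of fields whose confidence is below the threshold."""
    return [
        name
        for name, field in extracted_data.items()
        if isinstance(field, dict)
        and "confidence" in field
        and field["confidence"] < threshold
    ]


extracted = {
    "invoice_number": {"value": "INV-2024-00123", "confidence": 0.99},
    "invoice_date": {"value": "2024-03-15", "confidence": 0.95},
    "total_amount": {"value": "$1,250.00", "confidence": 0.72},
}

to_review = fields_needing_review(extracted)
print(to_review)  # ['total_amount']
```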

Process Integration

OCR Block in Builder

OCR integrates as a draggable block in the visual process designer:

Basic Configuration

Block: OCR Document Processing
Input: Document (file or URL)
Configuration:
  - Document type: Invoice
  - Language: English
  - Quality: High precision
Output: Structured data (JSON)

Flow Location

The OCR block can be placed at any point in the process:

[Receive Email] → [Download Attachment] → [OCR] → [Validate Data] → [Insert into ERP]

Visual Configuration

From the visual builder you can:

Select document type
Define required fields
Establish validation rules
Configure actions based on confidence
Define alternative flows for review

Zone Configuration

For documents with consistent layout, you can define specific zones:

Rectangular Zones

Define exact document areas:

{
  "zones": [
    {
      "name": "invoice_number",
      "coordinates": {
        "x": 450,
        "y": 100,
        "width": 150,
        "height": 30
      },
      "page": 1,
      "type": "text",
      "validation": "alphanumeric"
    },
    {
      "name": "total_amount",
      "coordinates": {
        "x": 450,
        "y": 650,
        "width": 100,
        "height": 25
      },
      "page": 1,
      "type": "currency",
      "validation": "positive_number"
    }
  ]
}

Relative Zones

Define areas relative to fixed elements:

{
  "zone": "client_name",
  "reference_text": "Customer:",
  "offset_x": 100,
  "offset_y": 0,
  "width": 300,
  "height": 20
}
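
To illustrate how a relative zone could resolve to absolute coordinates, the sketch below adds the offsets to the position where the reference text was found. It assumes offsets are measured from the top-left corner of the reference text, which is not specified above; `resolve_relative_zone` and `reference_box` are hypothetical names.

```python
def resolve_relative_zone(zone: dict, reference_box: dict) -> dict:
    """Compute the absolute rectangle of a zone from its reference text position."""
    return {
        "name": zone["zone"],
        "x": reference_box["x"] + zone["offset_x"],
        "y": reference_box["y"] + zone["offset_y"],
        "width": zone["width"],
        "height": zone["height"],
    }


zone = {
    "zone": "client_name",
    "reference_text": "Customer:",
    "offset_x": 100,
    "offset_y": 0,
    "width": 300,
    "height": 20,
}

# Hypothetical location where "Customer:" was detected on the page.
reference_box = {"x": 50, "y": 200, "width": 80, "height": 20}

absolute = resolve_relative_zone(zone, reference_box)
print(absolute["x"], absolute["y"])  # 150 200
```

Because the zone follows the reference text, it stays valid when the whole block shifts slightly between documents.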

Zone Advantages

Greater precision in structured documents
Shorter processing time
Reduction of false positives
Stricter validation

Document Templates

Predefined models to accelerate configuration:

Included Templates

Heptora includes templates for the most common documents:

Generic invoices: Standard US/international model
Electronic invoices: Standard electronic format
Delivery notes: Shipping documents
Purchase orders: Order documents
Employment contracts: Standard models
Identity documents: Various national IDs

Create Custom Templates

For organization-specific documents:

1. Upload sample documents (minimum 3-5 examples)
2. Label key fields in each example
3. Define specific validations
4. Test with new documents
5. Refine and publish the template

Use Templates

OCR Configuration:
  template: "vendor_invoice_xyz"
  fallback: "generic_invoice"
  confidence_threshold: 0.85

Structured Output

The OCR result is a complete JSON object:

{
  "document_id": "doc_20240315_123456",
  "processing_date": "2024-03-15T10:30:00Z",
  "document_type": "invoice",
  "confidence": 0.94,
  "pages": 1,
  "language": "en",

  "extracted_data": {
    "invoice_number": {
      "value": "INV-2024-00123",
      "confidence": 0.99,
      "coordinates": {"x": 450, "y": 100, "width": 150, "height": 30}
    },
    "invoice_date": {
      "value": "2024-03-15",
      "confidence": 0.97,
      "coordinates": {"x": 450, "y": 130, "width": 100, "height": 25}
    },
    "supplier": {
      "name": "Example Supplier Inc.",
      "tax_id": "12-3456789",
      "address": "123 Main Street, New York, NY 10001"
    },
    "customer": {
      "name": "My Company LLC",
      "tax_id": "98-7654321",
      "address": "456 Business Ave, Los Angeles, CA 90001"
    },
    "line_items": [
      {
        "description": "Product A",
        "quantity": 10,
        "unit_price": 25.00,
        "total": 250.00
      }
    ],
    "totals": {
      "subtotal": 250.00,
      "tax": 20.00,
      "total": 270.00,
      "currency": "USD"
    }
  },

  "validation": {
    "status": "validated",
    "errors": [],
    "warnings": ["Image quality could be improved"]
  },

  "metadata": {
    "file_name": "invoice_example.pdf",
    "file_size": 245678,
    "processing_time_ms": 2340
  }
}

Accessing the Data

In your process, access extracted data:

# Get OCR result
ocr_result = step_output["ocr_document"]

# Access specific fields
invoice_num = ocr_result["extracted_data"]["invoice_number"]["value"]
total = ocr_result["extracted_data"]["totals"]["total"]
supplier_tax_id = ocr_result["extracted_data"]["supplier"]["tax_id"]

# Check confidence
if ocr_result["confidence"] > 0.9:
    # Automatic processing
    process_automatically(ocr_result)
else:
    # Send to review
    send_to_review(ocr_result)

Post-processing

Transform and normalize extracted data:

Common Transformations

# Normalize tax IDs (remove spaces, hyphens)
tax_id_clean = normalize_tax_id(extracted_tax_id)

# Convert dates to ISO format
date_iso = convert_to_iso_date(extracted_date)

# Format amounts
amount_decimal = parse_currency(extracted_amount)

# Validate and format IBAN
iban_formatted = validate_and_format_iban(extracted_iban)

Data Enrichment

Complement extracted data with external information:

# Look up supplier in database
supplier = database.find_supplier_by_tax_id(extracted_tax_id)
if supplier:
    ocr_result["supplier_id"] = supplier.id
    ocr_result["supplier_category"] = supplier.category

# Validate product codes
for item in ocr_result["extracted_data"]["line_items"]:
    product = database.find_product(item["description"])
    if product:
        item["product_id"] = product.id
        item["product_category"] = product.category

Business Rules

Apply organization-specific logic:

# Classify invoice by amount
if total > 10000:
    approval_level = "director"
elif total > 1000:
    approval_level = "manager"
else:
    approval_level = "supervisor"

# Assign to department by supplier
department = get_department_by_supplier(supplier_tax_id)

# Calculate payment date based on terms
payment_date = calculate_payment_date(
    invoice_date,
    payment_terms,
    holidays_calendar
)

Practical Use Cases

Accounts Payable Automation

Scenario: Automatic processing of supplier invoices

1. [Email with invoice] → [Download PDF attachment]
2. [OCR: Extract invoice data]
3. [Validate: Supplier tax ID exists in system]
4. [Verify: Calculations are correct]
5. [Check: Associated purchase order]
6. [If confidence ≥ 95%] → [Register automatically in ERP]
7. [If confidence < 95%] → [Send to human validation]
8. [Update status] → [Notify accounting]

Benefits:

80% reduction in processing time
Elimination of transcription errors
Complete process traceability
Staff freed up for analysis tasks

Contract Management

Scenario: Extraction of expiration dates and key conditions

1. [Signed contract] → [Scan or upload PDF]
2. [OCR: Extract clauses and dates]
3. [AI: Identify renewal conditions]
4. [Extract: Expiration dates]
5. [Create: Calendar alerts]
6. [Register: In document management system]
7. [30 days before expiration] → [Notify responsible party]

Benefits:

No missed renewal dates
Centralization of contractual conditions
Proactive alerts
Facilitates audits and reviews

Expense Control

Scenario: Processing employee receipts and tickets

1. [Employee photographs receipt] → [Sends via mobile app]
2. [OCR: Extract merchant, date, amount]
3. [Classify: Expense type (meals, transport, etc.)]
4. [Validate: Within company policy]
5. [Associate: To project or client]
6. [If valid] → [Approve automatically]
7. [Register: In reimbursement system]
8. [Generate: Monthly expense report]

Benefits:

Immediate reimbursement processing
Compliance with expense policies
Traceability and automatic reporting
Improved employee experience

Customer Onboarding

Scenario: Identity and documentation verification

1. [Customer uploads ID and documents] → [Web portal]
2. [OCR: Extract ID data]
3. [Validate: ID number correct]
4. [Verify: Legal age]
5. [Compare: Data with completed form]
6. [OCR: Process additional documents]
7. [If all OK] → [Activate account automatically]
8. [If discrepancies] → [Request clarification]

Benefits:

Instant onboarding (24/7)
Reduced abandonment
Regulatory compliance (KYC)
Improved customer experience

Best Practices

Document Preparation

Image Quality

To maximize accuracy:

Resolution: 300 DPI or higher recommended, 400-600 DPI optimal (input from 150 DPI is accepted but may reduce accuracy)
Format: Preferably PDF, or high-quality PNG/JPG
Lighting: Uniform, without pronounced shadows
Orientation: Document properly aligned
Size: Avoid excessively large images (> 10MB)

Scanning

If scanning physical documents:

Use color or grayscale scanning mode
Avoid plain text mode (less flexibility)
Clean scanner glass
Flatten wrinkled documents
Scan one page per file

Mobile Photography

When using a phone:

Good natural or artificial lighting
Avoid glare and reflections
Frame entire document
Keep phone parallel to document
Use apps with automatic perspective correction

Performance Optimization

Batch Processing

For large volumes:

# Process multiple documents in parallel
documents = get_pending_documents()

# Divide into batches of 10
batches = chunk_list(documents, 10)

for batch in batches:
    results = process_ocr_batch(batch, parallel=True)
    save_results(results)
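
The `chunk_list` helper used above is not defined in this guide. A minimal sketch of what it could look like, assuming it simply splits a list into consecutive batches:

```python
def chunk_list(items: list, size: int) -> list:
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


batches = chunk_list(list(range(25)), 10)
print([len(batch) for batch in batches])  # [10, 10, 5]
```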

Result Caching

Avoid reprocessing documents:

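from datetime import timedelta

def get_ocr_result(document):
    # Check if already processed
    doc_hash = calculate_hash(document)
    cached_result = cache.get(doc_hash)
    if cached_result is not None:
        return cached_result

    # Not cached: process and keep the result for 7 days
    # (adjust the expiry value to the type your cache expects)
    result = process_ocr(document)
    cache.set(doc_hash, result, expiry=timedelta(days=7))
    return result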
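
The cache key should depend only on the document's content, so that the same file uploaded twice maps to the same entry. One way to compute such a key is a SHA-256 digest over the file bytes; the sketch below is an assumption about what `calculate_hash` could do, reading from any binary stream in chunks so large files are not loaded into memory at once.

```python
import hashlib
import io
from typing import BinaryIO


def calculate_hash(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Two uploads with identical bytes produce the same cache key.
first = calculate_hash(io.BytesIO(b"%PDF-1.7 example bytes"))
second = calculate_hash(io.BytesIO(b"%PDF-1.7 example bytes"))
print(first == second)  # True
```

With a real file, pass the result of `open(path, "rb")` instead of the in-memory stream.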

Incremental Processing

For multi-page documents:

Process pages in parallel
Allow early exit if the initial pages indicate an invalid document
Show progress to user

Error Management

Error Types

try:
    result = process_ocr(document)
except OCRError as e:
    if e.type == "unreadable_document":
        notify_user("Document is not readable. Please improve quality.")
    elif e.type == "unsupported_format":
        notify_user("Unsupported format. Use PDF, JPG, PNG, TIFF, or BMP.")
    elif e.type == "corrupted_file":
        notify_user("File is corrupted. Please upload again.")
    else:
        log_error(e)
        send_to_support(document, e)

Smart Retries

max_retries = 3
result = None  # stays None if every attempt fails

for attempt in range(1, max_retries + 1):
    try:
        result = process_ocr(document, quality="high")
        break
    except LowConfidenceError:
        if attempt < max_retries:
            # Enhance the image before the next attempt
            document = enhance_image_quality(document)
        else:
            # Retries exhausted: send to manual review
            send_to_review_queue(document)

Security and Privacy

Data Minimization

Extract only necessary fields
Don’t store unnecessary personal data
Implement limited retention of original documents

Encryption

Encrypt documents in transit (HTTPS)
Encrypt storage of sensitive documents
Use secrets for external system credentials

Traceability

Log all operations:

audit_log = {
    "timestamp": "2024-03-15T10:30:00Z",
    "user": "user@company.com",
    "action": "ocr_process",
    "document_id": "doc_123456",
    "document_type": "invoice",
    "fields_extracted": ["invoice_number", "total", "supplier_tax_id"],
    "confidence": 0.94,
    "status": "success"
}

log_to_audit_system(audit_log)

Anonymization

For documents with personal data:

# Anonymize before storing for analysis
anonymized = {
    "document_type": result["document_type"],
    "confidence": result["confidence"],
    "processing_time": result["metadata"]["processing_time_ms"],
    # Don't include personal data
}

store_for_analytics(anonymized)

Troubleshooting

Low Extraction Accuracy

Symptoms: Many fields with low confidence or incorrect values

Possible causes:

Insufficient image quality
Non-standard document format
Language not configured correctly
Document type misidentified

Solutions:

Improve image quality (higher resolution, better lighting)
Use specific templates for non-standard documents
Verify configured language is correct
Manually specify document type
Define specific zones for critical fields

Tables Not Recognized

Symptoms: Tables not extracted or lose structure

Possible causes:

Very faint table lines
Table without visible borders
Complex merged cells
Non-standard table format

Solutions:

Activate “advanced table detection” in configuration
Improve document contrast
For borderless tables, use spacing-based detection
Consider manual extraction for complex tables
Define expected table structure in template

Multi-page Documents

Symptoms: Only first page is processed

Possible causes:

Limited page configuration
Processing timeout
Very heavy document

Solutions:

Verify configuration: “Process all pages”
Increase processing timeout
Split very large documents (>50 pages)
Use batch processing for heavy documents

Special Characters Misinterpreted

Symptoms: Symbols or special characters incorrect

Possible causes:

Incorrect encoding
Language not configured
Non-standard typeface

Solutions:

Explicitly configure document language
Verify encoding (UTF-8 recommended)
For handwritten text, activate “handwriting recognition”
Apply post-processing to normalize characters

Slow Processing

Symptoms: OCR takes a long time

Possible causes:

Very large document or high resolution
Multi-page processing
Extraction of many tables
Limited system resources

Solutions:

Reduce resolution if > 600 DPI
Process pages in parallel
Use asynchronous processing for large documents
Implement caching for repeated documents
Consider scaling robot resources

Frequently Asked Questions

How accurate is Heptora’s OCR?

Accuracy varies by document type and quality:

Quality digital documents: 95-99% accuracy
Good quality scanned documents: 90-95%
Mobile photographed documents: 85-93%
Low quality documents: 70-85%

Fields with a confidence score below 0.80 (80%) are marked for review.

Can I process handwritten documents?

Yes, but with limitations. Legible handwriting has 70-85% accuracy. For forms with handwritten fields, it’s better to combine automatic OCR with human review of those specific fields.

How many documents can I process per month?

It depends on your Heptora plan. OCR consumes credits based on:

Number of pages processed
Document complexity (tables, low quality)
Advanced features (AI, validation)

Check your usage dashboard or contact sales.

Are documents stored in the cloud?

It depends on your configuration:

Local mode: Documents processed only on local robot, not sent to cloud
Hybrid mode: Document sent for processing but not permanently stored
Cloud mode: Documents stored according to your retention configuration

Choose based on your privacy requirements.

Can I train OCR with my documents?

Yes. You can create custom templates by training the system with examples of your specific documents. This significantly improves accuracy for proprietary or non-standard formats.

Does OCR work offline?

Basic processing can work locally on the robot, but advanced AI features (classification, semantic validation) require connectivity. Configure mode according to your needs.

What do I do with fields that always have low confidence?

For recurring problematic fields:

Define a specific zone for that field
Adjust validation parameters
Create a custom template
Consider specific post-processing
If it persists, implement human validation only for that field

Need more help?

If this guide didn’t solve your problem or you found an error in the documentation:

Technical support: help@heptora.com
Describe the type of document you’re trying to process
Include a sample document (without sensitive data)
Indicate specific fields with problems
Mention the confidence obtained in fields

Our team will help you optimize OCR for your specific documents.

Process Builder - How to create automations with OCR
Data Validation - Advanced validation rules (coming soon)
ERP Integrations - Connect extracted data with your ERP (coming soon)
Secrets Management - Protect external system credentials