OCR & Document Processing

OCR (Optical Character Recognition) is an advanced feature of Heptora that allows you to automatically extract information from physical or digital documents. It converts printed, handwritten, or digital text into structured data that can be processed, validated, and used in your automations.

Heptora’s OCR system eliminates the need for manual data entry, reducing errors and dramatically accelerating document processing in your workflows.

  • 📄 Multiple Formats: Processes PDF, JPG/PNG/TIFF images, and scanned documents
  • 🤖 Integrated AI: Automatic document type classification
  • 🎯 Intelligent Extraction: Automatically identifies key fields without prior configuration
  • 📊 Complex Structures: Recognizes tables, grids, and complex layouts
  • ✓ Automatic Validation: Verifies formats of tax IDs, IBANs, dates, and other data
  • 🌍 Multilingual: Support for multiple languages and special characters
  • 📈 High Accuracy: 95-99% on good-quality digital documents, 70-85% on low-quality ones

Heptora’s OCR can process a wide variety of input formats:

  • Native PDFs: Digitally created PDF documents
  • Scanned PDFs: Physical documents converted to PDF
  • Hybrid documents: PDFs with both digital and scanned content
  • JPG/JPEG: Document photographs
  • PNG: Screenshots and digital documents
  • TIFF: High-quality scanned documents
  • BMP: Bitmap images

The system automatically adapts to different conditions:

  • Documents with variable resolution (from 150 DPI)
  • Images with uneven lighting
  • Documents with slight rotation or tilt
  • Texts with different font sizes
  • Documents with watermarks or stamps

Heptora’s OCR goes beyond simple text extraction, identifying document structure:

  • Headers: Main titles and sections
  • Paragraphs: Text blocks with semantic structure
  • Lists: Enumerated or bulleted items
  • Footnotes: References and annotations
  • Form fields: Data in predefined templates

The system recognizes and preserves tabular structure:

{
  "table_1": {
    "headers": ["Item", "Quantity", "Price", "Total"],
    "rows": [
      ["Product A", "10", "$25.00", "$250.00"],
      ["Product B", "5", "$40.00", "$200.00"]
    ],
    "total_rows": 2
  }
}
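
As a rough illustration of how a table in this shape could be consumed downstream, the sketch below turns each row into a dictionary keyed by header. The function name `table_to_records` is hypothetical and not part of Heptora; only the `headers` and `rows` keys come from the example above.

```python
from typing import Any


def table_to_records(table: dict[str, Any]) -> list[dict[str, str]]:
    """Convert a headers/rows table into one dictionary per row."""
    headers = table["headers"]
    records = []
    for row in table["rows"]:
        # A row with a different cell count usually means the table
        # structure was not recognized correctly.
        if len(row) != len(headers):
            raise ValueError(f"Row has {len(row)} cells, expected {len(headers)}")
        records.append(dict(zip(headers, row)))
    return records


ocr_tables = {
    "table_1": {
        "headers": ["Item", "Quantity", "Price", "Total"],
        "rows": [
            ["Product A", "10", "$25.00", "$250.00"],
            ["Product B", "5", "$40.00", "$200.00"],
        ],
        "total_rows": 2,
    }
}

records = table_to_records(ocr_tables["table_1"])
```

Each record can then be addressed by column name, for example `records[0]["Total"]`.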

Identification of relevant non-textual elements:

  • Logos: Extraction and position of corporate images
  • Signatures: Detection of signed areas
  • Barcodes: Reading 1D and 2D codes
  • QR Codes: Extraction of encoded information
  • Stamps: Identification of official marks

Each extracted element includes its exact location in the document:

{
  "field": "tax_id",
  "value": "12-3456789",
  "confidence": 0.98,
  "coordinates": {
    "x": 120,
    "y": 350,
    "width": 100,
    "height": 20,
    "page": 1
  }
}

This allows:

  • Verifying the expected position of critical fields
  • Detecting misplaced or missing fields
  • Creating visualizations of the extraction process
  • Validating document structure
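
The position check in the first two points could be sketched as follows. This is a minimal example, assuming that field coordinates and the expected zone use the same units and origin; `box_within` and the expected zone values are hypothetical.

```python
def box_within(field_box: dict, expected_zone: dict, tolerance: int = 0) -> bool:
    """Return True if field_box lies inside expected_zone on the same page."""
    if field_box.get("page") != expected_zone.get("page"):
        return False
    # Expand the expected zone by the tolerance on every side.
    left = expected_zone["x"] - tolerance
    top = expected_zone["y"] - tolerance
    right = expected_zone["x"] + expected_zone["width"] + tolerance
    bottom = expected_zone["y"] + expected_zone["height"] + tolerance
    return (
        field_box["x"] >= left
        and field_box["y"] >= top
        and field_box["x"] + field_box["width"] <= right
        and field_box["y"] + field_box["height"] <= bottom
    )


# Coordinates taken from the tax_id example above.
tax_id_box = {"x": 120, "y": 350, "width": 100, "height": 20, "page": 1}
# Hypothetical zone where the tax ID is expected to appear.
expected = {"x": 100, "y": 340, "width": 150, "height": 40, "page": 1}

in_place = box_within(tax_id_box, expected)
```

A field that fails this check can be treated as misplaced and routed to review.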

Complete extraction of commercial invoice information:

  • Business name and trade name
  • Issuer tax ID
  • Complete fiscal address
  • Contact details (phone, email, website)
  • Customer name or business name
  • Recipient tax ID
  • Billing address
  • Shipping address (if different)
  • Invoice number
  • Invoice series
  • Issue date
  • Due date
  • Billing period
  • Product/service descriptions
  • Quantities and units
  • Unit prices
  • Applied discounts
  • Tax base by tax rate
  • Itemized tax amounts
  • Withholdings (income tax, etc.)
  • Total invoice amount
  • Payment method
  • Bank details (IBAN)
  • Purchase order reference
  • Notes and observations

Intelligent analysis of contractual documents:

  • Names of contracting parties
  • Legal representatives
  • Powers and authorities
  • Registered addresses
  • Contract subject
  • Duration and validity
  • Automatic renewals
  • Termination conditions
  • Penalties
  • Price or consideration
  • Payment method and terms
  • Price reviews
  • Guarantees and bonds
  • Signature date
  • Effective start date
  • End date
  • Important milestones
  • Detection of signature areas
  • Identification of signatories
  • List of mentioned annexes
  • References to external documents

Automated processing of structured forms:

  • Free text: Names, addresses, comments
  • Checkboxes: Checked/unchecked options
  • Radio buttons: Single selection among options
  • Dropdown lists: Selected values
  • Dates: In various formats (mm/dd/yyyy, etc.)
  • Signatures: Handwritten or digital
  • Required fields completed
  • Correct data format
  • Consistency between related fields
  • Detection of blank fields
  • Job applications
  • Registration forms
  • Surveys and questionnaires
  • Medical forms
  • Administrative declarations

Data extraction from certified documents:

  • Issuing institution
  • Degree obtained
  • Grades
  • Issue date
  • Registration number
  • Certifying organization
  • Certification type
  • Level or category
  • Issue and expiration dates
  • Verification code
  • Issuing entity
  • Certification subject
  • Beneficiary data
  • Validity
  • Official seals and signatures

Secure extraction of personal identification data:

  • Document number
  • Full name
  • Date of birth
  • Nationality
  • Issue and expiration dates
  • Support number
  • Passport number
  • Document type
  • Issuing country
  • Personal data
  • MRZ (Machine Readable Zone)
  • Issue and expiration dates
  • License number
  • Authorized categories
  • Issue date
  • Expiration date
  • Restrictions
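
Fields read from a passport MRZ carry check digits that can be recomputed to catch OCR misreads. The sketch below follows the check digit scheme of ICAO Doc 9303 (weights 7, 3, 1; letters valued 10 to 35; the filler `<` valued 0). How Heptora validates MRZ data internally is not described here, and `mrz_check_digit` is a hypothetical helper.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for an MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for position, char in enumerate(field):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0
        else:
            raise ValueError(f"Invalid MRZ character: {char!r}")
        total += value * weights[position % 3]
    return total % 10


# Document number and check digit from the ICAO specimen passport.
document_number = "L898902C3"
extracted_check_digit = 6
mrz_ok = mrz_check_digit(document_number) == extracted_check_digit
```

If the recomputed digit differs from the extracted one, at least one character in the field was read incorrectly.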

Processing of payment vouchers:

  • Issuing merchant
  • Merchant tax ID
  • Purchase date and time
  • List of products/services
  • Individual prices
  • Applied discounts
  • Total paid
  • Payment method
  • Payment concept
  • Issuer and recipient
  • Amount
  • Payment date
  • Payment method
  • Receipt reference
  • Company expense management
  • Parking ticket control
  • Utility receipt processing
  • Payment reconciliation

The system includes specific validators for structured data:

Tax IDs:

  • Validation of check digit algorithm
  • Verification of correct format
  • Detection of impossible numbers
  • Identification of type (individual/business)

IBANs:

  • Validation of country code
  • Verification of check digits
  • Format according to international standard
  • Correct length by country

Dates:

  • Recognized formats: mm/dd/yyyy, dd-mm-yy, yyyy-mm-dd, etc.
  • Validation of impossible dates (February 31st, etc.)
  • Normalization to standard format
  • Detection of temporal inconsistencies

Amounts:

  • Recognition of decimal separators (. or ,)
  • Detection of currency symbols ($, €, etc.)
  • Normalization to numeric format
  • Validation of expected ranges

Emails and URLs:

  • Email format validation
  • URL structure verification
  • Domain detection
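
The country-code, check-digit, and length checks listed above correspond to standard IBAN validation (ISO 13616, mod-97). The sketch below shows one way to implement them; it is not Heptora's validator, `is_valid_iban` is a hypothetical name, and the length table covers only a few countries for illustration.

```python
# Partial table of IBAN lengths by country code (illustrative only).
IBAN_LENGTHS = {"DE": 22, "ES": 24, "FR": 27, "GB": 22, "NL": 18}


def is_valid_iban(iban: str) -> bool:
    """Validate structure, known country length, and mod-97 check digits."""
    compact = iban.replace(" ", "").upper()
    if not (compact.isascii() and compact.isalnum()):
        return False
    if not 15 <= len(compact) <= 34:
        return False
    country, check_digits = compact[:2], compact[2:4]
    if not (country.isalpha() and check_digits.isdigit()):
        return False
    expected_length = IBAN_LENGTHS.get(country)
    if expected_length is not None and len(compact) != expected_length:
        return False
    # Move the first four characters to the end, then map letters to
    # numbers (A=10 ... Z=35) and check the remainder modulo 97.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1


valid = is_valid_iban("GB82 WEST 1234 5698 7654 32")
invalid = is_valid_iban("GB82 WEST 1234 5698 7654 33")
```

Countries missing from the table are checked only against the general 15 to 34 character range.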

The system automatically identifies anomalies:

{
  "error": "calculation_mismatch",
  "field": "total_invoice",
  "extracted_value": "$1,250.00",
  "calculated_value": "$1,235.50",
  "difference": "$14.50",
  "severity": "high"
}

Detected anomalies also include:

  • Empty required fields
  • Incomplete sections
  • Missing pages (in multi-page documents)
  • Amounts outside expected range
  • Future dates in historical documents
  • Duplicate data
  • Inconsistent formats
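
A calculation mismatch such as the one shown above could be detected by recomputing the total from its parts. This is a minimal sketch, assuming US-formatted amounts; the function names and the severity threshold are hypothetical, and only the record layout comes from the example.

```python
from decimal import Decimal
from typing import Optional


def parse_amount(text: str) -> Decimal:
    """Parse a US-formatted amount such as "$1,250.00"."""
    return Decimal(text.replace("$", "").replace(",", ""))


def check_total(extracted_total: str, component_amounts: list[str]) -> Optional[dict]:
    """Return a mismatch record, or None if the extracted total adds up."""
    calculated = sum((parse_amount(a) for a in component_amounts), Decimal("0"))
    extracted = parse_amount(extracted_total)
    difference = abs(extracted - calculated)
    if difference == 0:
        return None
    return {
        "error": "calculation_mismatch",
        "field": "total_invoice",
        "extracted_value": f"${extracted:,.2f}",
        "calculated_value": f"${calculated:,.2f}",
        "difference": f"${difference:,.2f}",
        # Assumed rule: differences above one currency unit are "high".
        "severity": "high" if difference > Decimal("1.00") else "low",
    }


mismatch = check_total("$1,250.00", ["$1,000.00", "$235.50"])
```

`Decimal` is used instead of `float` so that currency arithmetic does not introduce rounding artifacts.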

Artificial intelligence complements extraction with additional analysis:

The system identifies document type without prior configuration:

{
  "document_type": "invoice",
  "confidence": 0.95,
  "sub_type": "service_invoice",
  "detected_features": [
    "invoice_number",
    "tax_breakdown",
    "line_items",
    "company_header"
  ]
}

Understands content meaning, not just text:

  • Named entities: People, organizations, locations
  • Relationships: Who invoices whom, who signs what
  • Intentions: Request, notification, certification
  • Sentiment: Document tone (for contracts and communications)

Automatic document organization:

  • By document type
  • By supplier or customer
  • By responsible department
  • By date or period
  • By amount or relevance

Each extracted value includes a confidence score:

{
  "invoice_number": {
    "value": "INV-2024-00123",
    "confidence": 0.99,
    "status": "verified"
  },
  "invoice_date": {
    "value": "2024-03-15",
    "confidence": 0.95,
    "status": "verified"
  },
  "total_amount": {
    "value": "$1,250.00",
    "confidence": 0.72,
    "status": "review_required",
    "reason": "low_image_quality"
  }
}

Confidence levels are interpreted as follows:

  • 0.95 - 1.00: Automatically verified
  • 0.80 - 0.94: Accepted with validation
  • 0.60 - 0.79: Review recommended
  • < 0.60: Review required
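
The ranges above can be expressed as a small lookup. In this sketch the band names are hypothetical labels derived from the list, not values returned by Heptora, and boundaries are treated as lower bounds so that scores between two listed ranges (for example 0.945) fall into the lower band.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence score in [0, 1] to one of the four bands."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence}")
    if confidence >= 0.95:
        return "automatically_verified"
    if confidence >= 0.80:
        return "accepted_with_validation"
    if confidence >= 0.60:
        return "review_recommended"
    return "review_required"


def needs_human_review(confidence: float) -> bool:
    """Fields below 0.80 are the ones flagged for human review."""
    return confidence < 0.80


bands = {score: confidence_band(score) for score in (0.99, 0.85, 0.72, 0.40)}
```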

Specialized interface for human validation of low-confidence data:

  • Visualization of source document
  • Highlighting of extracted fields
  • Zoom on problematic areas
  • Page navigation
  • List of fields to review
  • Confidence indicator per field
  • Alternative suggestions
  • History of similar extractions
  • Direct value editing
  • Selection among suggested options
  • Marking fields as correct
  • Indication of OCR errors

Review workflow:

  1. The system flags fields with confidence < 0.80
  2. Flagged fields are sent to the human review queue
  3. The user validates or corrects the values
  4. The system learns from the corrections
  5. Validated data is integrated into the process

OCR integrates as a draggable block in the visual process designer:

Block: OCR Document Processing
Input: Document (file or URL)
Configuration:
- Document type: Invoice
- Language: English
- Quality: High precision
Output: Structured data (JSON)

The OCR block can be placed at any point in the process:

[Receive Email] → [Download Attachment] → [OCR] → [Validate Data] → [Insert into ERP]

From the visual builder you can:

  • Select document type
  • Define required fields
  • Establish validation rules
  • Configure actions based on confidence
  • Define alternative flows for review

For documents with consistent layout, you can define specific zones:

Define exact document areas:

{
  "zones": [
    {
      "name": "invoice_number",
      "coordinates": {
        "x": 450,
        "y": 100,
        "width": 150,
        "height": 30
      },
      "page": 1,
      "type": "text",
      "validation": "alphanumeric"
    },
    {
      "name": "total_amount",
      "coordinates": {
        "x": 450,
        "y": 650,
        "width": 100,
        "height": 25
      },
      "page": 1,
      "type": "currency",
      "validation": "positive_number"
    }
  ]
}

Define areas relative to fixed elements:

{
  "zone": "client_name",
  "reference_text": "Customer:",
  "offset_x": 100,
  "offset_y": 0,
  "width": 300,
  "height": 20
}

Benefits:

  • Greater precision in structured documents
  • Shorter processing time
  • Reduction of false positives
  • Stricter validation
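
To illustrate how a relative zone might resolve to absolute coordinates once the reference text has been located, here is a minimal sketch. It assumes the offsets are measured from the top-left corner of the reference text's bounding box, which this guide does not specify; `resolve_relative_zone` and the anchor position are hypothetical.

```python
def resolve_relative_zone(zone: dict, anchor_box: dict) -> dict:
    """Compute the absolute zone from the anchor text's bounding box."""
    return {
        "name": zone["zone"],
        "x": anchor_box["x"] + zone["offset_x"],
        "y": anchor_box["y"] + zone["offset_y"],
        "width": zone["width"],
        "height": zone["height"],
    }


relative_zone = {
    "zone": "client_name",
    "reference_text": "Customer:",
    "offset_x": 100,
    "offset_y": 0,
    "width": 300,
    "height": 20,
}
# Hypothetical position where "Customer:" was found on the page.
anchor = {"x": 50, "y": 200, "width": 80, "height": 20}

absolute_zone = resolve_relative_zone(relative_zone, anchor)
```

Because the zone follows the anchor text, the field is still found when the whole layout shifts slightly between scans.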

Predefined models to accelerate configuration:

Heptora includes templates for the most common documents:

  • Generic invoices: Standard US/international model
  • Electronic invoices: Standard electronic format
  • Delivery notes: Shipping documents
  • Purchase orders: Order documents
  • Employment contracts: Standard models
  • Identity documents: Various national IDs

For organization-specific documents:

  1. Upload sample documents (minimum 3-5 examples)
  2. Label key fields in each example
  3. Define specific validations
  4. Test with new documents
  5. Refine and publish the template

Reference the template in the OCR block configuration, with a generic fallback:

OCR Configuration:
  template: "vendor_invoice_xyz"
  fallback: "generic_invoice"
  confidence_threshold: 0.85

The OCR result is a complete JSON object:

{
  "document_id": "doc_20240315_123456",
  "processing_date": "2024-03-15T10:30:00Z",
  "document_type": "invoice",
  "confidence": 0.94,
  "pages": 1,
  "language": "en",
  "extracted_data": {
    "invoice_number": {
      "value": "INV-2024-00123",
      "confidence": 0.99,
      "coordinates": {"x": 450, "y": 100, "width": 150, "height": 30}
    },
    "invoice_date": {
      "value": "2024-03-15",
      "confidence": 0.97,
      "coordinates": {"x": 450, "y": 130, "width": 100, "height": 25}
    },
    "supplier": {
      "name": "Example Supplier Inc.",
      "tax_id": "12-3456789",
      "address": "123 Main Street, New York, NY 10001"
    },
    "customer": {
      "name": "My Company LLC",
      "tax_id": "98-7654321",
      "address": "456 Business Ave, Los Angeles, CA 90001"
    },
    "line_items": [
      {
        "description": "Product A",
        "quantity": 10,
        "unit_price": 25.00,
        "total": 250.00
      }
    ],
    "totals": {
      "subtotal": 250.00,
      "tax": 20.00,
      "total": 270.00,
      "currency": "USD"
    }
  },
  "validation": {
    "status": "validated",
    "errors": [],
    "warnings": ["Image quality could be improved"]
  },
  "metadata": {
    "file_name": "invoice_example.pdf",
    "file_size": 245678,
    "processing_time_ms": 2340
  }
}

In your process, access extracted data:

# Get OCR result
ocr_result = step_output["ocr_document"]

# Access specific fields
invoice_num = ocr_result["extracted_data"]["invoice_number"]["value"]
total = ocr_result["extracted_data"]["totals"]["total"]
supplier_tax_id = ocr_result["extracted_data"]["supplier"]["tax_id"]

# Check confidence
if ocr_result["confidence"] > 0.9:
    # Automatic processing
    process_automatically(ocr_result)
else:
    # Send to review
    send_to_review(ocr_result)

Transform and normalize extracted data:

# Normalize tax IDs (remove spaces, hyphens)
tax_id_clean = normalize_tax_id(extracted_tax_id)
# Convert dates to ISO format
date_iso = convert_to_iso_date(extracted_date)
# Format amounts
amount_decimal = parse_currency(extracted_amount)
# Validate and format IBAN
iban_formatted = validate_and_format_iban(extracted_iban)
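
The helpers named above are not defined in this guide. As a rough idea of what three of them might look like, here is a sketch using only the Python standard library; the implementations, the list of accepted date formats, and the separator heuristic are assumptions rather than Heptora's behavior.

```python
import re
from datetime import datetime
from decimal import Decimal


def normalize_tax_id(raw: str) -> str:
    """Remove spaces, hyphens, and dots, and upper-case the result."""
    return re.sub(r"[\s.\-]", "", raw).upper()


def convert_to_iso_date(raw: str, formats=("%m/%d/%Y", "%Y-%m-%d", "%d-%m-%y")) -> str:
    """Try each known format in order and return the date as yyyy-mm-dd."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Wrong format or impossible date; try the next one.
    raise ValueError(f"Unrecognized or impossible date: {raw!r}")


def parse_currency(raw: str) -> Decimal:
    """Parse amounts such as "$1,250.00" or "1.250,00 €" into a Decimal."""
    cleaned = re.sub(r"[^\d.,-]", "", raw)
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_comma > last_dot and re.search(r",\d{1,2}$", cleaned):
        # A trailing comma with one or two digits is a decimal separator.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Otherwise commas are treated as thousands separators.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


tax_id_clean = normalize_tax_id("12-3456789")
date_iso = convert_to_iso_date("03/15/2024")
amount_decimal = parse_currency("$1,250.00")
```

Ambiguous inputs such as `1.250` (one thousand two hundred fifty, or one and a quarter) cannot be resolved by this heuristic alone; knowing the document's locale is the safer option.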

Complement extracted data with external information:

# Look up supplier in database
supplier = database.find_supplier_by_tax_id(extracted_tax_id)
if supplier:
    ocr_result["supplier_id"] = supplier.id
    ocr_result["supplier_category"] = supplier.category

# Validate product codes
for item in line_items:
    product = database.find_product(item["description"])
    if product:
        item["product_id"] = product.id
        item["product_category"] = product.category

Apply organization-specific logic:

# Classify invoice by amount
if total > 10000:
    approval_level = "director"
elif total > 1000:
    approval_level = "manager"
else:
    approval_level = "supervisor"

# Assign to department by supplier
department = get_department_by_supplier(supplier_tax_id)

# Calculate payment date based on terms
payment_date = calculate_payment_date(
    invoice_date,
    payment_terms,
    holidays_calendar
)

Scenario: Automatic processing of supplier invoices

1. [Email with invoice] → [Download PDF attachment]
2. [OCR: Extract invoice data]
3. [Validate: Supplier tax ID exists in system]
4. [Verify: Calculations are correct]
5. [Check: Associated purchase order]
6. [If confidence ≥ 95%] → [Register automatically in ERP]
7. [If confidence < 95%] → [Send to human validation]
8. [Update status] → [Notify accounting]

Benefits:

  • 80% reduction in processing time
  • Elimination of transcription errors
  • Complete process traceability
  • Staff freed up for analysis tasks

Scenario: Extraction of expiration dates and key conditions

1. [Signed contract] → [Scan or upload PDF]
2. [OCR: Extract clauses and dates]
3. [AI: Identify renewal conditions]
4. [Extract: Expiration dates]
5. [Create: Calendar alerts]
6. [Register: In document management system]
7. [30 days before expiration] → [Notify responsible party]

Benefits:

  • Don’t miss renewal dates
  • Centralization of contractual conditions
  • Proactive alerts
  • Facilitates audits and reviews

Scenario: Processing employee receipts and tickets

1. [Employee photographs receipt] → [Sends via mobile app]
2. [OCR: Extract merchant, date, amount]
3. [Classify: Expense type (meals, transport, etc.)]
4. [Validate: Within company policy]
5. [Associate: To project or client]
6. [If valid] → [Approve automatically]
7. [Register: In reimbursement system]
8. [Generate: Monthly expense report]

Benefits:

  • Immediate reimbursement processing
  • Compliance with expense policies
  • Traceability and automatic reporting
  • Improved employee experience

Scenario: Identity and documentation verification

1. [Customer uploads ID and documents] → [Web portal]
2. [OCR: Extract ID data]
3. [Validate: ID number correct]
4. [Verify: Legal age]
5. [Compare: Data with completed form]
6. [OCR: Process additional documents]
7. [If all OK] → [Activate account automatically]
8. [If discrepancies] → [Request clarification]

Benefits:

  • Instant onboarding (24/7)
  • Reduced abandonment
  • Regulatory compliance (KYC)
  • Improved customer experience

To maximize accuracy:

  • Resolution: At least 300 DPI recommended, optimal 400-600 DPI (the system accepts documents from 150 DPI, with lower accuracy)
  • Format: Preferably PDF, or high-quality PNG/JPG
  • Lighting: Uniform, without pronounced shadows
  • Orientation: Document properly aligned
  • Size: Avoid excessively large images (> 10MB)

If scanning physical documents:

  • Use color or grayscale scanning mode
  • Avoid plain text mode (less flexibility)
  • Clean scanner glass
  • Flatten wrinkled documents
  • Scan one page per file

When using a phone:

  • Good natural or artificial lighting
  • Avoid glare and reflections
  • Frame entire document
  • Keep phone parallel to document
  • Use apps with automatic perspective correction

For large volumes:

# Process multiple documents in parallel
documents = get_pending_documents()

# Divide into batches of 10
batches = chunk_list(documents, 10)
for batch in batches:
    results = process_ocr_batch(batch, parallel=True)
    save_results(results)
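
The `chunk_list` helper used above is not defined in this guide. One plausible implementation, offered as an assumption, is a generator that yields consecutive slices:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_list(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("Batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 25 placeholder document names split into batches of 10.
pending = [f"doc_{n}.pdf" for n in range(25)]
batch_sizes = [len(batch) for batch in chunk_list(pending, 10)]
```

The last batch is simply shorter when the document count is not a multiple of the batch size.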

Avoid reprocessing documents:

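# Cache results for 7 days (assuming expiry is given in seconds)
CACHE_EXPIRY_SECONDS = 7 * 24 * 60 * 60

def process_with_cache(document):
    # Check if already processed
    doc_hash = calculate_hash(document)
    cached_result = cache.get(doc_hash)
    if cached_result:
        return cached_result
    # Not cached yet: run OCR and store the result
    result = process_ocr(document)
    cache.set(doc_hash, result, expiry=CACHE_EXPIRY_SECONDS)
    return result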
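
Neither `calculate_hash` nor `cache` is defined in this guide. The sketch below shows one possible shape for both, assuming the document is available as bytes and that a simple in-memory cache is acceptable; the `TTLCache` class is hypothetical, and a shared store would be needed when several robots process documents.

```python
import hashlib
import time
from typing import Any, Optional


def calculate_hash(document: bytes) -> str:
    """Return the SHA-256 hex digest of the document content."""
    return hashlib.sha256(document).hexdigest()


class TTLCache:
    """Minimal in-memory cache whose entries expire after a given time."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any, expiry: float) -> None:
        # Store the value together with its absolute expiry time.
        self._entries[key] = (time.monotonic() + expiry, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]  # Drop the expired entry.
            return None
        return value


cache = TTLCache()
doc_hash = calculate_hash(b"example document bytes")
cache.set(doc_hash, {"document_type": "invoice"}, expiry=7 * 24 * 60 * 60)
cached = cache.get(doc_hash)
```

Hashing the content rather than the file name means a re-uploaded copy of the same file is recognized even if it has been renamed.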

For multi-page documents:

  • Process pages in parallel
  • Allow early-exit if initial pages indicate invalid document
  • Show progress to user

Handle OCR errors by type:

try:
    result = process_ocr(document)
except OCRError as e:
    if e.type == "unreadable_document":
        notify_user("Document is not readable. Please improve quality.")
    elif e.type == "unsupported_format":
        notify_user("Unsupported format. Use PDF, JPG, PNG, TIFF, or BMP.")
    elif e.type == "corrupted_file":
        notify_user("File is corrupted. Please upload again.")
    else:
        log_error(e)
        send_to_support(document, e)

Retry with enhanced image quality before falling back to manual review:

max_retries = 3
retry_count = 0
result = None  # Stays None if every attempt fails
while retry_count < max_retries:
    try:
        result = process_ocr(document, quality="high")
        break
    except LowConfidenceError:
        retry_count += 1
        if retry_count < max_retries:
            # Enhance the image before the next attempt
            document = enhance_image_quality(document)
        else:
            # Retries exhausted: send to manual review
            send_to_review_queue(document)

  • Extract only necessary fields
  • Don’t store unnecessary personal data
  • Implement limited retention of original documents
  • Encrypt documents in transit (HTTPS)
  • Encrypt storage of sensitive documents
  • Use secrets for external system credentials

Log all operations:

audit_log = {
    "timestamp": "2024-03-15T10:30:00Z",
    "user": "user@company.com",
    "action": "ocr_process",
    "document_id": "doc_123456",
    "document_type": "invoice",
    "fields_extracted": ["invoice_number", "total", "supplier_tax_id"],
    "confidence": 0.94,
    "status": "success"
}
log_to_audit_system(audit_log)

For documents with personal data:

# Anonymize before storing for analysis
anonymized = {
    "document_type": result["document_type"],
    "confidence": result["confidence"],
    "processing_time": result["metadata"]["processing_time_ms"],
    # Don't include personal data
}
store_for_analytics(anonymized)

Symptoms: Many fields with low confidence or incorrect values

Possible causes:

  • Insufficient image quality
  • Non-standard document format
  • Language not configured correctly
  • Document type misidentified

Solutions:

  1. Improve image quality (higher resolution, better lighting)
  2. Use specific templates for non-standard documents
  3. Verify configured language is correct
  4. Manually specify document type
  5. Define specific zones for critical fields

Symptoms: Tables not extracted or lose structure

Possible causes:

  • Very faint table lines
  • Table without visible borders
  • Complex merged cells
  • Non-standard table format

Solutions:

  1. Activate “advanced table detection” in configuration
  2. Improve document contrast
  3. For borderless tables, use spacing-based detection
  4. Consider manual extraction for complex tables
  5. Define expected table structure in template

Symptoms: Only first page is processed

Possible causes:

  • Limited page configuration
  • Processing timeout
  • Very heavy document

Solutions:

  1. Verify configuration: “Process all pages”
  2. Increase processing timeout
  3. Split very large documents (>50 pages)
  4. Use batch processing for heavy documents

Symptoms: Symbols or special characters incorrect

Possible causes:

  • Incorrect encoding
  • Language not configured
  • Non-standard typeface

Solutions:

  1. Explicitly configure document language
  2. Verify encoding (UTF-8 recommended)
  3. For handwritten fonts, activate “handwriting recognition”
  4. Apply post-processing to normalize characters

Symptoms: OCR takes a long time

Possible causes:

  • Very large document or high resolution
  • Multi-page processing
  • Extraction of many tables
  • Limited system resources

Solutions:

  1. Reduce resolution if > 600 DPI
  2. Process pages in parallel
  3. Use asynchronous processing for large documents
  4. Implement caching for repeated documents
  5. Consider scaling robot resources

Accuracy varies by document type and quality:

  • Quality digital documents: 95-99% accuracy
  • Good quality scanned documents: 90-95%
  • Mobile photographed documents: 85-93%
  • Low quality documents: 70-85%

Fields with confidence < 80% are marked for review.

Handwritten text is supported, but with limitations. Legible handwriting reaches 70-85% accuracy. For forms with handwritten fields, it’s better to combine automatic OCR with human review of those specific fields.

How many documents can I process per month?

It depends on your Heptora plan. OCR consumes credits based on:

  • Number of pages processed
  • Document complexity (tables, low quality)
  • Advanced features (AI, validation)

Check your usage dashboard or contact sales.

Where your documents are processed and stored depends on your configuration:

  • Local mode: Documents processed only on local robot, not sent to cloud
  • Hybrid mode: Document sent for processing but not permanently stored
  • Cloud mode: Documents stored according to your retention configuration

Choose based on your privacy requirements.

You can adapt OCR to your own document formats by creating custom templates, training the system with examples of your specific documents. This significantly improves accuracy for proprietary or non-standard formats.

OCR can run partly offline: basic processing can work locally on the robot, but advanced AI features (classification, semantic validation) require connectivity. Configure the mode according to your needs.

What do I do with fields that always have low confidence?

For recurring problematic fields:

  1. Define a specific zone for that field
  2. Adjust validation parameters
  3. Create a custom template
  4. Consider specific post-processing
  5. If it persists, implement human validation only for that field

If this guide didn’t solve your problem or you found an error in the documentation:

  • Technical support: help@heptora.com
  • Describe the type of document you’re trying to process
  • Include a sample document (without sensitive data)
  • Indicate specific fields with problems
  • Mention the confidence obtained in fields

Our team will help you optimize OCR for your specific documents.