Computer Vision Automation

Computer Vision Automation in Heptora lets a process interact with any application visible on screen through visual element recognition, keyboard and mouse simulation, and on-screen text reading. It is well suited to automating legacy applications, systems without APIs, mainframes, and proprietary software.

Visual Application Control

Heptora uses advanced computer vision technology to “see” the screen as a human would, identify visual elements, and execute precise actions on them. This eliminates dependency on APIs or programmatic interfaces.

Visual Automation Advantages

🎯 Universal Compatibility: Works with any visible application, no APIs needed
🖼️ Intelligent Recognition: Identifies buttons, fields, icons, and interface elements
📝 Integrated OCR: Reads and extracts text directly from screen
🎮 Complete Control: Simulates clicks, keystrokes, dragging, and any user action
🔍 Pixel-Perfect Precision: Exact localization of on-screen elements
🏢 Ideal for Legacy: Automates old systems without modernization
🚀 No Modifications: Requires no changes to target applications
🔄 Resilient: Adapts to minor interface changes

Visual Recognition Capabilities

UI Element Identification

The computer vision system recognizes multiple types of on-screen elements:

Standard Components

Buttons and Controls:

Text buttons
Icon buttons
Radio buttons
Checkboxes
Selectors and dropdowns
Sliders

Input Fields:

Text fields
Multi-line text areas
Password fields
Numeric fields
Date pickers
Search fields

Navigation Elements:

Menus and submenus
Tabs
Toolbars
Breadcrumbs
Hyperlinks
Navigation icons

Visual Elements

Image Recognition:

# Find a specific button by its image
save_button = vision.find_image(
    template="images/save_button.png",
    confidence=0.9
)

if save_button:
    vision.mouse.click(save_button.center)
    print(f"Button found at: {save_button.coordinates}")
else:
    print("Button not found on screen")

Recognition features:

Tolerance to color variations
Detection at different scales
Resistance to lighting changes
Configurable partial matching
Search in specific regions

OCR - Text Recognition

Extract text directly from any screen area:

Simple Text Reading

# Read text from a specific area
region = {"x": 100, "y": 200, "width": 300, "height": 50}
text = vision.read_text(region=region)
print(f"Text found: {text}")

Text Search on Screen

# Search for a specific word or phrase
result = vision.find_text(
    text="Total amount",
    exact_match=False
)

if result:
    # Read the numeric value to the right
    value_region = {
        "x": result.x + result.width + 10,
        "y": result.y,
        "width": 100,
        "height": result.height
    }
    value = vision.read_text(region=value_region)
    print(f"Value found: {value}")
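The offset arithmetic in the example above can be factored into a small helper. This is a minimal sketch rather than part of the Heptora API: `region_right_of` is a hypothetical name, and it assumes the match object exposes `x`, `y`, `width`, and `height` as in the example. A `SimpleNamespace` stands in for a real search result.

```python
from types import SimpleNamespace


def region_right_of(match, width, gap=10):
    """Build a region dict for the area immediately to the right of a match.

    `match` is any object with x, y, width and height attributes
    (assumed to mirror what a text search returns).
    """
    return {
        "x": match.x + match.width + gap,
        "y": match.y,
        "width": width,
        "height": match.height,
    }


# Stand-in for a text search result, used only for illustration.
label = SimpleNamespace(x=100, y=200, width=80, height=20)
value_region = region_right_of(label, width=100)
print(value_region)
```

The resulting dict has the same shape as the `region` arguments used throughout this guide, so it could be passed to a text-reading call directly.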

Structured Extraction

Tables and Grids:

# Extract data from a visual table
table = vision.extract_table(
    region={"x": 50, "y": 100, "width": 800, "height": 400},
    has_header=True
)

for row in table.rows:
    print(f"ID: {row[0]}, Name: {row[1]}, Amount: {row[2]}")

Forms:

# Extract all fields from a form
form = vision.extract_form_fields(
    region="full_screen"
)

for field, value in form.items():
    print(f"{field}: {value}")

Screen Change Detection

Monitor specific areas to detect updates:

# Wait for an element to appear
vision.wait_for_image(
    template="images/success_message.png",
    timeout=30
)

# Wait for an element to disappear (e.g., loading spinner)
vision.wait_until_gone(
    template="images/spinner.png",
    timeout=60
)

# Wait for a screen area to change
vision.wait_for_change(
    region={"x": 500, "y": 200, "width": 300, "height": 100},
    timeout=10
)
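When none of the built-in waits fits a condition, the same behavior can be approximated with a generic polling loop. The sketch below is an illustration under assumptions: `wait_until` is a hypothetical helper, not a Heptora function, and the condition is any zero-argument callable supplied by the caller.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example with a stand-in condition that becomes true on the third check.
checks = []


def ready():
    checks.append(1)
    return len(checks) >= 3


print(wait_until(ready, timeout=2.0, interval=0.01))
```

In a real process the condition would wrap a vision check, for example a lambda that tests whether an expected text is on screen.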

Visual Validation

Verify application state through visual inspection:

# Verify we're on the correct screen
if vision.image_exists("images/app_logo.png"):
    print("Start screen confirmed")
else:
    raise Exception("Not on expected screen")

# Validate that a process completed
if vision.text_exists("Process completed successfully"):
    log.info("Operation finished correctly")
else:
    log.warning("Confirmation message not found")

# Check color of an element (e.g., status indicator)
indicator_color = vision.get_pixel_color(x=850, y=120)
if indicator_color == (0, 255, 0):  # Green
    print("Status: Active")
elif indicator_color == (255, 0, 0):  # Red
    print("Status: Error")
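Exact equality against a pure color, as in the check above, can fail when the indicator is anti-aliased or slightly tinted. A tolerance-based comparison is one way to make it more forgiving. This is a sketch with hypothetical helper names (`color_matches`, `classify_status`), and the tolerance of 30 is an arbitrary starting point to tune per application.

```python
def color_matches(color, target, tolerance=30):
    """Return True if every RGB channel of `color` is within `tolerance` of `target`."""
    return all(abs(c - t) <= tolerance for c, t in zip(color, target))


GREEN = (0, 255, 0)
RED = (255, 0, 0)


def classify_status(color):
    """Map an indicator pixel color to a status label."""
    if color_matches(color, GREEN):
        return "Active"
    if color_matches(color, RED):
        return "Error"
    return "Unknown"


print(classify_status((12, 240, 8)))
```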

Keyboard and Mouse Control

Mouse Actions

Clicks and Movements

# Simple click at specific coordinates
vision.mouse.click(x=500, y=300)

# Right click
vision.mouse.right_click(x=500, y=300)

# Double click
vision.mouse.double_click(x=500, y=300)

# Move cursor without clicking
vision.mouse.move_to(x=500, y=300)

# Click at the center of a found image (skip if the image is not on screen)
button = vision.find_image("accept_button.png")
if button:
    vision.mouse.click(button.center)

Drag & Drop

# Drag from one point to another
vision.mouse.drag_to(
    from_x=200,
    from_y=300,
    to_x=400,
    to_y=300
)

# Drag a found element (only if both source and destination were located)
element = vision.find_image("file.png")
destination = vision.find_image("folder.png")
if element and destination:
    vision.mouse.drag_to(
        from_x=element.center.x,
        from_y=element.center.y,
        to_x=destination.center.x,
        to_y=destination.center.y
    )

Scroll and Navigation

# Vertical scroll (positive down, negative up)
vision.mouse.scroll(clicks=-5)  # Scroll up

# Horizontal scroll
vision.mouse.horizontal_scroll(clicks=3)  # Scroll right

# Scroll in a specific region
vision.mouse.move_to(x=600, y=400)
vision.mouse.scroll(clicks=10)  # Scroll in that area

Keyboard Actions

Keys and Combinations

# Type text
vision.keyboard.type("Hello World")

# Special keys
vision.keyboard.press("Enter")
vision.keyboard.press("Tab")
vision.keyboard.press("Escape")

# Key combinations
vision.keyboard.hotkey("Ctrl", "C")  # Copy
vision.keyboard.hotkey("Ctrl", "V")  # Paste
vision.keyboard.hotkey("Ctrl", "S")  # Save
vision.keyboard.hotkey("Alt", "F4")  # Close window

# Hold key down
vision.keyboard.key_down("Shift")
vision.keyboard.press("A")
vision.keyboard.press("B")
vision.keyboard.press("C")
vision.keyboard.key_up("Shift")  # Types "ABC"

Intelligent Typing

# Type with pause between characters (more natural)
vision.keyboard.type("user@example.com", interval=0.1)

# Type with interspersed special keys
vision.keyboard.type("Name: ")
vision.keyboard.type("John Doe")
vision.keyboard.press("Tab")
vision.keyboard.type("Email: ")
vision.keyboard.type("john@example.com")

# Clear field and type
vision.keyboard.hotkey("Ctrl", "A")  # Select all
vision.keyboard.press("Delete")      # Delete
vision.keyboard.type("New text")     # Type

System-Specific Shortcuts

# Windows
vision.keyboard.hotkey("Win", "R")  # Run
vision.keyboard.type("notepad")
vision.keyboard.press("Enter")

# Window operations
vision.keyboard.hotkey("Alt", "Tab")     # Switch window
vision.keyboard.hotkey("Win", "D")       # Show desktop
vision.keyboard.hotkey("Ctrl", "Shift", "Escape")  # Task manager

# Text operations
vision.keyboard.hotkey("Ctrl", "Z")      # Undo
vision.keyboard.hotkey("Ctrl", "Y")      # Redo
vision.keyboard.hotkey("Ctrl", "F")      # Find

Action Combinations

Create complex sequences combining mouse and keyboard:

import time

# Copy text visible on screen
def copy_text_on_screen(region):
    # Triple click at the center of the region to select the entire paragraph
    center_x = region["x"] + region["width"] // 2
    center_y = region["y"] + region["height"] // 2

    vision.mouse.move_to(center_x, center_y)
    vision.mouse.triple_click(center_x, center_y)

    # Copy to clipboard and give the application time to respond
    vision.keyboard.hotkey("Ctrl", "C")
    time.sleep(0.5)

    # Read from clipboard (assumes a clipboard helper exposing paste()
    # is available in the process environment)
    return clipboard.paste()

# Fill complete form
def fill_form(data):
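    # Click on first field, 100 px to the right of its label
    first_field = vision.find_text("Name:")
    if not first_field:
        raise RuntimeError('Label "Name:" not found on screen')
    vision.mouse.click(first_field.x + 100, first_field.y)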

    # Fill fields with Tab between them
    vision.keyboard.type(data["name"])
    vision.keyboard.press("Tab")

    vision.keyboard.type(data["lastname"])
    vision.keyboard.press("Tab")

    vision.keyboard.type(data["email"])
    vision.keyboard.press("Tab")

    vision.keyboard.type(data["phone"])

    # Submit form
    vision.keyboard.press("Enter")

Legacy Application Automation

Terminal Systems and Mainframes

Computer vision is ideal for automating terminal applications that lack modern APIs:

Terminal Emulators (AS/400, IBM 3270)

Scenario: Order entry in mainframe system

# Wait for login screen to load
vision.wait_for_text("MAIN SYSTEM", timeout=10)

# Enter credentials
vision.keyboard.type(username)
vision.keyboard.press("Tab")
vision.keyboard.type(password)
vision.keyboard.press("Enter")

# Wait for main menu
vision.wait_for_text("MAIN MENU", timeout=5)

# Navigate to orders module (option 2)
vision.keyboard.type("2")
vision.keyboard.press("Enter")

# Wait for order entry screen
vision.wait_for_text("ORDER ENTRY", timeout=5)

# Fill order form
vision.keyboard.type(customer_code)
vision.keyboard.press("Tab")
vision.keyboard.type(order_number)
vision.keyboard.press("Tab")

# Enter order lines
for item in order_items:
    vision.keyboard.type(item["code"])
    vision.keyboard.press("Tab")
    vision.keyboard.type(str(item["quantity"]))
    vision.keyboard.press("Tab")
    vision.keyboard.press("Enter")  # Confirm line

# Confirm order (F5)
vision.keyboard.press("F5")

# Verify confirmation message
if vision.text_exists("ORDER REGISTERED"):
    # Extract confirmed order number
    number_region = vision.find_text("ORDER NO:")
    confirmed_number = vision.read_text(
        region={
            "x": number_region.x + 150,
            "y": number_region.y,
            "width": 100,
            "height": 20
        }
    )
    log.info(f"Order confirmed: {confirmed_number}")
else:
    raise Exception("Order confirmation not received")

# Return to main menu
vision.keyboard.press("F3")

DOS Applications

import os
import time

# Launch DOS application
os.system("start dosbox legacy_app.exe")

# Wait for application to load
vision.wait_for_text("INVENTORY SYSTEM V3.2", timeout=15)

# Navigate menus (number + Enter)
vision.keyboard.type("1")  # Queries
vision.keyboard.press("Enter")

vision.keyboard.type("3")  # Query by code
vision.keyboard.press("Enter")

# Enter product code
vision.keyboard.type("PROD-12345")
vision.keyboard.press("Enter")

# Extract information from screen
name = vision.read_text(region={"x": 150, "y": 100, "width": 300, "height": 20})
stock = vision.read_text(region={"x": 150, "y": 140, "width": 100, "height": 20})
price = vision.read_text(region={"x": 150, "y": 180, "width": 100, "height": 20})

print(f"Product: {name}")
print(f"Stock: {stock}")
print(f"Price: {price}")

# Exit (multiple Escape)
for _ in range(3):
    vision.keyboard.press("Escape")
    time.sleep(0.5)

Citrix and Remote Desktop Applications

Automate applications running in remote sessions:

Connection and Navigation

# Connect to Citrix session
def connect_citrix(application):
    # Open Citrix Workspace
    vision.keyboard.hotkey("Win", "R")
    vision.keyboard.type("citrix workspace")
    vision.keyboard.press("Enter")

    # Wait for loading
    vision.wait_for_image("citrix_workspace_logo.png", timeout=10)

    # Search for application
    search_field = vision.find_image("search_icon.png")
    vision.mouse.click(search_field.center)
    vision.keyboard.type(application)
    time.sleep(1)

    # Click on first result
    first_result = vision.find_image(f"{application}_icon.png")
    vision.mouse.click(first_result.center)

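    # Wait for application launch: the "Loading..." text appears, then disappears.
    # wait_until_gone takes an image template, so poll text_exists for the text.
    vision.wait_for_text("Loading...", timeout=5)
    deadline = time.time() + 30
    while vision.text_exists("Loading...") and time.time() < deadline:
        time.sleep(0.5)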

# Use remote application
connect_citrix("Financial ERP")

# Now interact normally
vision.wait_for_text("MAIN MENU")
# ... rest of automation

Latency Handling

# For remote connections, increase wait times
REMOTE_WAIT_TIME = 2.0

def remote_click(x, y):
    vision.mouse.click(x, y)
    time.sleep(REMOTE_WAIT_TIME)

def remote_type(text):
    vision.keyboard.type(text, interval=0.15)  # Slower
    time.sleep(REMOTE_WAIT_TIME)

def verify_remote_screen_change(expected_text, timeout=30):
    start = time.time()
    while time.time() - start < timeout:
        if vision.text_exists(expected_text):
            time.sleep(REMOTE_WAIT_TIME)  # Additional wait
            return True
        time.sleep(1)
    return False

Proprietary Applications without APIs

Automate closed or undocumented software:

Windows Desktop Applications

# Example: Proprietary medical management software
def register_patient(patient_data):
    # Ensure application is in foreground
    app_window = vision.find_image("app_logo.png")
    if not app_window:
        # Launch application if not open
        vision.keyboard.hotkey("Win", "R")
        vision.keyboard.type("C:\\Program\\MediManage\\mediman.exe")
        vision.keyboard.press("Enter")
        vision.wait_for_image("app_logo.png", timeout=20)

    # Navigate to patients module
    patients_button = vision.find_image("buttons/patients.png")
    vision.mouse.click(patients_button.center)

    # Click "New Patient"
    vision.wait_for_image("buttons/new_patient.png", timeout=5)
    new_button = vision.find_image("buttons/new_patient.png")
    vision.mouse.click(new_button.center)

    # Wait for form to open
    vision.wait_for_text("PATIENT DATA", timeout=3)

    # Fill form
    fields = [
        patient_data["name"],
        patient_data["lastname"],
        patient_data["id_number"],
        patient_data["birth_date"],
        patient_data["phone"],
        patient_data["email"],
        patient_data["address"]
    ]

    for field in fields:
        vision.keyboard.type(str(field))
        vision.keyboard.press("Tab")
        time.sleep(0.3)

    # Save
    save_button = vision.find_image("buttons/save.png")
    vision.mouse.click(save_button.center)

    # Verify confirmation
    if vision.wait_for_text("PATIENT REGISTERED", timeout=5):
        # Extract medical record number
        number_region = vision.find_text("Record No:")
        medical_record = vision.read_text(
            region={
                "x": number_region.x + 120,
                "y": number_region.y,
                "width": 100,
                "height": 20
            }
        )
        return medical_record
    else:
        raise Exception("Could not confirm registration")

Software with Non-Standard Interface

# Example: Application with custom drawn controls
def click_custom_button(button_color, search_region):
    """
    Find and click custom drawn buttons
    based on their distinctive color
    """
    # Find pixels of the button color within the search region
    locations = vision.find_color(
        color=button_color,
        tolerance=10,
        region=search_region
    )

    if locations:
        # Calculate center of pixel cluster
        # (calculate_cluster_center is a user-defined helper)
        center = calculate_cluster_center(locations)
        vision.mouse.click(center.x, center.y)
        return True
    return False

# Usage
GREEN_ACCEPT = (0, 200, 50)
RED_CANCEL = (200, 50, 50)

button_region = {"x": 400, "y": 500, "width": 300, "height": 100}

if click_custom_button(GREEN_ACCEPT, button_region):
    print("Accept button pressed")
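The example above calls `calculate_cluster_center` without defining it. One possible implementation is a plain centroid, sketched below. It assumes the color search returns a list of `(x, y)` tuples, which the source does not confirm, and it returns a small `Point` type so that `center.x` and `center.y` work as in the example.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


def calculate_cluster_center(locations):
    """Return the centroid of a list of (x, y) pixel coordinates, in whole pixels."""
    if not locations:
        raise ValueError("locations must not be empty")
    xs = [x for x, _ in locations]
    ys = [y for _, y in locations]
    return Point(round(sum(xs) / len(xs)), round(sum(ys) / len(ys)))


# Four pixels forming a small square; the centroid is its middle.
pixels = [(10, 20), (20, 20), (10, 30), (20, 30)]
print(calculate_cluster_center(pixels))
```

A centroid is only meaningful when the matching pixels form a single cluster. If the same color appears in two separate places inside the region, the result can fall between them, so a tight search region matters here.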

SAP GUI

While SAP has scripting, computer vision can be useful in restricted environments:

# Launch SAP transaction
def execute_sap_transaction(transaction_code):
    # Focus on command field
    command_field = vision.find_image("sap_command_field.png")
    vision.mouse.click(command_field.center)

    # Clear any previous text
    vision.keyboard.hotkey("Ctrl", "A")
    vision.keyboard.press("Delete")

    # Enter transaction
    vision.keyboard.type(f"/n{transaction_code}")
    vision.keyboard.press("Enter")

    # Wait for loading
    time.sleep(2)

# Extract data from SAP table
def extract_sap_table(table_region):
    # Click on first cell
    vision.mouse.click(
        table_region["x"] + 50,
        table_region["y"] + 50
    )

    rows = []
    max_rows = 50

    for _ in range(max_rows):
        # Select from the cursor to the end of the row and copy it
        vision.keyboard.hotkey("Shift", "End")
        vision.keyboard.hotkey("Ctrl", "C")
        time.sleep(0.3)

        # Get data from clipboard
        row_text = clipboard.paste()

        # An empty copy, or the same text as the previous row, means the
        # cursor did not move: end of table
        if not row_text or (rows and row_text == rows[-1]):
            break

        rows.append(row_text)

        # Go to next row
        vision.keyboard.press("Down")
        vision.keyboard.press("Home")
        time.sleep(0.2)

    return rows
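The rows collected above are raw clipboard strings. A possible next step is splitting them into named fields, sketched here under the assumption that the copied text is tab-separated, which depends on the SAP control in use. `parse_rows` and the column names are hypothetical.

```python
def parse_rows(rows, columns, delimiter="\t"):
    """Split each copied row into fields and map them to column names.

    Rows whose field count does not match `columns` are returned separately
    so they can be reviewed instead of being silently dropped.
    """
    parsed, rejected = [], []
    for row in rows:
        fields = [field.strip() for field in row.rstrip("\r\n").split(delimiter)]
        if len(fields) == len(columns):
            parsed.append(dict(zip(columns, fields)))
        else:
            rejected.append(row)
    return parsed, rejected


# Sample clipboard strings, invented for illustration.
sample = ["1000\tWidget\t12.50\r\n", "broken row"]
records, skipped = parse_rows(sample, columns=["id", "name", "amount"])
print(records)
print(skipped)
```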

Advanced Techniques

Multiple Monitor Handling

# Specify which monitor to search
primary_monitor = vision.get_monitor(0)
secondary_monitor = vision.get_monitor(1)

# Find element on specific monitor
button = vision.find_image(
    "button.png",
    region=secondary_monitor.bounds
)

# Move cursor between monitors
vision.mouse.move_to(
    secondary_monitor.x + 500,
    secondary_monitor.y + 300
)

Capture and Comparison

# Capture current state of an area
initial_capture = vision.capture_region(
    region={"x": 100, "y": 100, "width": 400, "height": 300}
)

# Perform some action
vision.mouse.click(500, 400)
time.sleep(2)

# Capture new state
final_capture = vision.capture_region(
    region={"x": 100, "y": 100, "width": 400, "height": 300}
)

# Compare if changed
if vision.images_are_different(initial_capture, final_capture, threshold=0.95):
    print("Screen changed after action")
else:
    print("No changes detected")
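To make the idea of a comparison threshold concrete, the sketch below computes the fraction of matching pixels between two captures and treats them as different when that fraction falls below the threshold. This is an assumption about what such a threshold could mean, not a description of how `vision.images_are_different` works internally. Images are modeled as nested lists of RGB tuples.

```python
def similarity(image_a, image_b, tolerance=0):
    """Fraction of pixels (0.0 to 1.0) whose channels all differ by at most `tolerance`.

    Images are lists of rows; each pixel is an (r, g, b) tuple.
    """
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = matching = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if all(abs(a - b) <= tolerance for a, b in zip(pixel_a, pixel_b)):
                matching += 1
    return matching / total if total else 1.0


def images_differ(image_a, image_b, threshold=0.95):
    """Treat two images as different when fewer than `threshold` of their pixels match."""
    return similarity(image_a, image_b) < threshold


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
before = [[WHITE] * 10 for _ in range(10)]
after = [row[:] for row in before]
after[0][0] = BLACK  # a single changed pixel out of 100
print(similarity(before, after), images_differ(before, after))
```

Under this reading, a lower threshold ignores small changes such as a blinking cursor, while a threshold close to 1.0 reports almost any change.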

Adaptive Recognition

# Find element with multiple variants
variants = [
    "save_button_en.png",
    "save_button_es.png",
    "save_button_alternative.png"
]

found_button = None
for variant in variants:
    button = vision.find_image(variant, confidence=0.85)
    if button:
        found_button = button
        break

if found_button:
    vision.mouse.click(found_button.center)
else:
    # Try text search as fallback (a single search avoids running OCR twice)
    region = vision.find_text("Save")
    if region:
        vision.mouse.click(region.center)
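The variant loop and text fallback above follow one pattern: try several search strategies in order and stop at the first hit. A generic version might look like the following sketch. `first_match` is a hypothetical helper, and the lambdas stand in for real image or text searches.

```python
def first_match(strategies):
    """Run (name, finder) pairs in order; return (name, result) for the first truthy result.

    Returns (None, None) if no strategy finds anything.
    """
    for name, finder in strategies:
        result = finder()
        if result:
            return name, result
    return None, None


# Stand-in finders: the image searches fail and the text search succeeds.
strategies = [
    ("image_en", lambda: None),
    ("image_es", lambda: None),
    ("text", lambda: {"x": 640, "y": 480}),
]
name, match = first_match(strategies)
print(name, match)
```

Returning the strategy name alongside the result makes it easy to log which variant matched, which helps spot templates that never match and can be retired.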

Coordination with Other Automations

# Combine computer vision with web automation
def hybrid_process():
    # Part 1: Extract data from legacy app with vision
    legacy_data = extract_mainframe_data()

    # Part 2: Process in modern web application
    browser = heptora.browser.create()
    browser.navigate("https://modern-erp.company.com")
    browser.fill_form(legacy_data)

    # Part 3: Verify result with computer vision
    vision.wait_for_text("Synchronization completed")

    return True

Practical Use Cases

Data Migration from Legacy System

Scenario: Extract product catalog from old DOS software

def extract_complete_catalog():
    products = []

    # Open legacy system
    open_inventory_system()

    # Go to product query
    vision.keyboard.type("1")  # Products menu
    vision.keyboard.press("Enter")
    vision.keyboard.type("2")  # List all
    vision.keyboard.press("Enter")

    # Extract page by page
    page = 1
    while True:
        # Read products from current screen
        for line in range(5, 20):  # 15 products per page
            # Position of each field
            y = 80 + (line * 25)

            code = vision.read_text(
                region={"x": 50, "y": y, "width": 100, "height": 20}
            ).strip()

            if not code:
                break  # End of products

            name = vision.read_text(
                region={"x": 160, "y": y, "width": 300, "height": 20}
            ).strip()

            price = vision.read_text(
                region={"x": 470, "y": y, "width": 80, "height": 20}
            ).strip()

            stock = vision.read_text(
                region={"x": 560, "y": y, "width": 60, "height": 20}
            ).strip()

            products.append({
                "code": code,
                "name": name,
                "price": price,
                "stock": stock
            })

        # Stop once the end-of-list indicator is on screen. Checking after
        # reading means products on the final page are not skipped.
        if vision.text_exists("END OF LIST"):
            break

        # Go to next page
        vision.keyboard.press("PageDown")
        time.sleep(1)

        page += 1

        if page > 100:  # Safety limit
            break

    return products

import pandas

# Export to modern format
products = extract_complete_catalog()
df = pandas.DataFrame(products)
df.to_excel("migrated_catalog.xlsx", index=False)  # needs an Excel writer such as openpyxl
print(f"Extracted {len(products)} products")

Medical Application Automation

Scenario: Laboratory results registration

def process_lab_results(results_file):
    # Read results from file
    results = read_excel(results_file)

    # Open hospital management application
    open_hospital_app()

    for result in results:
        # Search for patient
        search_patient(result["id_number"])

        # Go to laboratory module
        lab_button = vision.find_image("modules/laboratory.png")
        vision.mouse.click(lab_button.center)

        # New result
        vision.keyboard.hotkey("Ctrl", "N")
        vision.wait_for_text("NEW RESULT")

        # Fill form
        vision.keyboard.type(result["analysis_type"])
        vision.keyboard.press("Tab")

        vision.keyboard.type(result["date"])
        vision.keyboard.press("Tab")

        # Enter values
        for value in result["values"]:
            vision.keyboard.type(value["parameter"])
            vision.keyboard.press("Tab")
            vision.keyboard.type(str(value["result"]))
            vision.keyboard.press("Tab")
            vision.keyboard.type(value["unit"])
            vision.keyboard.press("Enter")

        # Save
        vision.keyboard.hotkey("Ctrl", "S")

        # Verify saved
        if vision.wait_for_text("RESULT SAVED", timeout=5):
            log.info(f"Result saved for patient {result['id_number']}")
        else:
            log.error(f"Error saving result for {result['id_number']}")
            screenshot = vision.capture_screen()
            vision.save_image(screenshot, f"error_{result['id_number']}.png")

        # Return to main menu
        vision.keyboard.press("Escape")
        vision.keyboard.press("Escape")

Batch Process Validation

Scenario: Verify a batch process executed correctly

def verify_batch_process(process_name):
    # Open system monitor
    open_system_monitor()

    # Search for specific process
    search_field = vision.find_image("search_icon.png")
    vision.mouse.click(search_field.center)
    vision.keyboard.type(process_name)
    vision.keyboard.press("Enter")
    time.sleep(2)

    # Verify status
    status_region = vision.find_text("Status:")
    status_text = vision.read_text(
        region={
            "x": status_region.x + 80,
            "y": status_region.y,
            "width": 150,
            "height": 25
        }
    )

    # Visually verify indicator color
    indicator_color = vision.get_pixel_color(
        x=status_region.x - 30,
        y=status_region.y + 10
    )

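    # Green indicator: strong green channel with weak red and blue,
    # so a white or yellow pixel does not count as green
    is_green = (
        indicator_color[1] > 200
        and indicator_color[0] < 100
        and indicator_color[2] < 100
    )

    if "COMPLETED" in status_text.upper() and is_green: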
        log.info(f"Process {process_name} completed successfully")

        # Extract statistics
        records = vision.read_text_after_label("Records processed:")
        errors = vision.read_text_after_label("Errors:")
        duration = vision.read_text_after_label("Duration:")

        return {
            "status": "COMPLETED",
            "records": records,
            "errors": errors,
            "duration": duration
        }
    else:
        log.error(f"Process {process_name} failed or pending")
        return {"status": "ERROR"}

Best Practices

Search Strategies

Use stable references
- Search for elements that don’t change (logos, titles)
- Avoid dynamic elements like timestamps

Combine methods

# Robust method: try image, then text
button = vision.find_image("next_button.png")
if not button:
    button = vision.find_text("Next")
if button:
    vision.mouse.click(button.center)

Use search regions

# Faster and more precise
button = vision.find_image(
    "button.png",
    region={"x": 700, "y": 400, "width": 200, "height": 100}
)

Error Handling

def robust_action(max_attempts=3):
    for attempt in range(max_attempts):
        try:
            # Try the action
            vision.mouse.click_image("button.png")

            # Verify it worked
            if vision.wait_for_text("Confirmation", timeout=5):
                return True
        except ElementNotFoundError:
            if attempt < max_attempts - 1:
                log.warning(f"Attempt {attempt + 1} failed, retrying...")
                time.sleep(2)
            else:
                # Capture error evidence
                screenshot = vision.capture_screen()
                vision.save_image(screenshot, "final_error.png")
                raise

    return False
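The retry logic above can be separated from the action being retried. The sketch below is one way to do that, under assumptions: `retry` is a hypothetical helper, the growing delay is a design choice rather than anything Heptora prescribes, and the `sleep` parameter exists so the wait can be replaced in tests.

```python
import time


def retry(action, attempts=3, delay=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted.

    The last exception is re-raised so the caller can capture evidence.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # wait longer after each failure


# Stand-in action that fails twice before succeeding.
state = {"calls": 0}


def flaky_click():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("element not found")
    return "clicked"


print(retry(flaky_click, delay=0.01))
```

Passing a narrow `exceptions` tuple keeps genuine bugs, such as a typo in a template path, from being retried as if they were timing problems.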

Performance Optimization

Cache template images

# Load templates at start
TEMPLATES = {
    "save": vision.load_template("buttons/save.png"),
    "cancel": vision.load_template("buttons/cancel.png"),
    "accept": vision.load_template("buttons/accept.png")
}

# Use cached templates
button = vision.find_template(TEMPLATES["save"])

Reduce search area

# Define common regions
REGIONS = {
    "bottom_buttons": {"x": 0, "y": 700, "width": 1920, "height": 280},
    "top_menu": {"x": 0, "y": 0, "width": 1920, "height": 100},
    "right_panel": {"x": 1400, "y": 100, "width": 520, "height": 900}
}
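The regions above are hard-coded for a 1920-pixel-wide screen. If the same process must run at another resolution, the coordinates can be scaled proportionally. This is a sketch with a hypothetical helper name, and it assumes the reference layout is 1920x1080 and that the target application scales its layout uniformly, which is not always the case.

```python
def scale_region(region, source_size, target_size):
    """Scale a region defined for `source_size` (width, height) to `target_size`."""
    source_w, source_h = source_size
    target_w, target_h = target_size
    return {
        "x": round(region["x"] * target_w / source_w),
        "y": round(region["y"] * target_h / source_h),
        "width": round(region["width"] * target_w / source_w),
        "height": round(region["height"] * target_h / source_h),
    }


top_menu = {"x": 0, "y": 0, "width": 1920, "height": 100}
print(scale_region(top_menu, (1920, 1080), (1280, 720)))
```

Scaling regions does not scale image templates. Templates captured at one resolution may still need to be recaptured for another.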

Adjust confidence levels

# For stable elements, use high confidence
logo = vision.find_image("logo.png", confidence=0.95)

# For variable elements, reduce confidence
button = vision.find_image("dynamic_button.png", confidence=0.75)

Evidence Capture

import os
from datetime import datetime

def execute_with_evidence(process_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    evidence_folder = f"evidence/{process_name}_{timestamp}"
    os.makedirs(evidence_folder, exist_ok=True)

    step = 0

    def capture_step(description):
        nonlocal step
        step += 1
        screenshot = vision.capture_screen()
        path = f"{evidence_folder}/step_{step:02d}_{description}.png"
        vision.save_image(screenshot, path)
        log.info(f"Captured: {description}")

    try:
        capture_step("start")

        # Execute process
        open_application()
        capture_step("application_opened")

        perform_operation()
        capture_step("operation_completed")

        capture_step("successful_end")

    except Exception as e:
        capture_step("error")
        log.error(f"Error: {str(e)}")
        raise

Troubleshooting

Element Not Found

Symptoms: vision.find_image() or vision.find_text() returns None

Solutions:

Verify element is visible on screen
Reduce confidence level: confidence=0.8 instead of 0.9
Capture new template under current conditions
Use more specific search region
Try searching by text instead of image

Clicks Not Registered

Symptoms: Click executes but has no effect

Solutions:

Add time.sleep(0.5) after click
Verify application has focus
Use double_click() if necessary
Verify click coordinates with screenshot

Inaccurate OCR

Symptoms: Text read incorrectly

Solutions:

Increase contrast of captured region
Specify OCR language: vision.read_text(lang='eng')
Preprocess image (grayscale, thresholding)
Use region more tightly fitted to specific text
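Beyond the measures above, fields known to hold numbers can be cleaned after reading, because OCR often confuses letters with similar-looking digits. The following is a sketch, not a Heptora feature: `clean_numeric` is a hypothetical helper, the look-alike table is a small illustrative subset, and it assumes `,` is a thousands separator and `.` the decimal point.

```python
import re

# Characters that OCR engines commonly confuse with digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_numeric(text):
    """Normalize OCR output for a field known to contain a number.

    Returns a float, or None if no number can be recovered.
    """
    candidate = text.strip().translate(DIGIT_LOOKALIKES)
    candidate = candidate.replace(",", "")  # assumes ',' is a thousands separator
    match = re.search(r"-?\d+(?:\.\d+)?", candidate)
    return float(match.group()) if match else None


print(clean_numeric(" 1,2O4.5l "))
print(clean_numeric("n/a"))
```

Apply this kind of substitution only to fields that must be numeric. On free text it would corrupt legitimate letters.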

Slow Performance

Symptoms: Searches take too long

Solutions:

Use limited search regions
Reduce template resolution
Cache loaded templates
Avoid full screen searches

Frequently Asked Questions

Does computer vision work at any resolution?

Yes, but it’s recommended to capture templates at the same resolution at which the process will run. For processes that must run at several resolutions, prefer text search over image matching when possible.

Can I automate applications in headless mode?

No, computer vision requires the application to be visible on screen. For background automation, consider running the robot in a virtual machine with a visible desktop.

Does it work with applications in multiple languages?

Yes, but you must capture templates for each language or use text recognition with the appropriate language specified.

What if the interface changes slightly?

You can adjust the confidence level to tolerate minor variations. For major changes, you’ll need to update image templates.

Can I combine computer vision with web automation?

Absolutely. It’s common to use computer vision for legacy systems and web automation for modern systems in the same process.

How do I handle unexpected pop-ups?

Implement periodic checks:

if vision.image_exists("error_popup.png"):
    vision.mouse.click_image("close_popup_button.png")
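One way to make the check periodic is to run it before every step of the process. The sketch below shows that structure with stand-in callables. `run_steps` is a hypothetical helper, and in a real process `dismiss_popup` would wrap the image check and click shown above.

```python
def run_steps(steps, dismiss_popup):
    """Run each step in order, calling `dismiss_popup` before every step.

    `steps` is a list of zero-argument callables. `dismiss_popup` returns True
    if it closed a pop-up. Returns the number of pop-ups dismissed.
    """
    dismissed = 0
    for step in steps:
        if dismiss_popup():
            dismissed += 1
        step()
    return dismissed


# Stand-ins: a pop-up shows up only before the second step.
popup_sequence = iter([False, True, False])
executed = []
count = run_steps(
    [
        lambda: executed.append("open"),
        lambda: executed.append("fill"),
        lambda: executed.append("save"),
    ],
    dismiss_popup=lambda: next(popup_sequence),
)
print(count, executed)
```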

Need more help?

If this guide didn’t solve your problem or you found an error in the documentation:

Technical support: help@heptora.com
Describe the application you’re trying to automate
Include screenshots of the interface
Indicate which element you cannot locate or interact with
Mention operating system and screen resolution

Our team will help you design the best visual automation strategy for your specific case.

Related Resources

Process Recorder - Record visual actions automatically
OCR & Document Processing - Advanced OCR for documents
Process Builder - How to create visual automations
Secrets Management - Protect legacy system credentials
