Desktop automation has reached an inflection point. For two decades, Robotic Process Automation (RPA) dominated enterprise workflow automation through deterministic scripting. Today, a fundamentally different architecture, vision-language-action (VLA) GUI agents, challenges the assumption that automation requires brittle, hand-coded selectors. These are not competing products on the same spectrum; they represent distinct architectural paradigms optimized for different problem classes.
This article dissects both architectures at the systems level, examines where each fails, and analyzes how Mano-P, an open-source GUI agent project by Mininglamp Technology, implements the VLA paradigm with on-device inference.
RPA tools such as UiPath, Automation Anywhere, and Blue Prism operate on a selector-action model. Each automation step identifies a UI element via DOM path, CSS selector, accessibility attribute, or pixel coordinate, then executes a predefined action. This architecture carries four compounding failure modes:
DOM Coupling and Selector Fragility. A single UI update (a renamed button ID, a restructured div hierarchy, a relocated modal) breaks the entire downstream chain. Enterprise RPA deployments report that 30-40% of maintenance effort goes to selector repair after application updates. This is not a bug; it is the architectural consequence of coupling automation logic to implementation-specific element identifiers rather than semantic intent.
Maintenance Scaling. The relationship between automation count and maintenance burden is superlinear. Each new bot adds not just its own maintenance surface but interaction complexity with shared UI elements. Organizations with 200+ bots frequently employ dedicated "bot repair" teams larger than the original development team.
Cross-Application Boundaries. RPA operates within single-application contexts. Workflows spanning multiple applications require explicit handoff logic: clipboard operations, file watchers, inter-process communication hacks. A task trivial for a human ("copy this table from the PDF into the spreadsheet, then email it") becomes a fragile multi-stage pipeline with failure modes at every boundary.
Semantic Blindness. RPA has no understanding of what it is doing. It cannot distinguish a "Submit" button from a "Cancel" button except by selector match. When an application presents an unexpected dialog ("Are you sure you want to delete all records?"), a selector-based bot either crashes or, worse, proceeds with the wrong action. There is no reasoning layer to evaluate whether the current screen state matches the expected workflow context.
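The selector-coupling failure can be made concrete with a toy model. The sketch below represents a page as a dictionary keyed by element ID; the helpers `click_by_id` and `click_by_label` and all element names are hypothetical and do not correspond to any RPA vendor's API. It only illustrates why a lookup tied to an implementation identifier breaks on a redesign while a lookup tied to the visible label does not.

```python
# Hypothetical toy model of selector-based automation; not a real RPA API.

def click_by_id(page: dict, element_id: str) -> str:
    """Selector-style lookup: succeeds only if the exact ID still exists."""
    if element_id not in page:
        raise LookupError(f"selector '#{element_id}' matched no element")
    return page[element_id]["label"]


def click_by_label(page: dict, label: str) -> str:
    """Intent-style lookup: find any element whose visible label matches."""
    for element in page.values():
        if element["label"].lower() == label.lower():
            return element["label"]
    raise LookupError(f"no element labeled '{label}'")


def selector_survives(page: dict) -> bool:
    """True if the hard-coded '#btn-submit' selector still resolves."""
    try:
        click_by_id(page, "btn-submit")
        return True
    except LookupError:
        return False


page_v1 = {"btn-submit": {"label": "Submit"}, "btn-cancel": {"label": "Cancel"}}
# After a redesign the IDs change, but the visible labels do not.
page_v2 = {"primary-action": {"label": "Submit"}, "secondary-action": {"label": "Cancel"}}

print(selector_survives(page_v1), selector_survives(page_v2))
print(click_by_label(page_v2, "submit"))
```

The label-based lookup is only a stand-in for semantic grounding; a real agent would resolve intent from pixels rather than from a dictionary of labels.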
The evolution from scripted automation to intelligent agents follows a clear architectural progression:
┌────────────────────────────────────────────────────────────────┐
│ Generation 1: Selector-Action (RPA)                            │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐               │
│  │ Selector  │───▶│  Action   │───▶│ Selector  │───▶ ...       │
│  │ (brittle) │    │(hardcoded)│    │ (brittle) │               │
│  └───────────┘    └───────────┘    └───────────┘               │
│ Failure mode: any UI change breaks the chain                   │
├────────────────────────────────────────────────────────────────┤
│ Generation 2: Vision + LLM (Set-of-Marks, early agents)        │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐               │
│  │Screenshot │───▶│ LLM Plan  │───▶│ Click x,y │───▶ ...       │
│  │ + Labels  │    │(per-step) │    │(no verify)│               │
│  └───────────┘    └───────────┘    └───────────┘               │
│ Failure mode: no grounding, no error recovery                  │
├────────────────────────────────────────────────────────────────┤
│ Generation 3: VLA Unified Model (Mano-P)                       │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐               │
│  │  Visual   │───▶│ Reason +  │───▶│  Action   │───▶ Verify    │
│  │ Encoding  │    │  Ground   │    │  Predict  │       │       │
│  └───────────┘    └───────────┘    └───────────┘       │       │
│        ▲                                               │       │
│        └───────────────────────────────────────────────┘       │
│ Key: closed-loop perception-reasoning-action-verification      │
└────────────────────────────────────────────────────────────────┘
Generation 1 treats automation as scripting. Generation 2 adds perception but remains open-loop: screenshot in, coordinate out, with no verification that the action succeeded. Generation 3, implemented in Mano-P, closes the loop: the same model that perceives the screen also reasons about intent, predicts actions, and verifies outcomes before proceeding.
Mano-P, open-sourced by Mininglamp Technology under Apache 2.0, implements a unified Vision-Language-Action architecture where visual perception, language reasoning, and action prediction occur within a single model forward pass rather than as separate pipeline stages.
The VLA architecture unifies three traditionally separate capabilities into a single transformer backbone:
Visual Encoding. Raw screen frames are encoded through a vision transformer that produces spatial feature maps preserving both fine-grained element details (button text, icon shape) and global layout structure (window arrangement, relative positioning). Unlike Set-of-Marks approaches that overlay numbered labels onto screenshots, Mano-P's visual encoder learns to ground elements directly from pixel space, eliminating the information loss and visual clutter of annotation-based methods.
Language Reasoning. The language component serves dual functions: (1) interpreting the user's natural language task description and maintaining multi-turn dialogue context, and (2) generating explicit reasoning traces ("thinking") before committing to actions. This is not prompt engineering on top of a general LLM; the language reasoning is jointly trained with visual grounding and action prediction, creating shared representations where linguistic concepts ("the submit button in the bottom-right corner") map directly to spatial features in the visual encoding.
Action Prediction. The action head produces structured outputs (click coordinates, text input, keyboard shortcuts, scroll operations) grounded in the visual scene. Critically, actions are predicted from the model's internal visual representation, not from external element identifiers. This means the same "click the blue submit button" task executes correctly regardless of whether the button's DOM ID changed, its CSS class was renamed, or it moved 50 pixels to the right in a redesign.
The unified architecture means these three capabilities share gradient flow during training. Visual features that help action prediction get reinforced; language representations that improve visual grounding get strengthened. This is fundamentally different from pipeline architectures where each component is optimized independently.
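The structured outputs described under Action Prediction can be pictured as a small tagged union. The sketch below is an illustration only: the class names, the field names, and the choice of normalized [0, 1] coordinates are assumptions made for the example and are not taken from Mano-P's actual output schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical action schema; names and normalized coordinates are assumptions.

@dataclass(frozen=True)
class Click:
    x: float  # horizontal position in the current frame, normalized to [0, 1]
    y: float  # vertical position in the current frame, normalized to [0, 1]

@dataclass(frozen=True)
class TypeText:
    text: str

@dataclass(frozen=True)
class Hotkey:
    keys: Tuple[str, ...]  # e.g. ("cmd", "s")

@dataclass(frozen=True)
class Scroll:
    dx: int
    dy: int

Action = Union[Click, TypeText, Hotkey, Scroll]


def to_pixels(click: Click, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized click to integer pixel coordinates, clamped to the frame."""
    px = min(max(int(round(click.x * width)), 0), width - 1)
    py = min(max(int(round(click.y * height)), 0), height - 1)
    return px, py


print(to_pixels(Click(0.5, 0.5), 1920, 1080))
```

Because nothing in this representation refers to a DOM ID or CSS class, a renamed identifier cannot invalidate an action; only a change in what is visible on screen can.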
Mano-P's training follows a carefully designed progression that mirrors how humans learn complex tasks:
Stage 1: Supervised Fine-Tuning (Behavior Cloning). The model learns from expert demonstrations: recorded sequences of (screen state, reasoning, action) tuples collected from human operators completing real tasks. This establishes baseline competency: the model learns what correct action sequences look like for common workflows. However, behavior cloning alone produces a model that imitates the mean of demonstrations without understanding why certain actions are better than others.
Stage 2: Offline Reinforcement Learning (Advantage Learning). Using pre-collected trajectories (both successful and failed), the model learns to distinguish good actions from bad ones without additional environment interaction. The advantage function estimates how much better a particular action is than the policy's average behavior at that state. This stage is critical for sample efficiency: it extracts maximum learning signal from existing data before expensive online exploration. The model learns failure recovery patterns: what to do when a click misses, when a dialog appears unexpectedly, when a page loads slowly.
Stage 3: Online Reinforcement Learning (Environment Interaction). The model interacts with live environments (real operating systems, real applications) and receives reward signals based on task completion. This stage handles the distribution shift between demonstration data and real-world conditions: applications update, screen resolutions vary, timing differs. Online RL fine-tunes the policy to handle edge cases that never appeared in demonstrations, producing robust behavior under novel conditions.
This three-stage pipeline (SFT → Offline RL → Online RL) progressively builds from imitation to understanding to adaptation. Each stage addresses a specific limitation of the previous one.
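The advantage idea from Stage 2 can be illustrated with a small numerical sketch. It assumes a deliberately simple estimator (the logged return minus the mean logged return at the same state) and an exponential, advantage-weighted imitation weight in the style of advantage-weighted regression. The source does not say which estimator or objective Mano-P uses, and the state and action names are hypothetical.

```python
import math
from collections import defaultdict

# Illustrative only: estimator and weighting are assumptions, not Mano-P's method.

def estimate_advantages(transitions):
    """transitions: list of (state, action, return_) from logged trajectories.
    Baseline V(s) is the mean logged return at s; A(s, a) = return_ - V(s)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, _action, ret in transitions:
        totals[state] += ret
        counts[state] += 1
    return [(s, a, ret - totals[s] / counts[s]) for s, a, ret in transitions]


def imitation_weights(advantages, temperature=1.0):
    """Weight each logged action by exp(A / temperature) for weighted cloning."""
    return [(s, a, math.exp(adv / temperature)) for s, a, adv in advantages]


logged = [
    ("login_page", "click_submit", 1.0),  # trajectory that completed the task
    ("login_page", "click_cancel", 0.0),  # trajectory that failed
    ("popup", "dismiss", 1.0),
]
adv = estimate_advantages(logged)
weights = imitation_weights(adv)
for (state, action, a), (_, _, w) in zip(adv, weights):
    print(f"{state:12s} {action:14s} advantage={a:+.2f} weight={w:.2f}")
```

Plain behavior cloning would weight all three logged actions equally; the weighting above pushes the policy toward the action that led to success and away from the one that did not, using only existing data.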
Unlike open-loop systems that predict an action and immediately move to the next step, Mano-P implements a closed-loop mechanism:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  THINK   │────▶│   ACT    │────▶│  VERIFY  │────▶│  THINK   │
│          │     │          │     │          │     │  (next)  │
│ Reason   │     │ Execute  │     │ Confirm  │     │          │
│ about    │     │ grounded │     │ expected │     │ Continue │
│ current  │     │ action   │     │ outcome  │     │ or retry │
│ state    │     │          │     │ achieved │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
Think: The model generates explicit reasoning about the current screen state, the overall task progress, and what action should come next. This reasoning trace is not just for interpretability; it actively improves action quality by forcing the model to articulate its understanding before committing.
Act: Based on the reasoning, the model predicts and executes a grounded action (click, type, scroll, keyboard shortcut). Actions are specified in the visual coordinate space of the current frame.
Verify: After action execution, the model captures the resulting screen state and evaluates whether the expected outcome occurred. Did the button click actually navigate to the expected page? Did the text input appear in the correct field? If verification fails, the loop returns to THINK with updated context about the failure mode, enabling error recovery without human intervention.
This closed-loop architecture is what separates GUI agents from sophisticated screen scrapers. The verification step means Mano-P can handle the non-determinism of real desktop environments (network latency, animation delays, unexpected popups) without pre-programmed exception handlers.
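The think-act-verify cycle can be sketched as a control loop. Everything below is a hypothetical stand-in: `FakeEnv` simulates a screen where a popup blocks the save button, and `think` is a hand-written rule in place of model inference. The point is the control flow, in which a failed verification feeds back into the next reasoning step.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the environment and the model; not Mano-P code.

@dataclass
class FakeEnv:
    """Toy environment: a popup blocks the save button until it is dismissed."""
    popup_open: bool = True
    saved: bool = False

    def screenshot(self) -> dict:
        return {"popup_open": self.popup_open, "saved": self.saved}

    def execute(self, action: str) -> None:
        if action == "dismiss_popup":
            self.popup_open = False
        elif action == "click_save" and not self.popup_open:
            self.saved = True


def think(screen: dict, history: list):
    """Return (action, expected-outcome check). Recovers after a failed save."""
    if history and history[-1] == ("click_save", False):
        return "dismiss_popup", lambda s: not s["popup_open"]
    return "click_save", lambda s: s["saved"]


def run_task(env: FakeEnv, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        screen = env.screenshot()
        if screen["saved"]:
            return True, history
        action, expected = think(screen, history)  # THINK
        env.execute(action)                        # ACT
        ok = expected(env.screenshot())            # VERIFY
        history.append((action, ok))               # failure context for next THINK
    return env.screenshot()["saved"], history


done, trace = run_task(FakeEnv())
print(done, trace)
```

An open-loop agent would issue the first `click_save` and move on; here the failed check is recorded and the next step chooses a recovery action.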
Running a VLA model on consumer hardware requires aggressive inference optimization. Mininglamp Technology developed GSPruning (Geometric-Semantic Pruning) specifically for GUI agent workloads, addressing the unique challenge of pruning visual tokens while preserving spatial grounding accuracy.
Standard token pruning methods (attention-based, random dropping) catastrophically degrade GUI agent performance because they disrupt spatial relationships: the model can no longer accurately predict where to click if tokens representing spatial structure are removed arbitrarily.
GSPruning solves this through two complementary mechanisms:
Anchor-Based Spatial Structure Preservation. The algorithm identifies "anchor tokens", visual tokens that serve as spatial reference points for the broader scene (window corners, toolbar boundaries, prominent UI landmarks). These anchors are never pruned, maintaining the geometric scaffold that enables accurate coordinate prediction. Remaining tokens are pruned based on redundancy with nearby anchors, ensuring spatial density stays uniform rather than creating gaps that distort coordinate mapping.
Semantic Outlier Detection. Tokens whose semantic content is highly atypical relative to their spatial neighborhood are preserved regardless of pruning pressure. A notification badge on an otherwise uniform toolbar, a highlighted menu item among gray siblings, an error message in a standard form: these semantically salient tokens carry disproportionate task-relevant information. Standard importance-based pruning often removes them (they have low attention mass because they are atypical), but GSPruning explicitly protects them.
The combined effect: 2-3x throughput improvement with minimal accuracy degradation. On a MacBook Pro M5 Pro, this translates to approximately 80 tokens/second decode speed, fast enough for real-time interactive use without cloud dependency.
Mano-P's architecture includes a bidirectional data flywheel between the agent model and the action prediction component. Successfully completed tasks generate new high-quality training data for the action predictor; improved action prediction enables the agent to complete harder tasks, which generates even richer training data. This self-reinforcement mechanism means the model improves with deployment: each successful real-world task execution contributes to future capability, without requiring manual data collection or annotation.
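The two mechanisms described above (protected anchors and protected semantic outliers) can be illustrated on a toy grid. This is not the GSPruning algorithm: the fixed lattice of anchors, the scalar token features, the 8-neighborhood outlier test, and all thresholds are assumptions chosen to keep the example small.

```python
# Toy illustration of anchor- and outlier-preserving pruning; not GSPruning itself.

def prune_tokens(grid, anchor_stride=2, outlier_threshold=0.5, redundancy_threshold=0.1):
    """grid: 2D list of scalar token features (a stand-in for embedding vectors).
    Returns the set of (row, col) positions kept after pruning."""
    rows, cols = len(grid), len(grid[0])
    kept = set()
    for r in range(rows):
        for c in range(cols):
            # 1. Anchors: a fixed lattice stands in for learned spatial anchors.
            if r % anchor_stride == 0 and c % anchor_stride == 0:
                kept.add((r, c))
                continue
            # 2. Semantic outliers: tokens far from the mean of their 8-neighborhood.
            neighbors = [grid[i][j]
                         for i in range(max(0, r - 1), min(rows, r + 2))
                         for j in range(max(0, c - 1), min(cols, c + 2))
                         if (i, j) != (r, c)]
            if abs(grid[r][c] - sum(neighbors) / len(neighbors)) > outlier_threshold:
                kept.add((r, c))
                continue
            # 3. Otherwise keep the token only if it differs from the anchor
            #    of its own lattice cell (i.e. it is not redundant with it).
            anchor = grid[r - r % anchor_stride][c - c % anchor_stride]
            if abs(grid[r][c] - anchor) > redundancy_threshold:
                kept.add((r, c))
    return kept


# A uniform 4x4 "toolbar" with one salient token (e.g. a notification badge).
toolbar = [[0.0] * 4 for _ in range(4)]
toolbar[1][3] = 1.0
kept = prune_tokens(toolbar)
print(sorted(kept), f"{len(kept)}/16 tokens kept")
```

In this toy case the four lattice anchors keep the spatial scaffold evenly spaced, and the single atypical token survives even though it is not an anchor.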
The architectural advantages manifest in benchmark results:
| Benchmark | Mano-P (72B) | Comparison |
|---|---|---|
| OSWorld | 58.2% | 72B internal benchmark model |
| WebRetriever NavEval (Protocol I) | 41.7 | vs Gemini 2.5 Pro: 40.9, Claude 4.5 Sonnet: 31.3 |
The open-source release is a 4B-parameter model, deliberately sized for on-device deployment rather than maximum benchmark scores; the figures in the table are reported for the 72B configuration. The WebRetriever Protocol I result of 41.7 on NavEval places Mano-P narrowly ahead of Gemini 2.5 Pro (40.9) and well ahead of Claude 4.5 Sonnet (31.3) on real-world web navigation tasks.
Running a VLA model locally requires more than model architecture innovation; it demands inference engine optimization at the hardware level. Mininglamp Technology's open-source Cider SDK provides production-grade quantization specifically tuned for Apple Silicon's Unified Memory Architecture (UMA).
W8A8 and W4A8 Activation Quantization. Cider implements weight-and-activation quantization (not weight-only) that exploits Apple Silicon's hardware integer units. W8A8 (8-bit weights, 8-bit activations) achieves approximately 12.7% prefill speedup with negligible accuracy loss. W4A8 (4-bit weights, 8-bit activations) pushes further for memory-constrained deployments.
1.4-2.2x End-to-End Speedup. Across different model configurations and hardware targets, Cider delivers 1.4-2.2x throughput improvement over naive FP16 inference. Combined with GSPruning's 2-3x token throughput gain, the full stack achieves real-time GUI agent performance on consumer laptops.
UMA-Aware Memory Management. Unlike discrete GPU systems where data must cross PCIe boundaries, Apple Silicon's unified memory allows CPU and GPU to share the same physical memory. Cider's memory allocator exploits this: model weights, KV cache, and visual features coexist in a single address space without copy overhead, reducing both latency and peak memory footprint.
The critical privacy implication: data never leaves the machine. Screen frames, task descriptions, reasoning traces, and action sequences all stay in local memory. There is no telemetry, no cloud dependency for inference, no API calls that transmit screen content to external servers. For enterprises handling sensitive documents, financial data, or personal information, this is not a feature; it is a requirement.
The choice between RPA and GUI agents is not about "old vs new"; it is about matching the automation architecture to the problem characteristics:
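The difference between weight-only and weight-and-activation quantization can be shown with a small W8A8 dot product. The sketch below uses symmetric per-tensor int8 quantization and pure Python integers; it is a generic textbook scheme written for illustration, not Cider's implementation or API, and the function names are hypothetical.

```python
# Generic symmetric int8 quantization sketch; not the Cider SDK's implementation.

def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights and 8-bit activations.
    Accumulation happens on integers; one float multiply rescales the result."""
    qw, sw = quantize_int8(weights)
    qa, sa = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer multiply-accumulate
    return acc * sw * sa


weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because both operands are integers, the inner loop can run on integer hardware; a weight-only scheme would have to convert weights back to floating point before each multiply.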
| Dimension | RPA (Selector-Action) | GUI Agent (VLA) |
|---|---|---|
| Best for | Stable, high-volume, single-app workflows | Cross-app, UI-volatile, reasoning-required tasks |
| Failure mode | Silent breakage on UI change | Graceful degradation with error recovery |
| Maintenance | Linear-to-superlinear with bot count | Model update covers all tasks simultaneously |
| Cross-app | Requires explicit integration | Native: same model operates any application |
| Speed | Millisecond actions (no reasoning) | Seconds per step (perception + reasoning) |
| Determinism | 100% deterministic (when working) | Probabilistic (verify loop adds reliability) |
| Setup cost | Per-workflow scripting | One model deployment, natural language tasks |
RPA remains optimal for stable, high-volume, latency-sensitive workflows within a single application that rarely updates: payroll processing in legacy systems, mainframe data entry, report generation from stable internal tools. These are problems where the rigidity of selector-based automation is a feature (guaranteed determinism) rather than a bug.
GUI agents excel where RPA structurally cannot: workflows spanning multiple applications, tasks requiring visual understanding of unstructured content, environments that update frequently, and scenarios where the automation must handle unexpected states gracefully.
The future likely involves hybrid deployments: RPA handles the stable, high-throughput inner loops while GUI agents manage the cross-application orchestration, exception handling, and dynamic adaptation layers. The architectures are complementary at the systems level, even as they compete at the individual task level.
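The decision criteria above can be expressed as a simple routing rule for a hybrid deployment. The sketch below is one possible reading of the comparison table; the `Task` fields, the `MAX_UI_CHANGES_PER_YEAR` cutoff, and the engine labels are all hypothetical and would need tuning against real workloads.

```python
from dataclasses import dataclass

# Hypothetical routing heuristic derived from the comparison table; values are assumptions.

MAX_UI_CHANGES_PER_YEAR = 4  # assumed cutoff for "UI-volatile"


@dataclass(frozen=True)
class Task:
    name: str
    apps_involved: int
    ui_changes_per_year: int
    needs_visual_reasoning: bool


def choose_engine(task: Task) -> str:
    """Route cross-app, volatile, or reasoning-heavy tasks to the GUI agent;
    keep stable single-app tasks on deterministic RPA."""
    volatile = task.ui_changes_per_year > MAX_UI_CHANGES_PER_YEAR
    if task.apps_involved > 1 or task.needs_visual_reasoning or volatile:
        return "gui_agent"
    return "rpa"


payroll = Task("legacy payroll run", apps_involved=1,
               ui_changes_per_year=0, needs_visual_reasoning=False)
pdf_to_mail = Task("copy PDF table to spreadsheet, then email it", apps_involved=3,
                   ui_changes_per_year=6, needs_visual_reasoning=True)

for task in (payroll, pdf_to_mail):
    print(f"{task.name}: {choose_engine(task)}")
```

A production router would likely also account for latency budgets and volume, since the table puts RPA at millisecond actions and GUI agents at seconds per step.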
Mano-P's open-source availability (Apache 2.0) and on-device architecture lower the barrier to evaluating where VLA-based automation fits within existing enterprise automation stacks. The 4B parameter open-source model runs on a MacBook, so evaluation requires no cloud infrastructure, no API keys, and no data leaving the organization.