Visual State Engines for Opaque UIs
The awkward thing about automating some modern browser applications is that, after a certain point, the browser stops being the application.
It is still there. You can still drive Chromium. You can still click buttons, wait for selectors, fill forms, and listen to network requests. But once the real interface has loaded into a canvas, a remote desktop stream, a game engine, a video surface, or a heavily custom renderer, the useful application state may no longer exist in the DOM in any meaningful way.
There is no selector for “the correct tab is active”. There is no structured event for “the list advanced after that click”. There is no API response that tells you whether the modal you care about appeared. There is just a rectangle of pixels.
That changes the shape of the automation problem.
This is not conventional DOM automation. It is closer to writing a state machine whose sensors are screenshots.
The Wrong Shape
The tempting version of this kind of automation is a long imperative script:
```
open browser
log in
click the navigation item
click the target tab
loop:
    screenshot the row
    click the action button
    wait
```
That version is attractive because it matches how a human would describe the job. It is also fragile in exactly the way pixel-driven automation tends to be fragile. If login triggers a challenge, the script is lost. If an unexpected popup appears over the canvas, the script keeps clicking underneath it. If the click lands but the UI does not advance, the script has no idea whether the application is slow, the coordinate is wrong, or the screen is no longer where it thinks it is.
The working shape is usually not one big script. It is layered state engines.
One layer prepares the browser and account session. It knows about browser launch, navigation, login, consent banners, challenges, blocking popups, and whether the rendered app is ready. Another layer owns the opaque UI itself: canvas clicks, visual anchors, active tabs, list rows, button targeting, popup detection, and proof that a state transition happened. A third layer turns the whole thing into a runtime: scheduling, retries, safe stop/start, timeline events, artifacts, metrics, and operational recovery.
That separation matters. Browser preparation, visual navigation, and autonomous operation fail in different ways. If they are collapsed into one linear script, every failure looks like “the bot broke”. If they are separate state engines, each layer gets its own vocabulary for success, blockage, evidence, and recovery.
In practice, this means every step should answer four questions:
- What action am I about to take?
- What evidence proves it worked?
- What evidence means I am blocked?
- What artifacts should I keep if I am wrong?
Without those answers, the automation is mostly hoping.
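The four questions can be made concrete as a step protocol. This is a minimal sketch, not a prescribed framework; the `Step` shape and outcome labels are assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    """One automation step that must answer all four questions up front."""
    name: str
    action: Callable[[], None]          # what am I about to do?
    proves_success: Callable[[], bool]  # what evidence proves it worked?
    detects_block: Callable[[], bool]   # what evidence means I am blocked?
    artifacts: List[str] = field(default_factory=list)  # what to keep if wrong

def run_step(step: Step) -> str:
    """Execute one step and classify the outcome instead of hoping."""
    step.action()
    if step.detects_block():
        return "blocked"
    if step.proves_success():
        return "ok"
    # Neither success nor blockage evidence: keep step.artifacts for review.
    return "ambiguous"
```

The point of the shape is that "ambiguous" is a first-class outcome: a step that cannot prove success does not get to pretend it succeeded.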
Pixels as State
Once the useful interface is opaque, screenshots become the API.
The system has to infer state from evidence like this:
- Is the target panel visible?
- Which tab or mode is active?
- Is the list empty?
- Where are the visible row bands?
- Where is the action button inside the current row?
- Did the screen change after the click?
- Did a confirmation popup appear?
- Does the next captured row match what the previous click should have produced?
None of those are DOM questions. They are image questions.
The robust version is dynamic scanning. You crop meaningful regions and detect row bands by projection activity. You search for colored connected components that look like buttons. You compare before and after crops with visual change scores. You hash regions and compare signatures. You detect active tabs, empty states, and confirmation popups from image evidence.
This is resilient because it keeps asking the screen what actually happened. If a row is a few pixels taller than expected, dynamic row detection can still find it. If a button shifts, component detection can target the button instead of trusting a stale coordinate. If a click is swallowed, the before/after probes can say “no meaningful state change happened” instead of blindly moving on.
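Row-band detection by projection activity can be sketched in a few lines. This assumes a grayscale crop as a numpy array; the activity threshold and minimum band height are tuning assumptions, not universal constants:

```python
import numpy as np

def detect_row_bands(gray: np.ndarray, threshold: float = 5.0, min_height: int = 3):
    """Find horizontal row bands by projecting per-row pixel activity.

    A pixel row is "active" if its mean horizontal variation exceeds the
    threshold; consecutive active rows are merged into (top, bottom) bands.
    """
    # Per-row activity: mean absolute horizontal gradient of the crop.
    activity = np.abs(np.diff(gray.astype(float), axis=1)).mean(axis=1)
    active = activity > threshold
    bands, start = [], None
    for y, on in enumerate(active):
        if on and start is None:
            start = y
        elif not on and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    if start is not None and len(active) - start >= min_height:
        bands.append((start, len(active)))
    return bands
```

Because the bands come from the image itself, a row that is a few pixels taller than expected still produces a usable band instead of a missed click.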
But dynamic scanning has a cost.
Every broad screenshot has to be captured by the browser, encoded, transported back to the automation process, decoded, cropped, scanned, hashed, and compared. In the hot path, that cost compounds quickly. The expensive part is not only the image algorithm. It is the screenshot pipeline around the algorithm.
That is easy to underestimate. You look at a helper called something like visual_change_score and think about CPU. But the real loop may be doing repeated screenshot captures, waiting between probes, and decoding images just to decide whether one click worked.
For a one-off debugging run, that may be fine. For a runtime that needs to repeat the same operation hundreds or thousands of times, it is the difference between a plausible service and a sluggish science project.
Calibrated Pixels
The opposite approach is calibrated pixels.
Instead of searching the whole UI for a known tab, use a pinned click point. Instead of inferring active state from a broad region, sample a tiny swatch inside each tab body. Instead of rediscovering list geometry every time, configure the first row crop and row pitch. Model the coordinates against a known viewport and scale them to the runtime screenshot size.
This feels wrong at first. Hardcoded pixels are the sort of thing people make fun of in automation code, often for good reason. They are brittle. They depend on viewport, scale, layout, language, theme, and whatever the product designers changed this week.
But calibrated pixels are also extremely fast.
A fixed click point does not need image search. A tiny tab swatch is cheaper than analyzing a broad tab body. A known first-row crop is cheaper than rediscovering every candidate row. If the hot path is “open the next row and prove that it advanced”, small probes beat wide scans.
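Modeling coordinates against a known viewport can look like this. The reference size, point names, and coordinates are illustrative assumptions:

```python
# Click points recorded against a reference viewport, then rescaled to
# whatever size the runtime screenshot actually is.
REF_W, REF_H = 1920, 1080  # assumed calibration viewport

CALIBRATED = {
    "target_tab": (412, 96),        # hypothetical pinned click point
    "first_row_action": (1710, 284),
}

def scale_point(name: str, shot_w: int, shot_h: int) -> tuple:
    """Map a calibrated point onto the actual screenshot dimensions."""
    x, y = CALIBRATED[name]
    return round(x * shot_w / REF_W), round(y * shot_h / REF_H)
```

No image search, no decoding, no scanning: the hot path pays only for arithmetic, and the screenshot budget is spent on confirmation instead of discovery.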
The trick is to make the brittleness operational rather than mysterious.
Calibrated systems need:
- a controlled viewport model;
- explicit coordinate and region configuration;
- debug artifacts that show what the automation saw;
- timing metadata for slow or ambiguous transitions;
- a fast recalibration path when the UI shifts.
That makes a stale coordinate a normal maintenance event instead of a haunted failure. The coordinate is still brittle, but the system tells you which visual assumption broke and gives you the evidence needed to fix it.
The Useful Compromise
The design I trust most is the hybrid.
Use calibrated coordinates for the hot path, but do not blindly trust them. A click is only useful if a small amount of image evidence confirms the transition.
For tab switching, a primary and fallback click point can replace broad candidate search. After the click, tiny swatches can confirm which tab is active. If the swatch evidence is ambiguous, a broader region comparison can still act as fallback validation.
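The swatch confirmation described above can be sketched as a tiny color probe. The swatch regions, expected colors, and tolerance are invented for illustration, assuming an active tab body renders a distinct background color:

```python
import numpy as np

# (x, y, w, h) of a tiny swatch inside each tab body, plus the color that
# swatch shows when the tab is active. All values are hypothetical.
SWATCHES = {
    "orders":  ((120, 140, 8, 8), (240, 240, 255)),
    "history": ((320, 140, 8, 8), (240, 240, 255)),
}

def active_tab(screenshot: np.ndarray, tolerance: float = 20.0):
    """Return the tab whose swatch matches its active color, or None."""
    for name, ((x, y, w, h), expected) in SWATCHES.items():
        patch = screenshot[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)
        if np.abs(patch - np.array(expected)).max() <= tolerance:
            return name
    # Ambiguous evidence: caller should fall back to a broader comparison.
    return None
```

An 8x8 swatch is 64 pixels; even a naive comparison is orders of magnitude cheaper than analyzing the full tab body, and `None` cleanly routes to the fallback path.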
For row-oriented flows, the system can capture the row before taking action, choose a click point, dispatch the click, and then wait for proof. That proof might come from a confirmation popup, a list-region change, a row-identity comparison, or a hash match against the row that should have shifted into place.
There is an especially practical optimization here: avoid taking separate browser screenshots for every proof crop. Compute one enclosing checkpoint region, capture it once, and crop the smaller regions in memory. That keeps the evidence model but reduces screenshot round-trips.
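A minimal sketch of that optimization: compute the bounding box of all probe regions, capture it once, and slice the probes out in memory. Region tuples are `(x, y, w, h)` in screenshot coordinates:

```python
import numpy as np

def enclosing_region(regions):
    """Bounding box (x, y, w, h) covering every (x, y, w, h) probe region."""
    x1 = min(x for x, y, w, h in regions)
    y1 = min(y for x, y, w, h in regions)
    x2 = max(x + w for x, y, w, h in regions)
    y2 = max(y + h for x, y, w, h in regions)
    return x1, y1, x2 - x1, y2 - y1

def crop_probes(checkpoint: np.ndarray, checkpoint_origin, regions):
    """Slice each probe out of the already-captured checkpoint image."""
    ox, oy = checkpoint_origin
    return [checkpoint[y - oy:y - oy + h, x - ox:x - ox + w]
            for x, y, w, h in regions]
```

One browser round-trip replaces one per probe; the evidence model is unchanged, only the capture count drops.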
This is the core engineering lesson: do the minimum visual work that can prove the transition.
Not the minimum work that can probably get away with it. Not the maximum work that makes the system feel clever. The minimum work that can prove the state transition you are about to depend on.
In a good hybrid system:
- fixed coordinates handle the common path;
- tiny swatches confirm cheap state;
- cropped probes prove transitions;
- dynamic scanning remains available where drift is likely;
- artifacts and timing metadata make recalibration routine.
Fast automation without artifacts is a future debugging trap. Debuggable automation can afford to be more aggressive, because when it fails you can see what it thought it saw.
Screenshot Format Is Part of the Algorithm
One of the easiest performance lessons to miss is that screenshot format matters.
It is natural to start with PNG files. PNG is lossless, and if cropped images eventually feed OCR or visual matching, you do not want compression artifacts in the data you are trying to read.
But screenshots in the hot loop are not just data. They are transport. The browser has to encode them. The automation process has to receive them. The runtime has to decode them. When you do that repeatedly, the image format becomes part of the algorithm.
In one recent visual automation loop, moving browser screenshots from PNG to JPEG made each screenshot roughly twice as fast, dropping from about 100 ms to about 50 ms. That is not a micro-optimization when screenshot capture sits inside every click-confirmation cycle. It changes the feel of the whole state engine.
That does not mean “always use JPEG”. OCR crops, audit artifacts, and failure snapshots may still deserve lossless PNG. The useful distinction is between image artifacts as records and screenshots as sensor reads. Records optimize for fidelity and inspectability. Sensor reads also have to optimize for latency.
If screenshots are in your inner loop, encoding and transport are in your inner loop too.
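One way to encode that distinction is to pick the format by role. This sketch assumes Playwright's Python API, where `page.screenshot` accepts `type="jpeg"` with a `quality` setting and `type="png"` for lossless output:

```python
def screenshot_options(purpose: str) -> dict:
    """Choose encoding by role: sensor reads favor latency, records fidelity."""
    if purpose == "sensor":
        # Hot-loop probe: lossy JPEG roughly halved capture time in the
        # loop described above (~100 ms -> ~50 ms per screenshot).
        return {"type": "jpeg", "quality": 80}
    # Audit artifacts, OCR crops, failure snapshots: keep lossless PNG.
    return {"type": "png"}

# usage (hypothetical): page.screenshot(**screenshot_options("sensor"))
```

Centralizing the choice also makes it auditable: when a probe misbehaves, you can rule out compression artifacts by checking which role it was captured under.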
Runtime Beats Cleverness
The runtime layer is less glamorous than the vision logic, but it is what makes the system useful.
An autonomous visual collector needs more than a successful happy path. It needs to know when to start. It needs to retry after failure without manual intervention. It needs to stop safely in the middle of a run. It needs to leave behind enough timeline events to explain what happened. It needs to persist partial evidence before taking destructive or irreversible actions, because after a successful click the thing you meant to inspect may be gone.
That last point is important. In visual automation, the evidence often exists only before the action. Capture first, act second, confirm third. If the run dies after the click, the pre-action evidence should not be lost. If the next state is ambiguous, the runtime should be able to mark it for review instead of pretending the input was perfect.
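The capture-act-confirm ordering can be enforced structurally rather than by convention. A minimal sketch, where `capture`, `persist`, `act`, and `confirm` are hypothetical callables supplied by the runtime:

```python
def safe_operation(capture, persist, act, confirm) -> str:
    """Capture first, persist second, act third, confirm last."""
    evidence = capture()   # 1. capture the state the action will destroy
    persist(evidence)      # 2. write it to disk BEFORE acting, so a crash
                           #    after the click cannot lose the evidence
    act()                  # 3. only now take the irreversible action
    if not confirm():
        return "review"    # 4. ambiguous next state: flag, don't pretend
    return "ok"
```

Because `persist` runs before `act`, a run that dies mid-operation still leaves the pre-action evidence behind, and an unconvincing confirmation is surfaced instead of swallowed.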
The runtime should also measure itself:
- average operation cycle time;
- average post-click confirmation time;
- screenshot capture latency;
- retry counts;
- fallback path usage;
- ambiguous-state rate;
- artifact volume;
- estimated throughput under current timing.
Those numbers are not vanity metrics. They are how you decide whether a change actually made the system faster or merely moved the waiting around.
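A self-measuring runtime does not need much machinery. This is an illustrative counter sketch whose metric names mirror the list above; the field names and throughput formula are assumptions:

```python
import statistics

class RuntimeMetrics:
    """Minimal counters for an autonomous visual collector."""

    def __init__(self):
        self.cycle_times = []    # seconds per full operation cycle
        self.confirm_times = []  # seconds spent on post-click confirmation
        self.retries = 0
        self.fallbacks = 0       # times the fallback path was used
        self.ambiguous = 0       # states marked for review

    def summary(self) -> dict:
        avg_cycle = statistics.mean(self.cycle_times) if self.cycle_times else 0.0
        avg_confirm = statistics.mean(self.confirm_times) if self.confirm_times else 0.0
        return {
            "avg_cycle_s": round(avg_cycle, 3),
            "avg_confirm_s": round(avg_confirm, 3),
            "retries": self.retries,
            "fallback_uses": self.fallbacks,
            "ambiguous_rate": self.ambiguous / max(len(self.cycle_times), 1),
            # throughput estimate under current timing
            "est_ops_per_hour": round(3600 / avg_cycle) if avg_cycle else 0,
        }
```

The throughput estimate is the number that settles arguments: if a clever change halves `avg_confirm_s` but `est_ops_per_hour` does not move, the waiting just moved somewhere else.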
Where This Pattern Applies
This pattern is not specific to games. It shows up anywhere the useful state is rendered but not exposed:
- browser apps that hide their real interface in a canvas;
- remote desktop or VNC automation;
- streamed enterprise software;
- kiosk-style web views;
- legacy applications with poor accessibility hooks;
- visual QA systems;
- hardware dashboards rendered through video capture;
- any workflow where the only reliable source of truth is the screen.
All of these systems have the same fork in the road. You can scan dynamically and pay for robustness. You can calibrate pixels and pay for brittleness. Or you can build a hybrid where fixed coordinates drive the hot path and small probes prove that each transition happened.
Reach for dynamic scanning when:
- the UI moves often;
- the action is rare enough that performance does not matter much;
- the cost of a false click is high;
- you do not yet know the stable geometry.
Reach for calibrated pixels when:
- the same interaction happens many times;
- the viewport can be controlled;
- there is a cheap visual confirmation after the click;
- failures produce artifacts that make recalibration straightforward.
The practical design usually lands in the middle. It is not a pure computer-vision system. It is not a pile of magic coordinates. It is a service built around visual state: explicit steps, visual anchors, runtime control, calibrated hot-path inputs, and enough image evidence to keep the whole thing honest.
When the application state only exists as pixels, do not pretend you are still automating the DOM. Build the state machine where the state actually is.