A State Machine Made of Pixels


The awkward thing about automating a browser game is that, after a certain point, the browser stops being the application.

It is still there. You can still drive Chromium. You can still click buttons, wait for selectors, fill login forms, and listen to network requests. But once the real UI has loaded into a Unity canvas, the useful application state is no longer in the DOM in any meaningful way. There is no nice selector for “the clan gifts tab is active”. There is no structured event for “the first visible row changed after opening a gift”. There is just a rectangle of pixels.

That is the problem behind Screeny’s chest-counter runtime, merged in PR #236 on April 16, 2026. The feature sounds simple enough: open Total Battle, navigate to the clan gifts screen, capture gift rows, open them, OCR the row crops, and repeat on a schedule. But the interesting part was not the OCR. The interesting part was building a small visual state engine around an opaque game UI.

This is not conventional DOM automation. It is closer to writing a state machine whose sensors are screenshots.

The first mistake: one big script

The tempting version of this kind of automation is a long imperative script:

open browser
login
click clan
click gifts
loop:
  screenshot row
  click open
  wait

That version is attractive because it matches how a human would describe the job. It is also fragile in exactly the ways browser-game automation tends to be fragile. If login triggers 2FA, the script is lost. If a marketplace popup appears over the canvas, the script keeps clicking underneath it. If the click lands but the list does not advance, the script has no idea whether the game is slow, the coordinate is wrong, or the screen is no longer where it thinks it is.

The working version ended up as layered state engines.

The outer layer is a browser preparation step engine. In chest-counter/src/flow.rs, the flow is expressed as explicit CollectorSteps: launch the browser, open the site, accept cookies, open login, submit login, wait for the Unity game, dismiss post-login popups, open the clan screen, open the gifts tab, and finally capture the list. Each step has actions, success markers, blocked markers, and a timeout.

That distinction matters. The step does not merely “click login”. It says what action to take, which anchors prove success, and which anchors mean the run is blocked. A visible email 2FA prompt is not the same failure as an unknown blocking popup. A reconnect dialog while waiting for the Unity client is not the same as a slow page load.
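That step shape is easy to sketch in Rust. CollectorStep is the name the flow actually uses, but every field, the Anchor type, and the classify helper below are illustrative assumptions, not the real flow.rs API:

```rust
use std::time::Duration;

/// Evidence the engine can look for. Illustrative; the real anchors are richer.
#[derive(Debug, Clone, PartialEq)]
pub enum Anchor {
    Selector(&'static str),                      // DOM anchor, pre-canvas
    PixelSwatch { x: u32, y: u32, w: u32, h: u32 }, // pixel anchor, in-canvas
}

/// A preparation step: what to do, what proves it worked, what means blocked.
#[derive(Debug, Clone)]
pub struct CollectorStep {
    pub name: &'static str,
    pub actions: Vec<&'static str>,   // e.g. "click #login"
    pub success_markers: Vec<Anchor>, // any of these proves the step succeeded
    pub blocked_markers: Vec<Anchor>, // any of these means the run is blocked
    pub timeout: Duration,
}

#[derive(Debug, PartialEq)]
pub enum StepOutcome { Success, Blocked }

/// Classify the latest evidence. Blocked wins over success, and "no
/// evidence yet" (None) is distinct from failure: the engine keeps
/// polling until the step's timeout expires.
pub fn classify(seen: &[Anchor], step: &CollectorStep) -> Option<StepOutcome> {
    if step.blocked_markers.iter().any(|m| seen.contains(m)) {
        Some(StepOutcome::Blocked)
    } else if step.success_markers.iter().any(|m| seen.contains(m)) {
        Some(StepOutcome::Success)
    } else {
        None
    }
}
```

The return type carries the lesson: a matched blocked marker short-circuits before success is even considered, so a 2FA prompt is reported as a distinct blockage rather than a generic timeout.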

chest-counter/src/step_engine.rs then runs that plan. It captures trace screenshots around actions, polls anchor markers, handles stop requests, and gives the login challenge handler a chance to resolve 2FA before declaring the run blocked. That turns the browser preparation phase into something observable. A failed run has evidence attached to the step where it failed.

The second layer is the in-canvas Total Battle flow in chest-counter/src/browser.rs. That one has its own smaller state machine: CityIdle, OpeningClan, ClanModalOpen, OpeningGifts, GiftsScreenOpen, CapturingGifts, Done, and Error. This layer owns the Unity-specific work: canvas clicks, tab switching, row capture, “Open” button targeting, reward popup probes, and confirmation that the list actually advanced.
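The variant names below follow the states just listed; the evidence struct and the transition function are an illustrative sketch of how such a machine can advance on screenshot evidence, not the real browser.rs code:

```rust
/// States of the in-canvas flow, named as described for browser.rs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CanvasState {
    CityIdle, OpeningClan, ClanModalOpen, OpeningGifts,
    GiftsScreenOpen, CapturingGifts, Done, Error,
}

/// Facts extracted from the latest screenshot probe (illustrative fields).
#[derive(Debug, Clone, Copy)]
pub struct Evidence {
    pub clan_modal_visible: bool,
    pub gifts_panel_visible: bool,
    pub list_empty: bool,
}

/// Advance only when the screen proves the transition; otherwise stay
/// put and re-probe on the next tick.
pub fn advance(state: CanvasState, ev: Evidence) -> CanvasState {
    use CanvasState::*;
    match state {
        CityIdle => OpeningClan,
        OpeningClan if ev.clan_modal_visible => ClanModalOpen,
        ClanModalOpen => OpeningGifts,
        OpeningGifts if ev.gifts_panel_visible => GiftsScreenOpen,
        GiftsScreenOpen if ev.list_empty => Done,
        GiftsScreenOpen => CapturingGifts,
        CapturingGifts if ev.list_empty => Done,
        other => other, // no confirming evidence yet: hold state
    }
}
```

The guards are the point: OpeningClan does not become ClanModalOpen because a click was dispatched, only because a probe saw the modal.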

The third layer is the runtime worker in chest-counter/src/runtime_worker.rs. It turns the two lower layers into a service. It starts scheduled runs, retries failed runs immediately, honors safe stop/start transitions, persists timeline events, records artifacts, stores row metadata, and emits timing summaries. The worker is what makes this a runtime rather than a script you babysit.

That separation is the first lesson. Browser preparation, canvas navigation, and autonomous scheduling fail in different ways. Putting them behind one linear script would make every failure look like “the bot broke”. Splitting them into state engines gives each layer its own vocabulary for success, blockage, evidence, and recovery.

Pixels as state

Once the game is inside a canvas, screenshots become the API.

The collector has to infer state from evidence like this:

  • Is the gifts panel visible?
  • Which collection tab is active?
  • Is the list empty?
  • Where are the visible row bands?
  • Where is the green Open button inside the current row?
  • Did the screen change after the click?
  • Did a reward popup appear?
  • Does the next captured first row match the previous post-click first row?

None of those are DOM questions. They are image questions.

The robust version is dynamic scanning. The collector can crop the list panel and detect row bands by projection activity. It can search for the largest green connected component in the row’s open-button area. It can compare before/after crops with a visual change score. It can hash cropped regions and compare signatures. It can detect active tabs, empty states, and reward popups from image evidence.

This is resilient because it keeps asking the screen what actually happened. If a row is a few pixels taller than expected, dynamic row detection can still find it. If an Open button is shifted, component detection can target the green button instead of trusting a stale coordinate. If a click is swallowed, the before/after probes can say “no meaningful state change happened” instead of blindly moving on.
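As a concrete example of the change-probe idea, here is a minimal visual change score over two equally sized grayscale crops. The names echo the helpers mentioned in this post, but the implementation is a hypothetical stand-in, not the real one:

```rust
/// Mean absolute per-pixel difference between two equally sized
/// grayscale crops, normalized to 0.0..=1.0.
pub fn visual_change_score(before: &[u8], after: &[u8]) -> f64 {
    assert_eq!(before.len(), after.len(), "crops must match in size");
    if before.is_empty() {
        return 0.0;
    }
    let total: u64 = before
        .iter()
        .zip(after)
        .map(|(a, b)| (*a as i16 - *b as i16).unsigned_abs() as u64)
        .sum();
    total as f64 / (before.len() as f64 * 255.0)
}

/// A click "took" only if the before/after probe shows a meaningful change.
pub fn click_confirmed(before: &[u8], after: &[u8], threshold: f64) -> bool {
    visual_change_score(before, after) > threshold
}
```

A swallowed click scores near zero, so the caller can retry or re-anchor instead of blindly moving on.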

But dynamic scanning has a cost. Every broad screenshot has to be captured by the browser, encoded, transported back from Chromium, decoded, cropped, scanned, hashed, and compared. In the hot path, that cost compounds quickly. The expensive part is not only the image algorithm. It is the screenshot pipeline around the algorithm.

That is easy to underestimate. You look at a helper called visual_change_score or detect_row_regions and think about CPU. But the real loop may be doing repeated CDP screenshot captures, waiting between probes, and decoding images just to decide whether one click worked.

For a one-off debugging run, that is fine. For an autonomous collector that may open hundreds of rows, it is the difference between a plausible service and a sluggish science project.

The scary fast version: calibrated pixels

The opposite approach is calibrated pixels.

Instead of searching the whole UI for the gifts tab, configure the known tab click points. Instead of inferring the active tab from a broad region, sample a tiny swatch inside each tab body. Instead of rediscovering first-row geometry every time, configure the first row crop and row pitch. Model the coordinates against a 1920x1080 viewport and scale them to the runtime screenshot size.

This feels wrong at first. Hardcoded pixels are the sort of thing people make fun of in automation code, often for good reason. They are brittle. They depend on viewport, scale, UI layout, language, theme, and whatever the game designers changed this week.

But calibrated pixels are also extremely fast. A fixed Unity click point does not need image search. A 24x14 tab swatch is cheaper than analyzing a broad tab body. A known first-row crop is cheaper than rediscovering every candidate row. If the hot path is “open the next gift row and prove that it advanced”, small probes beat wide scans.

The Screeny implementation leans into this, but with an escape hatch. The README documents the calibrated tab swatches and click points. chest-counter/src/config.rs has typed configuration for CanvasPoint, CaptureRegion, GiftRowCaptureConfig, GiftsTabSwatchConfig, and the screenshot capture settings. If the UI drifts, the collector can be run with richer artifact policies, inspect the saved gifts-tab-switch-* captures, and recalibrate the boxes.

That makes the brittleness operational rather than mysterious. A stale coordinate is still a problem, but it is a problem with saved screenshots, labels, timing metadata, and a known recalibration path.

The useful compromise

The design I like most here is the hybrid.

The collector uses pinned coordinates for the hot path, but it does not blindly trust them. A click is only useful if a small amount of image evidence confirms the transition.

For tab switching, configured primary and fallback click points can replace broad candidate search. After the click, tiny tab swatches confirm which tab is active. The detector compares the swatches and treats the darker one as active, falling back to broader fixture/hash/luma checks when needed.

For row collection, the collector captures the row crop before opening it, chooses an Open click point, dispatches the canvas click, and then waits for proof. That proof can come from a reward popup probe or from list/row identity changes. The runtime stores which click point was selected, which point was accepted, how many click attempts were made, how long confirmation took, and which confirmation mode accepted the change.

There is an especially practical optimization in the change checkpoint logic. Instead of taking separate screenshots for every proof crop, the collector can compute one enclosing checkpoint region, capture it once, and crop the smaller regions in memory. That keeps the evidence model but reduces browser screenshot round-trips.

This is the core engineering lesson: do the minimum visual work that can prove the transition.

Not the minimum work that can probably get away with it. Not the maximum work that makes the system feel clever. The minimum work that can prove the state transition you are about to depend on.

In this system, that means:

  • use hardcoded coordinates where the UI is stable enough and the payoff is high;
  • use tiny swatches for cheap state confirmation;
  • use cropped probes rather than full-frame analysis when possible;
  • keep dynamic scanning where it buys safety, especially around row detection and fallback targeting;
  • persist enough artifacts and timing metadata to tune the calibrated parts later.

That last point matters. Fast automation without artifacts is a future debugging trap. Debuggable automation can afford to be more aggressive, because when it fails you can see what it thought it saw.

Screenshot format is part of the algorithm

One of the more humbling performance lessons was that screenshot format mattered.

The collector originally treated browser screenshots as PNG files by default. That is a reasonable instinct: PNG is lossless, and the row crops eventually feed OCR, so you do not want compression artifacts in the data you are trying to read.

But the browser screenshot path is also in the hot loop. Full-frame and probe screenshots are not just data. They are transport. Chromium has to encode them. The automation process has to receive them. The Rust side has to decode them. When you do that repeatedly, the image format becomes part of the runtime.

PR #236 added CHEST_COUNTER_BROWSER_SCREENSHOT_FORMAT=png|jpeg, CHEST_COUNTER_BROWSER_SCREENSHOT_JPEG_QUALITY, and CHEST_COUNTER_BROWSER_SCREENSHOT_OPTIMIZE_FOR_SPEED. On top of the state-machine work, moving browser screenshots from PNG to JPEG roughly halved per-screenshot time, from about 100 ms to about 50 ms.
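The format knob can be modeled as a small enum. The environment variable names are the ones PR #236 added; the parsing helper and the quality default of 80 are assumptions for illustration, not the real config.rs behavior:

```rust
use std::env;

/// Browser screenshot format, selected at runtime.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ScreenshotFormat {
    Png,
    Jpeg { quality: u8 },
}

/// Pure parser, kept separate from env access so it is easy to test.
pub fn parse_screenshot_format(fmt: Option<&str>, quality: Option<&str>) -> ScreenshotFormat {
    match fmt {
        Some("jpeg") => ScreenshotFormat::Jpeg {
            // Assumed default quality; the real default may differ.
            quality: quality.and_then(|q| q.parse().ok()).unwrap_or(80),
        },
        // PNG stays the default: lossless crops for OCR.
        _ => ScreenshotFormat::Png,
    }
}

/// Read the PR #236 knobs from the environment.
pub fn screenshot_format_from_env() -> ScreenshotFormat {
    parse_screenshot_format(
        env::var("CHEST_COUNTER_BROWSER_SCREENSHOT_FORMAT").ok().as_deref(),
        env::var("CHEST_COUNTER_BROWSER_SCREENSHOT_JPEG_QUALITY").ok().as_deref(),
    )
}
```

Defaulting to PNG when the variable is absent or unrecognized keeps the safe, lossless behavior unless someone opts into speed.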

That is the margin you feel in a visual state engine. If screenshots are in your inner loop, encoding and transport are in your inner loop too.

That does not mean “always use JPEG”. OCR row crops may still deserve lossless PNG, and debug artifacts may need to preserve exactly what the detector saw. The useful distinction is between image artifacts as records and screenshots as sensor reads. The former optimize for fidelity and inspectability. The latter also have to optimize for latency.

Runtime beats cleverness

The runtime layer is less glamorous than the vision logic, but it is what makes the system useful.

An autonomous collector needs more than a successful happy path. It needs to know when to start. It needs to retry after failure without manual intervention. It needs to stop safely in the middle of a run. It needs to leave behind enough timeline events to explain what happened. It needs to persist partial row captures before clicking Open, because after a successful open the row may be gone.

That last bit is a good example of runtime thinking. The system stages the row before opening it, then updates metadata after the open is confirmed. If the run dies after the click, the pre-open evidence is not lost. If the next row is ambiguous, the runtime can mark it for manual review instead of pretending the OCR queue has perfect input.

The worker also records timing summaries: average row cycle, average post-click confirmation duration, reward-probe detection offsets, click attempts, and an estimated game-week capacity from the observed row cycle. Those numbers are not just vanity metrics. They are how you decide whether a change actually made the collector faster or merely moved the waiting around.
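The game-week capacity figure is just arithmetic over the observed average row cycle. An illustrative version, not the real runtime_worker.rs summary code:

```rust
/// Rough capacity estimate: how many row opens fit into one game week
/// at the observed average row cycle time.
pub fn rows_per_game_week(avg_row_cycle_secs: f64) -> u64 {
    assert!(avg_row_cycle_secs > 0.0, "cycle time must be positive");
    let week_secs = 7.0 * 24.0 * 3600.0; // 604800 seconds
    (week_secs / avg_row_cycle_secs).floor() as u64
}
```

A change that shaves a second off the average cycle shows up directly in this number, which is exactly the "did it actually get faster" check the summaries exist for.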

What I would carry to the next project

The general pattern is not specific to Total Battle.

Any opaque UI automation problem has the same fork in the road. You can scan dynamically and pay for robustness. You can calibrate pixels and pay for brittleness. Or you can build a hybrid where fixed coordinates drive the hot path and small probes prove that each transition happened.

I would reach for dynamic scanning when:

  • the UI moves often;
  • the action is rare enough that performance does not matter much;
  • the cost of a false click is high;
  • I do not yet know the stable geometry.

I would reach for calibrated pixels when:

  • the same interaction happens hundreds of times;
  • the viewport can be controlled;
  • there is a cheap visual confirmation after the click;
  • failures produce artifacts that make recalibration straightforward.

The Screeny chest-counter runtime ended up in the middle. It is not a pure computer-vision system. It is not a pile of magic coordinates. It is a service built around visual state: explicit browser steps, a canvas flow state machine, scheduled runtime control, calibrated hot-path inputs, and enough image evidence to keep the whole thing honest.

That is the shape I trust most for this class of problem.

When the application state only exists as pixels, do not pretend you are still automating the DOM. Build the state machine where the state actually is.

Source pointers

The implementation discussed here came from Screeny PR #236, feat: add autonomous chest-counter service runtime. The main files to inspect are:

  • chest-counter/src/flow.rs for CollectorStep, actions, anchors, success markers, and blocked markers;
  • chest-counter/src/step_engine.rs for step execution, trace screenshots, stop handling, and 2FA recovery;
  • chest-counter/src/browser.rs for the Total Battle canvas state machine, tab switching, row capture, probes, dynamic scanning, and calibrated click paths;
  • chest-counter/src/runtime_worker.rs for scheduling, retries, safe stop/start, timeline events, artifacts, metadata, and timing summaries;
  • chest-counter/src/config.rs and chest-counter/README.md for calibrated coordinates, tab swatches, artifact policy, and screenshot format knobs.