Harish Kumar
war-story · gap · architect-mindset

The Test Failed. We Had Nothing to Debug It With.

How I built CapVision in a day — a UI automation observability platform that embeds session recordings, console logs, and network traces directly into WDIO test reports — because CI failures were costing us 4–6 hours each and we had nothing to debug them with.

December 20, 2025 · 9 min

The CI pipeline sends a failure notification. You open the report. It tells you the test name. It tells you the assertion that failed. It tells you the line number.

It tells you nothing about what the UI was actually doing.

No recording. No console output. No network trace. The report is a box score for a game nobody filmed. You know who lost. You have no idea why.

So you do what every engineer on every team in this situation does: you re-run it locally and hope you can reproduce the failure. If you can, you debug it. If you can't — if it was a race condition, a flaky animation state, a timing window that doesn't reproduce on your machine — you're debugging a ghost. You open Slack, ask if anyone else saw it, and file it as "intermittent" until it starts failing consistently enough to force action.

That cycle was costing roughly $2.88M a year in direct engineering time across the org. 4–6 hours per failure on average. 40% of fix attempts failing because they were based on hunches rather than evidence. 35% of failures requiring multiple engineers.

Nobody had written a ticket for this problem. It wasn't in anyone's sprint. It was just the cost of doing automation the normal way.

I decided to stop paying it.


The gap that nobody names

Functional automation test suites do exactly what they're designed to do: they verify that the application behaves correctly at the assertion level. Pass or fail. The test either got the expected output or it didn't.

What they don't do is capture what happened between the start of the test and the assertion. The UI state. The console errors. The network calls that failed silently. The animation that blocked a click. The race condition that resolved differently in CI than on your laptop.

This gap exists because it's not what test frameworks are built for. WDIO, Playwright, Cypress — they're assertion engines, not observability tools. The failure tells you where the code disagreed with the expectation. The report doesn't tell you what the application was doing when it disagreed.

The result is that every failed automation test triggers an investigation rather than a diagnosis. You're not debugging with evidence. You're reconstructing a scenario from memory and hope. 60% confidence, according to the data — which means 40% of the time you're fixing the wrong thing.

The problem isn't that tests fail. Tests are supposed to catch failures. The problem is that when they do, the artifact they leave behind is almost entirely useless for figuring out why.


What I didn't want to build

The obvious candidates already exist.

Percy, pixelmatch, and similar tools do visual regression testing — screenshot comparison between builds. That's a different problem. Screenshot diffing tells you that the UI looks different across versions, not what was happening when a specific test failed. It's a regression detection tool, not a debugging tool. And it requires running two versions of the app, not just one test run.

Log aggregation tools — Datadog, Elastic, similar — capture server-side and application logs. They don't capture what the browser was rendering. They don't show you the DOM state at the moment the test failed. And they require infrastructure that isn't embedded in a test report.

OS-level screen recording would give you video, but video files are large, unindexed, and unembeddable. You'd need a separate tool to view them, a storage backend to keep them, and a way to correlate the recording with the specific test failure. None of that integrates with the WDIO HTML report that engineers already open after a CI run.

The requirement I kept coming back to: whatever this tool produces needs to be in the report. Not linked from the report. Not generated alongside the report. Embedded in it. When an engineer opens the test results, the session replay, the console output, and the network log have to be right there, under the failed test. Zero additional tooling, zero additional URLs.

That constraint ruled out most of the existing ecosystem.


The architecture that fits the constraint

rrweb is a session recording library that works differently from screen capture. Instead of capturing pixels, it captures DOM mutations — every change to the document structure, CSS, attributes, text content. It serializes these as events, and a corresponding player library can reconstruct and replay the page state from those events, frame by frame, without the original application running.
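
A minimal sketch of how such a recording might be started, assuming the rrweb bundle is available in the page as an object with a `record()` function. `record()` with an `emit` callback, and the `recordCanvas` and `maskAllInputs` options, are part of rrweb's public API; the `startRecording` helper and its return shape are hypothetical and are not CapVision's actual code.

```javascript
// Hypothetical helper: start an rrweb recording and buffer its events in memory.
// `rrwebLib` is whatever object exposes rrweb's record() function.
function startRecording(rrwebLib, overrides = {}) {
  const events = [];
  const stop = rrwebLib.record({
    // rrweb calls emit() once per serialized DOM event.
    emit(event) {
      events.push(event);
    },
    recordCanvas: false, // canvas content is not captured
    maskAllInputs: true, // input values are masked in the recording
    ...overrides,
  });
  return {
    events,
    // record() returns a stop function; fall back to a no-op if it did not start.
    stop: typeof stop === "function" ? stop : () => {},
  };
}
```

The buffered `events` array is plain JSON-serializable data, which is what makes inlining it into a report practical.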

That distinction matters for the embedding requirement. An MP4 file is 50–200MB. A DOM event log for a 60-second test run is a few hundred KB of JSON. You can inline JSON into an HTML file. You can't inline a video.

The integration model: rrweb runs in the browser during the test, capturing events continuously. When the test ends, the events are serialized to a temporary JSON file in /tmp. If the test failed, the recording is moved to the reports directory. When all tests complete, a report enhancer parses the WDIO HTML report, inserts a collapsible player section under each failed test row, and embeds the rrweb player with the recording inline. The final HTML file is completely self-contained — no external URLs, no dependencies, no server required to view it.
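
The save-only-on-failure step can be sketched in a few lines of Node. This is an illustration under assumptions, not the shipped code: `persistRecording`, its parameter names, and the file naming scheme are all hypothetical.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: write events to a temp file, then keep the file only
// when the test failed (or when saveAll is set).
function persistRecording({ testName, events, passed, reportsDir, saveAll = false }) {
  const safeName = testName.replace(/[^a-z0-9_-]+/gi, "_");
  const tmpFile = path.join(os.tmpdir(), `capvision-${process.pid}-${safeName}.json`);
  fs.writeFileSync(tmpFile, JSON.stringify(events));

  if (passed && !saveAll) {
    // Passing test: discard the temporary recording.
    fs.rmSync(tmpFile, { force: true });
    return null;
  }

  fs.mkdirSync(reportsDir, { recursive: true });
  const target = path.join(reportsDir, `${safeName}.json`);
  // Copy then delete rather than rename: rename fails across filesystems.
  fs.copyFileSync(tmpFile, target);
  fs.rmSync(tmpFile, { force: true });
  return target;
}
```

Writing to the temp directory first keeps the reports directory free of recordings for tests that end up passing.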

This is ReportEnhancer.js — 581 lines that parse the HTML report, insert <tr> rows with embedded player divs, and write the modified HTML back to disk.
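
To make the embedding step concrete, here is a much smaller sketch of the same idea. It is not ReportEnhancer.js: the row markup (`class="... failed"` followed by a `data-test` attribute), the `capvision-*` class names, and the function names are all assumptions made for the example.

```javascript
// Serialize events so the JSON can never terminate the surrounding <script> tag.
function inlineJson(events) {
  return JSON.stringify(events).replace(/</g, "\\u003c");
}

// Hypothetical markup: each failed test is a row of the form
//   <tr class="... failed ..." data-test="NAME"> ... </tr>
// with the class attribute appearing before data-test.
function embedRecordings(html, recordingsByTest) {
  const failedRow =
    /<tr\b[^>]*class="[^"]*\bfailed\b[^"]*"[^>]*data-test="([^"]*)"[^>]*>[\s\S]*?<\/tr>/g;
  return html.replace(failedRow, (row, testName) => {
    const events = recordingsByTest[testName];
    if (!events) return row; // no recording for this test: leave the row as-is
    // testName comes from an HTML attribute, so it is already escaped.
    const playerRow =
      '<tr class="capvision-row"><td colspan="99">' +
      `<details><summary>Session replay: ${testName}</summary>` +
      '<div class="capvision-player"></div>' +
      '<script type="application/json" class="capvision-events">' +
      inlineJson(events) +
      "</script></details></td></tr>";
    return row + playerRow;
  });
}
```

A regex keeps the sketch short; a real enhancer would more likely use an HTML parser, since nested tables or reordered attributes would defeat this pattern. Escaping `<` in the inlined JSON matters because a recorded DOM snapshot can itself contain the text `</script>`.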

Four vendored files, all pre-bundled:

| File | Size | Purpose |
|---|---|---|
| capvision.min.js | 137 KB | rrweb recording library |
| capvision-player.min.js | 112 KB | rrweb replay library |
| console-record.min.js | 124 KB | rrweb console plugin |
| console-replay.min.js | 108 KB | rrweb console replay plugin |

Vendored because adding four npm dependencies to the package introduces lockfile churn, version resolution issues, and the possibility of upstream changes breaking a tool that engineers rely on to debug CI failures. The static assets are pinned to known-good versions and will not change unless someone deliberately updates them.


The network problem

Console capture was straightforward — the rrweb console plugin wraps console.log/info/warn/error with decorated functions that serialize arguments and emit them as rrweb events. All four log levels, up to 10,000 characters per entry, captured in the same event stream as the DOM recording.
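
The wrapping technique itself is simple enough to sketch. The following is not the rrweb plugin's source; it is a hand-rolled illustration of decorating the four console methods, with a hypothetical `wrapConsole` helper, a hypothetical event shape, and the assumption that the 10,000-character cap applies to the whole joined entry.

```javascript
const MAX_ENTRY_LENGTH = 10000; // per-entry cap quoted above
const LEVELS = ["log", "info", "warn", "error"];

// Best-effort conversion of one console argument to text.
function serializeArg(arg) {
  if (typeof arg === "string") return arg;
  if (arg instanceof Error) return arg.stack || arg.message;
  try {
    const json = JSON.stringify(arg);
    // JSON.stringify returns undefined for undefined and functions.
    return json === undefined ? String(arg) : json;
  } catch {
    return String(arg); // circular structures, BigInt, etc.
  }
}

// Replace each console method with a decorated version that emits an event
// and then calls the original. Returns a function that undoes the wrapping.
function wrapConsole(target, emit) {
  const originals = {};
  for (const level of LEVELS) {
    const original = target[level];
    originals[level] = original;
    target[level] = (...args) => {
      emit({
        level,
        timestamp: Date.now(),
        text: args.map(serializeArg).join(" ").slice(0, MAX_ENTRY_LENGTH),
      });
      return original.apply(target, args);
    };
  }
  return () => {
    for (const level of LEVELS) target[level] = originals[level];
  };
}
```

Because each emitted entry carries a timestamp, it can sit in the same stream as the DOM events and be replayed at the right moment.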

Network capture was not straightforward.

The obvious approach — XHR/fetch interception — would require injecting JavaScript into the application under test. That creates coupling between the observability tool and the app, and it doesn't capture requests made before the intercept is set up.

Chrome DevTools Protocol (CDP) gives access to network events at the browser level, before JavaScript runs. WDIO already uses CDP internally; the browser.cdp() API exposes it to test hooks. CDP can subscribe to Network.responseReceived events and report on every HTTP response the browser processes.

The implementation only captures error responses — status ≥ 400. Successful requests aren't interesting for debugging, and capturing everything would bloat the recording with API calls that aren't relevant to the failure.

The key design decision: network errors are re-emitted as console.error calls in the browser context. This sounds counterintuitive — why would you report a network error through the console? Because rrweb is already capturing the console. Re-emitting through console.error means the network errors appear in the same event stream as the console logs, at the correct timestamp, without needing a separate UI component or a separate data structure. One unified timeline instead of three separate panels.
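
A rough sketch of that routing, under assumptions: `browser` is taken to expose `cdp(domain, command)`, `on(eventName, handler)` for CDP events, and `execute(fn, ...args)`, as WDIO does when its DevTools integration is active. The helper names and the message format are hypothetical. The `params.response.status`, `statusText`, and `url` fields come from the CDP `Network.responseReceived` event.

```javascript
// Turn a Network.responseReceived payload into a log line, or return null
// when the response is not an error (status below 400).
function formatNetworkError(params) {
  const { status, statusText, url } = params.response;
  if (status < 400) return null;
  const label = statusText ? `${status} ${statusText}` : String(status);
  return `[network] ${label} ${params.type || "Other"} ${url}`;
}

// Subscribe to CDP network events and re-emit error responses through the
// page's console.error so the console recorder picks them up.
async function forwardNetworkErrors(browser) {
  await browser.cdp("Network", "enable");
  browser.on("Network.responseReceived", (params) => {
    const message = formatNetworkError(params);
    if (message === null) return;
    // The callback runs inside the page, not in the test process.
    browser.execute((text) => console.error(text), message).catch(() => {
      // The page may be navigating or closed; losing one log line is acceptable.
    });
  });
}
```

Keeping the filter in a pure function makes the threshold easy to test and easy to change if successful responses ever need to be captured too.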


The WDIO integration

WDIO integrations come in two official forms: Reporter classes (which receive test lifecycle events and produce output) and Service classes (which can hook into the browser lifecycle). CapVision uses neither.

The integration is a hook factory: createWDIOCapVisionHooks() returns a plain object with four lifecycle methods. Engineers add them to their wdio.conf.js:

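// Assumes createWDIOCapVisionHooks is imported from the package and that
// recorderConfig and enhancerConfig are defined earlier in wdio.conf.js.
const hooks = createWDIOCapVisionHooks(recorderConfig, enhancerConfig);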
exports.config = {
  onPrepare: hooks.onPrepare,     // clear old recordings folder
  beforeTest: hooks.beforeTest,   // init recorder, start rrweb, setup CDP
  afterTest: hooks.afterTest,     // stop recording, save if failed
  onComplete: hooks.onComplete,   // inject player into HTML report
};

Five lines of config change: one factory call and four hook entries. No class inheritance. No plugin registration. No understanding of WDIO internals required.

The reporter and service patterns are the "right" way to extend WDIO, but they require understanding the extension architecture before you can write anything. The hook factory is more flexible — hooks are already a first-class concept in WDIO, every wdio.conf.js already has them, and the integration is explicit: you can see exactly what fires at each lifecycle point without reading plugin documentation.
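
For readers who have not written one, a hook factory can look roughly like this. It is a sketch of the pattern only: `createRecordingHooks` and the four collaborators on `deps` are hypothetical names, and CapVision's real factory takes config objects instead. The hook names and the `afterTest(test, context, { passed })` signature are WDIO's own.

```javascript
// Hypothetical factory: returns a plain object whose keys match WDIO hook names.
// `deps` carries the collaborators so the sketch stays self-contained.
function createRecordingHooks(deps) {
  const state = { active: null };
  return {
    // Before any workers start: clear recordings from previous runs.
    onPrepare: async () => {
      await deps.clearRecordings();
    },
    // Before each test: start a recording session.
    beforeTest: async (test) => {
      state.active = await deps.startSession(test.title);
    },
    // After each test: stop recording, and keep the events only on failure.
    afterTest: async (test, _context, { passed }) => {
      const session = state.active;
      state.active = null;
      if (!session) return;
      const events = await session.stop();
      if (!passed) await deps.saveRecording(test.title, events);
    },
    // After all workers finish: post-process the HTML report.
    onComplete: async () => {
      await deps.enhanceReport();
    },
  };
}
```

One detail worth knowing: WDIO runs `onPrepare` and `onComplete` in the launcher process and `beforeTest` and `afterTest` in worker processes, so in-memory state like `state.active` is only shared between the two per-test hooks. Anything that has to cross that boundary goes through the filesystem.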


What "visual regression mode" actually means

The feature is called "visual regression mode." It has nothing to do with visual regression.

There is no pixel diff engine. No Percy integration. No screenshot comparison. No pixelmatch. "Visual regression mode" means saveAllRecordings: true.

By default, CapVision only saves recordings for failed tests. Visual regression mode saves recordings for all tests — passed and failed — so engineers can manually review UI behavior across a suite run, spot visual inconsistencies, and catch regressions that functional assertions don't detect.

Activated via:

VISUAL_REGRESSION=true ENABLE_RECORDING=true npx wdio --suite visual-regression
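
How those two environment variables might map onto the recorder settings, as a sketch: only `saveAllRecordings` is a name taken from the tool itself. The `enabled` field and both helper functions are hypothetical.

```javascript
// Hypothetical helper: derive recorder settings from the environment
// variables used in the command above.
function recorderConfigFromEnv(env = process.env) {
  const isOn = (value) => String(value || "").toLowerCase() === "true";
  return {
    enabled: isOn(env.ENABLE_RECORDING),
    saveAllRecordings: isOn(env.VISUAL_REGRESSION),
  };
}

// Decide whether a finished test's recording should be kept.
function shouldKeepRecording(config, testPassed) {
  if (!config.enabled) return false;
  return config.saveAllRecordings || !testPassed;
}
```

With both variables set, `shouldKeepRecording` returns true for every test; with only `ENABLE_RECORDING=true`, it returns true for failures alone.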

The name is a product decision, not a technical one. "Visual regression" is what the team calls the use case — reviewing UI behavior across test runs to catch things that pass assertions but look wrong. The implementation is just a toggle. The framing works because the tool is doing what the team needs, even if the name implies a capability it doesn't have.


What we gave up

DOM replay isn't video. rrweb reconstructs page state from serialized DOM events — it doesn't capture pixel output. Canvas elements aren't recorded (recordCanvas: false). WebGL, video, and custom rendering contexts won't replay. For most application UI, this doesn't matter; for canvas-heavy interfaces it's a real gap. Input values are masked (maskAllInputs: true) to avoid capturing credentials and PII in recordings.

Network capture is errors-only. CDP sees everything, but we only record responses with status ≥ 400. If a test failure is caused by a successful API call returning incorrect data, the network log won't show it. You'd have to look at the console output if the application logs the response, or add an explicit assertion. This is the right tradeoff for the debugging use case — the 5% of failures caused by incorrect 200 responses don't justify the noise of capturing every request — but it's a real limitation.

Storage is local. Recordings are JSON files on the filesystem, embedded inline in the HTML report. There's no cross-team sharing unless you send the report file. There's no searchable history. There's no way to compare recordings across runs. These are real gaps for teams that want to build long-term observability — but they're out of scope for what this is: a debugging artifact embedded in the test report that engineers look at immediately after a CI failure.

"Visual regression" ≠ pixel diff. If a future engineer reads the feature name and expects screenshot comparison, they'll be confused. The name will need to change or be clarified as the tooling matures.


What happened

v1.1.0 of @capillarytech/cap-ui-dev-tools shipped on November 12, 2025 with CapVision included.

USCRM deployment followed — enabled on the USCRM cluster for live failure monitoring, running on nightly builds.

The session replay embeds directly in the WDIO HTML report. 836 events captured, 01:03 total duration — the full flow visible, including a toast notification mid-run and network activity at the exact moment of failure.

CapVision session replay panel embedded in the WDIO HTML report — 836 events, full DOM replay with network activity overlay

Console output and UI state captured simultaneously in the same session. The split view shows Chrome DevTools console on the left — stack traces, API calls, field logs — alongside the DOM replay on the right, at the same timestamp.

Split view: Chrome DevTools console alongside CapVision DOM replay — UI state and console output from the same moment in the test run

The unified timeline means you don't correlate logs with screenshots manually. The console event and the UI state it corresponds to are already in sync.

CapVision unified timeline — console overlay and DOM state at the same playback position, delete confirmation modal visible

The numbers from the first weeks of operation:

  • MTTR: 260 min → 20 min. A 92% reduction in mean time to resolution.
  • Complex failure diagnosis: 3–5 days → 2–4 hours. The failures that used to require escalation to senior engineers now resolve within a single session.
  • Failed fix attempts: 40% → under 10%. Fixes are based on what the recording shows happened, not what engineers think probably happened.
  • Session adoption: >80% of the team using session replay within weeks of deployment.
  • 1,000+ sessions recorded per week at stable USCRM usage.
  • Avg debug session: under 15 minutes. Not four hours. Not two days. Fifteen minutes.

In December, CapVision was presented at the Monthly Engineering Metrics showcase as an engineering achievement. The projected annual savings: $1.67M, based on 930 engineering hours recovered per month.

The peer recognition quote: "addresses the debugging challenges the UI team has been experiencing."

That quote is accurate. It also undersells it. What CapVision addresses is the structural assumption that failure diagnosis is an investigative task. With a session replay, it isn't. It's a review task. You watch what happened. You fix what you see.


The thing about unplanned work

CapVision wasn't in a sprint. There was no ticket. No stakeholder asked for it. No OKR referenced it.

The gap was visible if you were looking for it — every engineer who ran a failing CI test and had nothing to debug it with was experiencing it. But visible gaps don't automatically become work. They become background frustration that people learn to live with, work around, and stop expecting to be fixed.

The cost of $2.88M annually doesn't show up in any one person's workload. It's distributed across dozens of engineers spending 4–6 hours debugging failures that a recording would resolve in 15 minutes. No single instance is expensive enough to force a decision. The aggregate is.

What made CapVision possible was treating the problem as worth solving without waiting for permission. The PRD, the build, and the merge happened in the same day. Not because it was easy — the CDP routing decision and the report embedding approach both required real investigation — but because a scoped tool with a clear constraint (embed everything in the report) gives you a bounded problem. Bounded problems are solvable in a day when you're not spending time negotiating scope.

The gap didn't have a name before CapVision. It had a symptom: engineers spending half a day debugging what a 15-minute replay would have shown them.


The report that told you the test failed but not why is not an acceptable artifact for a mature automation suite. It's a starting point dressed up as a deliverable.

CapVision is what a deliverable looks like: the failure, the recording, the console, the network trace, in one file, ready to open. Not linked. Not generated separately. There.

The test still fails. At least now you know why.