I Wanted to See What Claude Could Do Inside GitHub Actions. So I Gave It a Package Upgrade.

Claude opened the PR, responded to every CodeRabbit comment with inline replies, pushed fixes, and posted a migration report — without a developer sitting in the loop once.

That was the experiment: wire Claude directly into GitHub Actions as an autonomous agent, give it a real task with real stakes, and see where it broke down. Non-backward compatible package upgrades were the test vehicle — well-defined scope, grep-able, verifiable, enough moving parts to stress-test the edges.

formidable v1.1.1 → v3.5.3 was the package.

Why package upgrades make a good test for autonomous agents

The manual process for a non-backward-compatible upgrade looks like this:

Read the migration guide and enumerate every breaking change
Grep the codebase — find every usage of every changed API
Write tests covering current usage before touching anything
Apply changes, one breaking change at a time
Re-run tests, chase regressions that weren't in the original baseline
Open a PR, get reviewed, respond to every comment

For formidable v1 → v3 — two major versions, a rewritten API surface — that's easily a sprint for one engineer. Every step is well-defined, grep-able, and verifiable. That's exactly what makes it a good test: if an agent can do this end-to-end reliably, it's doing something real.

The question was whether Claude, running headless in GitHub Actions, could handle all of it.

The idea: human at the edges, agents in the middle

The constraint we set before writing a line of config: agents never merge. A human opens the PR with domain context. A human reviews the result and decides to merge. Everything between those two actions — codebase analysis, test writing, migration, code review, review response — agents handle.

The result is a 5-phase pipeline:

Phase 0 (Human) → Phase 1 (Claude) → Phase 2 (CodeRabbit) → Phase 3 (Claude) → Phase 4 (Human)

Agentic Package Upgrades — full pipeline architecture

Phase 0 — Human Preparation

The human's job is to produce a breaking-change document. For formidable, this means enumerating every API change: old signature, new signature, ripgrep pattern to find usages, test scenarios to validate the migration.

Then open a PR with a metadata table:

| Package | From | To | Breaking-change doc | |---|---|---|---| | formidable | 1.1.1 | 3.5.3 | docs/breaking-changes/formidable/1.1.1-to-3.5.3.md |

This table is parsed by CI guard jobs. It's the only human-provided input that bootstraps the entire autonomous pipeline. The domain knowledge lives here — no agent can infer which usage patterns are safe vs. breaking without it.

Phase 1 — Claude: Autonomous Upgrade Cycle

Claude runs 7 steps in sequence:

A. Read context — CLAUDE.md, repo-context.md, the breaking-change doc. Claude needs to understand the repo conventions before touching anything.

B. Baseline tests — Run npm test, record every pre-existing FAIL. These are excluded when validating Phase 1's changes — Claude only flags new regressions, not pre-existing noise.

C. Impact analysis — Ripgrep the entire codebase against every pattern in the breaking-change doc. Classify each usage:

🔴 BREAKING — directly uses the changed API, will break without migration
🟠 INDIRECT — depends on something that uses the changed API
🟢 SAFE — no dependency on the changed API

D. Write tests — Cover every BREAKING usage before modifying source. This is non-negotiable. If tests don't exist, Claude creates them. Migration without a test suite is just hoping.

E. Apply migration — Smallest possible changes. One commit per breaking change. No refactoring, no cleanup — just migration.

F. Validate — Rerun tests, diff against the baseline recorded in step B. Only NEW failures are flagged.

G. Migration report — Structured checklist posted as a PR comment. Every BREAKING usage accounted for, every test result summarized, every commit annotated.

The output is a PR with the migration complete and the report attached. Human can re-trigger Claude with an @claude comment if it stalls at any step.

Phase 2 — CodeRabbit: Automated Code Review

CodeRabbit reviews all of Claude's changes automatically. Every inline comment is classified:

BLOCKING — must be fixed before merge
SUGGESTION — should probably be addressed
NITPICK — minor, low priority
FALSE_POSITIVE — CodeRabbit flagged something that isn't actually a problem

CodeRabbit's pull_request_review event fires when the review completes. That event triggers Phase 3.

Phase 3 — Claude responds to CodeRabbit (the part that required custom work)

No built-in solution exists for this. GitHub doesn't give you "Claude responds to CodeRabbit" out of the box. We built a separate GitHub Action workflow — upgrade-review-responder.yml — that:

Triggers on CodeRabbit's pull_request_review event
Fetches all unaddressed CodeRabbit inline comments across all reviews
Passes them to Claude with a classification prompt
Claude acts on each comment:
- BLOCKING / SUGGESTION → fix the code, commit, push, post inline reply with the commit SHA
- NITPICK / FALSE_POSITIVE → post a reasoned inline reply explaining why no change was made
Posts a summary of all actions taken as a PR-level comment

Two things that took iteration to get right:

GitHub API pagination. The API returns 30 comments per page. Without explicit pagination across all pages, Claude was fetching only the most recent 30 — which meant stale reviews from earlier rounds were ignored, and already-resolved comments were being re-evaluated. Had to add explicit pagination to the workflow and inject all pages into the prompt.

Inline replies vs. general comments. Claude's default behavior was to post general PR comments (/pulls/{id}/comments) instead of threading replies to the specific CodeRabbit comment that triggered the response. The fix was explicit: the workflow prompt specifies the pulls/comments/{id}/replies endpoint, and Claude uses the original comment's ID to thread the reply. Without this, the conversation becomes unreadable — general comments pile up at the bottom, disconnected from the review thread.

Guardrail: The Claude ↔ CodeRabbit cycle is capped at 3 rounds, tracked via PR label. If the PR is still unresolved after 3 rounds, human intervention is required. This is intentional.

Phase 4 — Human Merge Decision

The human reviews:

Claude's migration report (every breaking change addressed, every test result)
CodeRabbit's final resolution state
Claude's Phase 3 summary (what was fixed, what was marked false positive and why)

If it looks good: merge. If not: intervene with comments or /phase2 to trigger another Claude review response cycle.

Merge is always manual. The system cannot push to main.

The tradeoffs we accepted

Create-only. Agents can open PRs and push commits, but they can't merge. This was a deliberate constraint, not a technical limitation. Autonomous merging removes the human review step that catches things agents miss. The constraint adds friction — a human has to be in the loop at the end — but that friction is the point.

Headless CI, not Agent Teams. Claude's Agent Teams feature lets multiple Claude instances coordinate via shared task lists while a developer steers them. Our workflow is different: it runs in GitHub Actions with no human present during execution. The cross-system part — Claude fetching and responding to CodeRabbit comments via the GitHub API — is something Agent Teams don't cover. We had to build the orchestration layer ourselves. They're complementary, not competing.

Max 3 review cycles. If Claude and CodeRabbit can't resolve a comment in 3 rounds, something needs a human. This cap exists to prevent infinite loops on genuinely ambiguous review comments. In our POC runs, every cycle resolved within 2.

Breaking-change doc is mandatory. The pipeline doesn't work without it. Claude reads the doc to know what patterns to grep for, what the new API looks like, and what test scenarios to write. Skipping the doc means Claude has to infer breaking changes from the release notes and package diffs — which it can do, but unreliably. The doc is the human's domain knowledge contribution. It's worth the 30 minutes.

What the result looks like

PR #439 in cap-creatives-api: formidable v1.1.1 → v3.5.3, fully migrated.

Every breaking API usage identified and migrated
Tests written before migration, re-validated after
CodeRabbit review completed, every BLOCKING comment addressed with a commit and inline reply
Migration report showing the full before/after for every changed file
Zero unresolved review threads

Time from opening the PR (Phase 0) to merge-ready: one autonomous pipeline run. No engineer sitting in the loop during Phase 1-3.

What this pattern handles beyond package upgrades

We validated three future use cases during the POC:

CVE and security upgrades — same pattern, higher urgency. The breaking-change doc gets written from the CVE advisory. Everything else runs the same.

Repo health reports — Phase 1 (analysis + classification) runs without Phase 2-3 (migration + review). Output is a structured report of outdated dependencies, their breaking-change surface, and estimated migration effort.

Continuous documentation — Claude reads recent commits, diffs against existing docs, and posts a PR updating CLAUDE.md and repo-context.md. CodeRabbit reviews the doc changes. Same pipeline, different payload.

The thing nobody tells you about code review automation

The review response problem — Phase 3 — is the hardest part of this workflow to get right, and it's the part with no off-the-shelf solution.

Every code reviewer tool I've seen gives you the review. None of them close the loop: fetch the comments, evaluate them, fix what needs fixing, explain what doesn't, and thread the reply back to the original comment. That step is fully manual in every review workflow I've used. It's also the most cognitively expensive part for a developer: switch context, read comment, understand it, fix it, switch back, write reply, repeat 20 times.

What we built for Phase 3 is straightforward once you've mapped the GitHub API correctly. The decision criteria — what counts as BLOCKING vs. NITPICK vs. FALSE_POSITIVE — live in the prompt, not in hardcoded rules. That means the workflow is adjustable: if your codebase has patterns that CodeRabbit reliably misclassifies, you add that context to the prompt and Claude starts handling it correctly.

The orchestration isn't magic. It's GitHub API calls, a Claude prompt with classification rules, and inline reply threading. But because nobody had put those three things together in a headless CI workflow before, we had to build it ourselves — which meant hitting every edge case the hard way.

Use this yourself

The full workflow is open-sourced: github.com/harry-harish/agentic-package-upgrades

The repo contains the two GitHub Actions workflow files (upgrade-cycle.yml and upgrade-review-responder.yml), a breaking-change doc template, and a CLAUDE.md scaffold for the upgrade agent context.

What you need

Accounts and keys

Anthropic API key — Claude runs as a GitHub Actions step via the anthropics/claude-code-action
CodeRabbit installed on your repo — free tier covers this
GitHub Actions enabled

Required GitHub secrets

ANTHROPIC_API_KEY     # Claude API key
GITHUB_TOKEN          # already available in Actions — no manual setup

How it works in your repo

Write the breaking-change doc — the repo includes a template at docs/breaking-changes/TEMPLATE.md. Fill in: package name, old/new versions, every changed API with ripgrep patterns and test scenarios.
Open a PR with this metadata table in the description:

| Package | From | To | Breaking-change doc |
|---|---|---|---|
| your-package | 1.x | 2.x | docs/breaking-changes/your-package/1.x-to-2.x.md |

CI picks it up — upgrade-cycle.yml parses the table, passes the doc to Claude, and runs the full upgrade cycle (impact analysis → test writing → migration → validation → report).
CodeRabbit reviews Claude's changes automatically.
upgrade-review-responder.yml triggers on CodeRabbit's review event, fetches all inline comments, and Claude responds — fixing BLOCKING/SUGGESTION items, explaining NITPICK/FALSE_POSITIVE calls with inline replies.
You review and merge — or re-trigger with an @claude comment if anything needs another pass.

What to customize

The decision criteria in Phase 3 live entirely in the prompt inside upgrade-review-responder.yml. If CodeRabbit reliably misclassifies certain patterns in your codebase, add a note to that prompt — Claude will handle them correctly from the next run.

The 3-round cap is tracked via PR label (upgrade-cycle-round-1, etc.) and is adjustable in the workflow config.