Harish Kumar
war-story, architect-mindset

Every Node Service. One Upgrade. No Big Bang.

How we upgraded RabbitMQ 3.8 to 4.1 across three Node services using an aliased V1/V2 connector — making each service migration independent, rollback-capable, and boring at go-live.

January 1, 2026 · 9 min

Every Node service in the Engage platform routes messages through RabbitMQ. When RabbitMQ 3.8 hit end-of-life, that single fact turned a version bump into a platform-wide problem.

The naive path: upgrade the broker, update the SDK, redeploy everything at once, hope nothing breaks. If something does break in production, you're coordinating a rollback across every service simultaneously while engineers from multiple teams are on the call and users are watching queues drain.

We didn't do it that way.


The Blast Radius Problem

The Engage platform has three Node services consuming from RabbitMQ: nfs-node-api, arya, and cap-creatives-api. Different teams touch these services. Different release cadences. Different risk tolerance. A single shared infrastructure dependency tying all of them together.

RabbitMQ 4.1 introduced breaking changes in the SDK. You can't just swap the version string and ship it. The connector interface changed. Any service using the old connector against the new broker fails — not gracefully, not silently. It fails in ways that are immediately visible and immediately painful.

So the problem isn't "upgrade RabbitMQ." The problem is: how do you upgrade a shared infrastructure dependency across multiple services without requiring them all to move at the same time, and without leaving yourself unable to roll back if something goes wrong?


What we considered

Big bang cutover — upgrade the broker and all three services in a single coordinated deployment window. Technically feasible. Operationally high-risk: if anything breaks, you're rolling back across three services simultaneously while production queues drain. With three services owned by different teams on different release cadences, the coordination overhead alone is expensive.

Blue-green broker deployment — run parallel RabbitMQ 3.8 and 4.1 clusters, migrate services one at a time, decommission the old cluster when done. Eliminates the coordinated cutover risk but doubles infrastructure cost during the transition and requires cross-team agreement on a shared timeline.

Per-service independent migration with no shared tooling — each team handles their own SDK upgrade separately, on their own schedule. Minimal coordination, but no consistent rollback mechanism and no shared understanding of which services are on which version at any given moment.

Aliased V1/V2 connector with feature flags — the same package exports both connector versions. Teams import by name. Flags control which version routes traffic. Each migration is one flag flip; rollback is the same. Chosen because it gives the independence of per-service migration with the consistency of a shared tooling approach, and the flag means rollback doesn't require a deployment.


The Approach: Make V1 and V2 Coexist

The core idea is an aliased dependency.

AMQPConnectorV2 is built to run alongside AMQPConnectorV1. Same codebase, same package, two named exports. V1 targets RabbitMQ 3.8. V2 targets 4.1. Each consumer service imports the connector it needs by name. A feature flag controls which version routes its traffic. Flipping the flag is the migration. Flipping it back is the rollback.

No code change. No deployment. Just a flag.

This transforms the migration from a high-risk coordinated cutover into a sequence of low-risk independent moves. Migrate nfs-node-api this sprint, watch it for a week, then migrate arya, then cap-creatives-api. If anything goes sideways at any point, the rollback is immediate and doesn't require touching other services.

The aliased approach also forces something useful: it makes the migration an explicit, versioned software engineering problem rather than an ops task you do once and forget. You can read the connector import and the flag state and know exactly which version of the broker that service is talking to.
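The coexistence described above can be sketched in TypeScript roughly as follows. Only the names AMQPConnectorV1 and AMQPConnectorV2 come from this project. The AMQPConnector interface, the FlagReader type, the FlaggedConnector wrapper, and the in-memory sent arrays are hypothetical stand-ins for illustration, not the real package's API.

```typescript
// Hypothetical sketch. Only AMQPConnectorV1 / AMQPConnectorV2 are names from
// the article; every other name here is assumed for illustration.
type ConnectorVersion = "v1" | "v2";

interface AMQPConnector {
  readonly targetBroker: string;
  publish(queue: string, body: string): Promise<void>;
}

// Stand-in for the export that targets RabbitMQ 3.8. A real connector would
// wrap the broker SDK; this one only records what it was asked to send.
class AMQPConnectorV1 implements AMQPConnector {
  readonly targetBroker: string = "RabbitMQ 3.8";
  readonly sent: string[] = [];
  async publish(queue: string, body: string): Promise<void> {
    this.sent.push(`${queue}:${body}`);
  }
}

// Stand-in for the export that targets RabbitMQ 4.1.
class AMQPConnectorV2 implements AMQPConnector {
  readonly targetBroker: string = "RabbitMQ 4.1";
  readonly sent: string[] = [];
  async publish(queue: string, body: string): Promise<void> {
    this.sent.push(`${queue}:${body}`);
  }
}

// The flag lookup is injected, so no particular flag provider is assumed.
type FlagReader = (service: string) => ConnectorVersion;

class FlaggedConnector implements AMQPConnector {
  private readonly service: string;
  private readonly readFlag: FlagReader;
  private readonly v1: AMQPConnector;
  private readonly v2: AMQPConnector;

  constructor(service: string, readFlag: FlagReader, v1: AMQPConnector, v2: AMQPConnector) {
    this.service = service;
    this.readFlag = readFlag;
    this.v1 = v1;
    this.v2 = v2;
  }

  // Resolved on every call, so a flag flip takes effect without a redeploy.
  private active(): AMQPConnector {
    return this.readFlag(this.service) === "v2" ? this.v2 : this.v1;
  }

  get targetBroker(): string {
    return this.active().targetBroker;
  }

  publish(queue: string, body: string): Promise<void> {
    return this.active().publish(queue, body);
  }
}

// Usage: flipping the flag is the migration, flipping it back is the rollback.
const flags = new Map<string, ConnectorVersion>([["arya", "v1"]]);
const readFlag: FlagReader = (service) => flags.get(service) ?? "v1";
const connector = new FlaggedConnector("arya", readFlag, new AMQPConnectorV1(), new AMQPConnectorV2());
flags.set("arya", "v2"); // connector now routes through V2
flags.set("arya", "v1"); // rollback, no redeploy
```

A real wrapper would also have to manage connections and channels for both versions and decide what happens to in-flight messages during a flip. That lifecycle work is deliberately left out of this sketch.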


What the Migration Found

Before any of this went to production, we ran integration tests against a live RabbitMQ 4.1 cluster. Not mocked. An actual cluster.

One test failed in a way that wasn't about the new connector at all. There was a legacy touchpoint — an integration that had its own non-standard way of initiating an RMQ connection, different from the standard connector pattern used everywhere else. That method broke against 4.1. Not because of the SDK changes, but because the initiation sequence it relied on wasn't valid anymore.

When we went to fix it, we found the touchpoint wasn't in active use. No live traffic. No service depending on it. It had been sitting quietly in the codebase as dead code.

The migration became the forcing function to formally retire it. We removed it, updated the documentation, and closed the loop.

This is something migrations do that routine maintenance doesn't: they create pressure to confront code that's been ignored because nothing was actively failing. The upgrade didn't just change versions. It cleaned up a debt that would have otherwise waited indefinitely.


Observability Before Traffic

The Grafana dashboard went up before we migrated any service. The Prometheus alerts were configured before any production traffic switched.

This is deliberate. If you wait until go-live to set up observability, you're going into a high-stakes moment without a baseline. You won't know what normal looks like. Every metric will be uninterpretable because you have nothing to compare it to.

The "RabbitMQ 4.1 Migration Dashboard" covers six rows:

  1. Cluster Health Overview — running nodes, total queues, connections, channels, memory used, disk free. The snapshot before go-live: 6 nodes, 31 queues, 3 connections, 24 channels, 804 MB memory, 83.9 GB disk free.
  2. Global Messages — aggregate message volume across the cluster.
  3. Message Publish Rate — per-queue, per-node.
  4. Message Consume Rate — per-queue, per-node.
  5. Message Confirm Rate — tracks broker confirmations.
  6. Message Ack Rate — tracks consumer acknowledgements.

The dashboard filters on release, vhost, node, queue, and queue type. The queue legend covers KOGNITIV_SCS and the full set of ark.migration.* queues.

We had full cluster visibility before we flipped the first feature flag. The go-live decision was made with a live dashboard already showing healthy baseline numbers, not with fingers crossed.
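To make the value of a baseline concrete, here is a minimal TypeScript sketch that compares a later reading against the pre-go-live snapshot quoted above. The snapshot numbers are from the dashboard. The ClusterSnapshot field names, the driftFromBaseline helper, the 10% tolerance, and the afterFlip reading are assumptions made up for illustration.

```typescript
// Baseline values are the pre-go-live snapshot quoted in the article.
// Field names, the helper, and the tolerance are assumed for illustration.
interface ClusterSnapshot {
  runningNodes: number;
  totalQueues: number;
  connections: number;
  channels: number;
  memoryUsedMb: number;
  diskFreeGb: number;
}

const baseline: ClusterSnapshot = {
  runningNodes: 6,
  totalQueues: 31,
  connections: 3,
  channels: 24,
  memoryUsedMb: 804,
  diskFreeGb: 83.9,
};

// Returns the metrics whose relative change from the baseline exceeds `tolerance`.
function driftFromBaseline(
  base: ClusterSnapshot,
  current: ClusterSnapshot,
  tolerance: number,
): string[] {
  const metrics = Object.keys(base) as (keyof ClusterSnapshot)[];
  return metrics.filter((metric) => {
    const before = base[metric];
    const after = current[metric];
    if (before === 0) return after !== 0; // avoid dividing by zero
    return Math.abs(after - before) / before > tolerance;
  });
}

// Invented reading: one node drops out and memory rises slightly.
const afterFlip: ClusterSnapshot = { ...baseline, runningNodes: 5, memoryUsedMb: 820 };
const drifted = driftFromBaseline(baseline, afterFlip, 0.1);
// Only "runningNodes" is reported: 1/6 is about 17%, while 16/804 is about 2%.
```

Without the baseline object there is nothing to pass as the first argument, which is the point of putting the dashboard up first.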

Prometheus alert rules were deployed on the same timeline. Any anomaly in publish rate, consume rate, or queue depth would page before it became a user-visible problem.
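The actual alert rules are not reproduced in this post. As a rough illustration of the kind of condition involved, the TypeScript sketch below flags a growing backlog when publishes outpace acks for several consecutive samples, similar in spirit to a rule that has to hold for a while before it fires. The RateSample shape, the backlogGrowing helper, and the window size are all hypothetical.

```typescript
// Hypothetical illustration of an alert condition; not the deployed rules.
interface RateSample {
  publishPerSec: number;
  ackPerSec: number;
}

// True when publish rate exceeded ack rate in each of the last `windowSize`
// samples, i.e. the queue has been filling faster than it drains.
function backlogGrowing(samples: RateSample[], windowSize: number): boolean {
  if (windowSize <= 0 || samples.length < windowSize) return false;
  return samples
    .slice(-windowSize)
    .every((sample) => sample.publishPerSec > sample.ackPerSec);
}

// Invented readings for the usage example.
const steady: RateSample[] = [
  { publishPerSec: 120, ackPerSec: 121 },
  { publishPerSec: 118, ackPerSec: 118 },
  { publishPerSec: 125, ackPerSec: 124 },
];
const falling: RateSample[] = [
  { publishPerSec: 120, ackPerSec: 110 },
  { publishPerSec: 125, ackPerSec: 100 },
  { publishPerSec: 130, ackPerSec: 90 },
];
const shouldPageSteady = backlogGrowing(steady, 3);
const shouldPageFalling = backlogGrowing(falling, 3);
```

Requiring several consecutive samples trades a little detection latency for fewer pages on momentary spikes.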


Where AI Did the Work

This migration was AI-led across every phase, not just in patches.

The Prometheus alert configuration and Grafana dashboard design were AI-assisted — from the metric selection to the panel layout to the threshold tuning. The AMQPConnectorV2 implementation was written with AI tooling. The test suite — both the unit tests hitting ≥95% coverage on AMQPConnectorV2 and the integration test harness targeting the live cluster — was built AI-first.

The five initiative documents that structured the work were also drafted with AI support:

  1. External library upgrade strategy — SDK diff, breaking changes, migration path
  2. Internal SDK strategy — AMQPConnectorV2 design and the V1/V2 coexistence rationale
  3. Consumer application migration guides — per-service runbooks for nfs-node-api, arya, cap-creatives-api
  4. Testing and monitoring strategy — coverage requirements, integration test scope, observability setup
  5. Production readiness checklist — go/no-go criteria, stakeholder sign-off process

Five documents is a lot for a version bump. But the value wasn't just for the go-live. The documents are what you hand to the next team that needs to migrate a consumer service that doesn't exist yet, or to the engineer who picks this up six months from now and needs to understand why V1 still exists.

Using AI to compress this work didn't lower the quality of the output. It raised the ceiling on how much could get done by one person in one quarter. The test coverage, the documentation depth, the observability — any one of these would normally be a tradeoff you make against the others because there isn't enough time. This time, there was.


What we gave up

Two connector versions in the codebase indefinitely. V1 stays alive until every service migrates — which means the package accumulates maintenance surface for both, and someone has to track which services are still on V1. If a team deprioritizes their migration, V1 debt doesn't disappear; it just sits there with a flag pointing at it.
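The tracking burden can at least be made mechanical. The TypeScript sketch below assumes the flag state can be read as a simple map from service to version. The service names are real; the FlagState shape and both helpers are hypothetical.

```typescript
// Service names are from the article; the flag shape and helpers are assumed.
type FlagValue = "v1" | "v2";
const migratingServices = ["nfs-node-api", "arya", "cap-creatives-api"] as const;
type ServiceName = (typeof migratingServices)[number];
type FlagState = Record<ServiceName, FlagValue>;

// Lists the services whose flag still points at the V1 connector.
function servicesStillOnV1(state: FlagState): ServiceName[] {
  return migratingServices.filter((service) => state[service] === "v1");
}

// V1 can only be deleted from the package once nothing routes through it.
function canRetireV1(state: FlagState): boolean {
  return servicesStillOnV1(state).length === 0;
}

// Example: only the first service has been flipped so far.
const midMigration: FlagState = {
  "nfs-node-api": "v2",
  arya: "v1",
  "cap-creatives-api": "v1",
};
const remaining = servicesStillOnV1(midMigration);
```

A check like this could run in CI or on a schedule, so the V1 debt stays visible instead of depending on someone remembering to look.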

The phased timeline is slower than a coordinated cutover would have been. A single weekend of coordinated work could have moved all three services at once. The aliased approach took the full quarter instead. That's the direct tradeoff: speed of migration vs. risk per migration step. We chose risk reduction.

Five initiative documents is high overhead for what is nominally a version bump. In a team with more bandwidth, some of that documentation work gets compressed. Here, the documentation was the migration plan — it was necessary — but it represents real engineering time that went into artifacts rather than code.


Go-Live

Clean. No production issues.

Not because we were lucky. Because by the time the first service went live on V2, the risk had been systematically removed. The connector was tested at ≥95% unit coverage. The integration tests had been run against the actual broker. The dead code had been retired. The dashboard was showing us a healthy cluster baseline. The alerts were configured to catch anything we didn't anticipate. The feature flag meant rollback was a single operation if anything went wrong.
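The conditions listed above amount to a go/no-go gate. As a sketch only, and not the actual production readiness checklist, they could be encoded like this in TypeScript. The ReadinessEvidence shape and the blockingItems helper are hypothetical; the 95% threshold is the coverage figure quoted in this post.

```typescript
// Hypothetical encoding of the conditions described in the article.
interface ReadinessEvidence {
  unitCoveragePercent: number;
  integrationTestsPassedOnLiveBroker: boolean;
  deadCodeRetired: boolean;
  dashboardBaselineHealthy: boolean;
  alertsConfigured: boolean;
  rollbackFlagVerified: boolean;
}

// Returns the reasons go-live should be blocked; an empty list means "go".
function blockingItems(evidence: ReadinessEvidence): string[] {
  const blockers: string[] = [];
  if (evidence.unitCoveragePercent < 95) blockers.push("unit coverage below 95%");
  if (!evidence.integrationTestsPassedOnLiveBroker) blockers.push("integration tests not green on live broker");
  if (!evidence.deadCodeRetired) blockers.push("legacy touchpoint not retired");
  if (!evidence.dashboardBaselineHealthy) blockers.push("no healthy dashboard baseline");
  if (!evidence.alertsConfigured) blockers.push("alerts not configured");
  if (!evidence.rollbackFlagVerified) blockers.push("rollback flag not verified");
  return blockers;
}

// Invented example of a fully prepared service.
const ready: ReadinessEvidence = {
  unitCoveragePercent: 96,
  integrationTestsPassedOnLiveBroker: true,
  deadCodeRetired: true,
  dashboardBaselineHealthy: true,
  alertsConfigured: true,
  rollbackFlagVerified: true,
};
const goDecision = blockingItems(ready).length === 0;
```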

The go-live was boring. That's exactly what we were building toward.

When infrastructure migrations are exciting at go-live, it's usually because the preparation wasn't thorough enough and you're solving problems in production. The goal is to make go-live the least interesting moment in the project — the point where you confirm that everything you already verified in controlled conditions holds in production.

It held.


What This Changes

The standard mental model for an infrastructure upgrade is: pick a date, coordinate the teams, do it all at once, manage the risk in real time.

The aliased dependency approach inverts that. You do the risky work — the design, the testing, the observability setup — before any production exposure, and you structure the migration so each individual move carries almost no risk on its own.

The lesson isn't specific to RabbitMQ. Any shared dependency upgrade has this property: the hardest part isn't the technical change, it's managing the blast radius. Whenever the blast radius is large, the right move is to find a way to shrink it — not to coordinate more people around a large blast.

Sometimes the architecture decision is a connector design. Sometimes it's a feature flag. Sometimes it's just writing the migration guide so each team can move independently instead of waiting for a synchronized cutover.

The work is designing the migration to be boring. Then executing it.