Harish Kumar
war-story · architect-mindset

One Day, One Dev, One Bot: Building CapBot at the Capillary AI Hackathon

Built a Slack-native JIRA bot solo in one day. Third prize. Here's what the deck says versus what the code actually does.

June 20, 2025 · 9 min

The constraint is the design brief.

One person. One day. Build something that works, not something that demoes. Those are very different things, and most hackathon teams get confused about which one they're trying to do.

I built CapBot alone for the Capillary internal hackathon today. Third prize. Here's what it actually is, what it actually does, and what I'm less certain about than the slide deck suggests.


The Problem

Our JIRA hygiene is a mess. I don't mean the tickets are poorly written — I mean the cost of every ticket is invisible until it compounds.

Duplicate tickets pile up because nobody searches before filing. Engineers stay in Slack but their JIRA context requires a browser switch, a login context, a filter rebuild. Bug resolution time averages 6–7 days — not because the bugs are hard, but because ticket lifecycle management is friction nobody budgets for. Small decisions accumulate: search is annoying so people don't search, creation is annoying so people don't create, triage takes a Confluence tab and a meeting.

The pitch was simple: bring JIRA into Slack, and make the intelligence do the dull parts.

Natural language queries instead of JQL. Duplicate detection before you file. Screenshot-to-ticket for bugs you catch mid-testing. The workflow stays in the channel where the work already happens.


What I Built

CapBot is a Slack bot that wraps a JIRA agent loop. The core interaction is: you describe something in plain English, and the bot decides what to do with it.

Four capabilities shipped today:

Natural language JIRA search. "Show me open P1 bugs assigned to me" gets routed through intent detection, translated into a search, and the results come back as a readable Slack message — not a raw JIRA dump.

Duplicate detection before ticket creation. When you start creating a ticket, the bot embeds your description, searches existing tickets via vector similarity, and surfaces potential duplicates. If it finds something close, you decide whether to file or link.

Image-to-ticket. Attach a screenshot in Slack. GPT-4o extracts the text, summarizes it, and the bot builds a ticket description. No manual transcription. Two LLM calls per image — extract, then summarize — then the output gets appended as an ADF table to the JIRA ticket body.

Sprint summaries. JIRA data → structured Slack report. Useful for standups, less useful for anything that requires nuance the data doesn't capture.
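For the image-to-ticket flow, appending extracted fields as an ADF table could look roughly like the sketch below. It is an illustration, not CapBot's code: `adfCell`, `buildAdfTable`, `buildAdfDocument`, and the two-column "Field / Value" layout are hypothetical. The node types (`doc`, `table`, `tableRow`, `tableHeader`, `tableCell`, `paragraph`, `text`) come from Atlassian Document Format.

```javascript
// Hypothetical helper: one table cell holding a single paragraph of text.
// ADF text nodes cannot be empty, so an empty value gets a placeholder.
function adfCell(type, value) {
  const text = String(value).length > 0 ? String(value) : "(empty)";
  return {
    type, // "tableHeader" or "tableCell"
    content: [{ type: "paragraph", content: [{ type: "text", text }] }],
  };
}

// Turns [label, value] pairs into an ADF table with a header row.
function buildAdfTable(rows) {
  const header = {
    type: "tableRow",
    content: [adfCell("tableHeader", "Field"), adfCell("tableHeader", "Value")],
  };
  const body = rows.map(([label, value]) => ({
    type: "tableRow",
    content: [adfCell("tableCell", label), adfCell("tableCell", value)],
  }));
  return { type: "table", content: [header, ...body] };
}

// Wraps the table in a top-level ADF document node.
function buildAdfDocument(rows) {
  return { type: "doc", version: 1, content: [buildAdfTable(rows)] };
}
```

The rows would be whatever the summarization call returns, for example `[["Error", "..."], ["Screen", "..."]]`. The resulting object is what would be sent as the ticket's rich-text body.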


The Constraints

One day, solo, live demo required. The bot had to actually work: no simulated responses, no recording. A Slack App takes days to get approved in any enterprise environment, so official OAuth was off the table. The AI calls had to stay cheap; at hackathon scale, blowing the cost model would undercut the business case entirely. And the architecture had to be presentable, not just functional: a coherent demo is itself part of the evaluation.

What I Considered

A rule-based bot — just keyword-match to JQL shortcuts — would be faster to build but defeat the point. The intelligence is what makes the product interesting. A webhook architecture from JIRA to Slack would have given me real-time push, but it requires a publicly accessible server, which means infrastructure setup on a one-day timeline. The chosen path — inbound Slack events + JIRA polling on demand — puts control on the bot's side and keeps the infra surface minimal. For intent detection, I considered one model for everything, but the cost difference between GPT-3.5-turbo (classification) and GPT-4o (multimodal extraction) justified the split. Classification doesn't need reasoning capability; it needs speed and cheapness.

The Architecture

The stack is Node.js with @slack/bolt handling Slack events. Everything routes through an AgentOrchestrator that does intent detection first — before anything else.

Intent detection runs on every message. gpt-3.5-turbo classifies the input into one of: CREATE_TICKET, SEARCH_JIRA, CLARIFICATION, HEALTH_REPORT, or a catch-all. That's a deliberate choice. GPT-3.5-turbo is fast and cheap at classification — it's not being asked to reason, just sort. The expensive model, GPT-4o, only runs on image extraction where multimodal capability is actually required.
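A minimal sketch of that routing shape, under stated assumptions: the four intent labels are the ones named above, but the catch-all name `UNKNOWN`, the task names, and the functions `parseIntent`, `modelForTask`, and `route` are hypothetical. The classifier call itself is left out, so the sketch starts from whatever string the model returned.

```javascript
const INTENTS = ["CREATE_TICKET", "SEARCH_JIRA", "CLARIFICATION", "HEALTH_REPORT"];
const FALLBACK_INTENT = "UNKNOWN"; // assumed name for the catch-all bucket

// Model choice by task: the cheap model sorts, the multimodal one reads images.
const MODEL_BY_TASK = {
  classify_intent: "gpt-3.5-turbo",
  extract_image: "gpt-4o",
};

function modelForTask(task) {
  const model = MODEL_BY_TASK[task];
  if (!model) throw new Error(`No model configured for task: ${task}`);
  return model;
}

// Normalises raw classifier output into a known label, else the catch-all.
function parseIntent(rawModelOutput) {
  const label = String(rawModelOutput).trim().toUpperCase();
  return INTENTS.includes(label) ? label : FALLBACK_INTENT;
}

// Dispatches to the handler for the intent; a missing handler falls through
// to the catch-all handler.
function route(rawModelOutput, handlers) {
  const intent = parseIntent(rawModelOutput);
  const handler = handlers[intent] || handlers[FALLBACK_INTENT];
  return handler(intent);
}
```

Validating the label against a fixed list means a chatty or malformed classifier reply lands in the catch-all instead of throwing.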

For a hackathon, that cost model matters. At 50 daily active users, intent detection costs $10–15/month. Switch everything to GPT-4-turbo and you're at $250–350/month. You need a reason to justify that jump, and "it feels smarter" isn't one.

The vector search is LanceDB — embedded, no separate infrastructure, cosine similarity on text-embedding-3-small embeddings at 1536 dimensions. JIRA tickets get synced on a configurable interval (default daily), embedded as title + description, and stored. Query embeddings run at search time.
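For readers unfamiliar with the metric, here is a plain-JavaScript sketch of cosine similarity and a top-k ranking over in-memory vectors. It is not the LanceDB API, which does this work internally. The names `cosineSimilarity`, `topK`, and the `{ key, vector }` ticket shape are assumptions for illustration.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Ranks tickets by similarity to the query vector and keeps the best k.
function topK(queryVector, tickets, k = 5) {
  return tickets
    .map((t) => ({ key: t.key, score: cosineSimilarity(queryVector, t.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Cosine distance, which is what a vector store typically reports, is `1 - similarity`, so a smaller distance means a closer match.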

Google Genkit handles AI orchestration — the plugin layer connecting the Genkit runtime to OpenAI models. Keeping the orchestration layer separate from direct SDK calls means swapping model providers later doesn't require rewriting the business logic.


What the Deck Says vs What the Code Does

This section exists because I wrote the deck, and slide decks simplify.

"Intelligent caching layer." What's implemented is an in-memory Map() with 30-minute TTL for conversation history and session state. There's no Redis, no LLM response cache, no JIRA API cache. The cost analysis document in the repo explicitly lists "Implement Redis caching for summaries" as the single highest-impact optimization — which means it's a known gap, not a completed feature.

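For scale, an in-memory Map with a 30-minute TTL is about this much code. The sketch below is illustrative rather than CapBot's implementation: the class name `TtlStore`, its methods, and the injectable clock are all assumptions.

```javascript
// Hypothetical TTL store: a Map whose entries expire after ttlMs.
class TtlStore {
  constructor(ttlMs = 30 * 60 * 1000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, so expiry can be tested without waiting
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns the value, or undefined if the key is missing or expired.
  // Expired entries are deleted lazily on read.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

A store like this holds session state for one process and is lost on restart, which is part of why it does not count as a caching layer for LLM responses or JIRA API calls.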
"Similarity threshold prevents duplicates." The 0.8 threshold is commented out in production:

// const minSimilarityScore = options.minScore || 0.8;
// if (options.minScore) { const minDistance = 1 - options.minScore; ... }

The bot returns the top-5 LanceDB results unconditionally. The precision gate I described in the pitch never shipped. You get results regardless of how distant they are.
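What the missing gate would do, as a hedged sketch: convert the minimum similarity into a maximum cosine distance (`1 - 0.8`, roughly `0.2`) and drop anything further away. The function name `filterBySimilarity` and the `distance` field on each result are assumptions, not the names used in the repo.

```javascript
const DEFAULT_MIN_SCORE = 0.8; // the threshold quoted in the pitch

// Keeps only results whose cosine distance is within 1 - minScore,
// and attaches a similarity score for display.
function filterBySimilarity(results, minScore = DEFAULT_MIN_SCORE) {
  const maxDistance = 1 - minScore;
  return results
    .filter((r) => r.distance <= maxDistance)
    .map((r) => ({ ...r, score: 1 - r.distance }));
}
```

With the gate in place, a query that has no close match returns an empty list instead of five distant tickets, and the bot can say "no likely duplicates" rather than leaving that call to the user.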

"2s / 3s / 7s response times." There's no timing instrumentation in the codebase. No console.time(), no Date.now() diffs, no benchmark files. Those numbers are informal observations from manual testing during the build. They may be directionally accurate; they are not profiled measurements.

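Closing the timing gap does not take much. The sketch below shows one possible shape, assuming hypothetical helpers `startTimer` and `percentile` and an in-memory array as the sink. It relies on the global `performance.now()` available in current Node.js versions.

```javascript
// Starts a timer and returns a stop() function that records the elapsed
// milliseconds into the sink and returns them.
function startTimer(label, sink, now = () => performance.now()) {
  const start = now();
  return function stop() {
    const ms = now() - start;
    sink.push({ label, ms });
    return ms;
  };
}

// Nearest-rank percentile over a list of millisecond samples.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}
```

Wrapping each handler in `const stop = startTimer("search", timings)` and calling `stop()` when the reply is sent would turn the informal "2s / 3s / 7s" into p50 and p95 figures per operation.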
Team credit on the deck. Git history is unambiguous: every commit, from initial scaffolding through the final multimodal image feature, is by a single author. I built this alone.

Decks require confidence. Code requires honesty. Where they diverge, the code is the record.


CursorRIPER: Building with Structure Under Pressure

This is the part I find most worth documenting.

CapBot wasn't built by typing prompts into a chat window and copying the output. It was built using CursorRIPER — a formal framework for AI-assisted development with five declared modes the AI must operate within sequentially:

RESEARCH → INNOVATE → PLAN → EXECUTE → REVIEW

The constraint: the AI cannot write code during RESEARCH or PLAN phases. Cannot make architectural changes during EXECUTE. Every response begins with a declared mode. The AI isn't allowed to skip ahead.

That sounds bureaucratic. Under hackathon pressure — where the instinct is to start typing and figure it out as you go — it's actually load-bearing.
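The mode rules are enforced through prompt conventions, not code, but they can be written down as a small lookup table to make the constraint concrete. This is an illustrative sketch only: the action names `WRITE_CODE` and `CHANGE_ARCHITECTURE` and the functions `isAllowed` and `nextMode` are hypothetical, and only the restrictions stated above are encoded.

```javascript
const MODES = ["RESEARCH", "INNOVATE", "PLAN", "EXECUTE", "REVIEW"];

// Only the restrictions described in the text; other combinations are
// left unrestricted here because the post does not specify them.
const FORBIDDEN = {
  RESEARCH: ["WRITE_CODE"],
  PLAN: ["WRITE_CODE"],
  EXECUTE: ["CHANGE_ARCHITECTURE"],
};

function isAllowed(mode, action) {
  if (!MODES.includes(mode)) throw new Error(`Unknown mode: ${mode}`);
  return !(FORBIDDEN[mode] || []).includes(action);
}

// Modes run in sequence; returns null after the final mode.
function nextMode(mode) {
  const i = MODES.indexOf(mode);
  if (i === -1) throw new Error(`Unknown mode: ${mode}`);
  return i + 1 < MODES.length ? MODES[i + 1] : null;
}
```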

The repo has a .cursor/rules/state.mdc file that tracks machine state. The INITIALIZATION_DATE is when I first set up the CursorRIPER framework configuration on my machine — that predates this hackathon by about a year. It's a framework bootstrap timestamp, not the CapBot build date.

PROJECT_PHASE: "DEVELOPMENT"
RIPER_CURRENT_MODE: "EXECUTE"
INITIALIZATION_DATE: "2024-07-26T12:50:00Z"  # CursorRIPER framework bootstrap date, not the hackathon date

There's a memory-bank/ directory with five persistent context files: projectbrief.md (the product spec, written to the AI), systemPatterns.md (architectural conventions the AI follows), techContext.md (stack decisions), activeContext.md (current focus), progress.md (done and next).

The commit history shows the framework in operation:

PLAN: Complete detailed planning for core infrastructure
EXECUTE: Implement core pipeline and storage system
EXECUTE: Implement BotController, update index.js
EXECUTE: Implement EventRouter and MessageHandlers

The model isn't free-ranging across the codebase during implementation. It's operating with context it was given in structured form, following conventions it agreed to upfront.

What this bought me in a single day: the AI didn't drift. When I came back to a context after switching focus, the memory-bank/ files gave it back the state it needed. Decisions made in PLAN phase propagated into EXECUTE phase without me restating them. The architectural choices stayed consistent across eight hours of development.

I don't know yet whether this methodology scales to a longer project with more complexity. What I know is that under the pressure of a one-day constraint, having a structured process — even one enforced mostly through prompt conventions — prevented the kind of accumulated confusion that kills solo hackathon builds.


Third Prize

Third is not first. It's also not nothing.

I don't know exactly what the judges weighted. The bot works — the demo ran live, queries returned, a ticket was created from a screenshot. The gap between third and first is probably the difference between something that functions and something that feels finished.

What I notice is that the things I left undone are all in the same category: the precision layer. The similarity threshold gate. The caching layer. The response time instrumentation. All of these are about tightening the loop between "works" and "works reliably," and none of them shipped.

That's a useful data point. Hackathon constraints force you to find out which decisions you defer under pressure. I deferred precision. I shipped functionality. You could argue that's the right call for a demo. You could also argue it's the gap between a proof of concept and a production system.


What's Real

The architecture is sound. Intent routing, vector search, model selection by task — these aren't hackathon shortcuts, they're defensible design decisions that would hold in a production environment.

The CursorRIPER methodology worked. The codebase is coherent in a way that a solo eight-hour build often isn't. The memory-bank/ files are readable project documentation. The commit history is legible.

The business case is projected, not measured. "95% reduction in resolution time" is a reasonable goal if adoption follows. It is not what happened today.

What I actually shipped: a working Slack bot that can query JIRA in plain English, detect potential duplicate tickets before creation, and extract structured information from screenshots. The plumbing is there. The precision isn't.

The Redis caching, the similarity threshold, the timing instrumentation — those are the next layer. They're documented. They're known gaps. I'll close them or I won't, and whether CapBot goes beyond proof of concept will probably depend on whether the problem it solves is real enough to the people who felt it in the demo.

I think it is. That's the bet.


What Changed

The CursorRIPER section is the part I keep returning to. Not the vector search, not the cost model, but the part where I had a working system at end of day and could trace every decision back to a declared mode in the RESEARCH → INNOVATE → PLAN → EXECUTE → REVIEW sequence.

What that taught me: under time pressure, structure isn't overhead. It's load-bearing. The engineers I've seen produce poor work in hackathons don't lack ability; they lack a stopping mechanism. They start building before they've named the problem. CursorRIPER is a stopping mechanism. The RESEARCH and PLAN phases won't let you write code. The EXECUTE phase won't let you make architectural changes. You're forced to think before you act, in a context where every instinct says to start typing.

What I'd do differently: ship the precision layer first, not last. The similarity threshold, the caching, the timing instrumentation — these are what separate a proof of concept from a thing people trust. Functionality ships demos. Precision ships tools. I know which one I deferred under pressure. Next time, I'll defer the other one.