The Autonomous Feedback Loop: From Bug Report to Merged PR

Quick Take

An in-app feedback widget that classifies, sanitizes, and dispatches user reports to a sandboxed coding agent. The agent opens a pull request; a human merges it; the user gets a notification when the fix ships. Humans intervene at exactly two points. Everything else runs on cron.

The redactor at the trust boundary is the most important file in the system. Everything downstream gets to be reckless because the input has already been sanitized.
Two passes: a cheap read-only summary always runs; the expensive write-mode agent only runs for triage-eligible categories under safety rails.
Four independent safety rails compose: per-user submission quota, daily dispatch cap, concurrency cap, and a master kill switch. Any one stops the bleed.
The agent never merges its own PR. Branch protection plus a human merge button is the safety valve at the end.

Most product feedback dies in a backlog. A user hits a snag, fills out a form, the message lands in a queue, and a human eventually triages it, opens a ticket, drags it into a sprint, writes the code, reviews the PR, and ships it. If any one of those steps gets dropped, the loop quietly goes stale. The user never hears back. The bug never gets fixed. Next time, the reporter doesn't bother reporting at all.

The pipeline below closes the loop differently. Once feedback enters the system, it is sanitized, classified, and handed to a coding agent working in a sandboxed branch. The agent opens a pull request, and the system notifies the user the moment that PR merges. Humans stay in the loop at exactly two points: approving non-obvious work before it dispatches, and pressing the merge button on the PR. Everything else runs on cron.

I wired this into one of the apps from the enterprise app vibecode recipe. It is not a turnkey product; it is a pattern. The pattern is what travels.

The shape of the loop in one frame

Six stages, in order:

Capture. An in-app dialog posts to a feedback API, contextualized with the page URL the user was on when they hit Send. A per-user rolling 24-hour quota throttles abuse at the very first hop.
Sanitize. The submission runs through a deterministic redactor that rewrites emails, phone numbers, identifiers, and money amounts into stable token strings. The raw text never leaves the database.
Classify. A dependency-free heuristic scores the redacted submission into one of a few buckets. Code bugs and code feature requests are auto-dispatch eligible. Opinions, questions, and unclear submissions route to a human.
Dispatch. A cron worker picks up the eligible queue under safety rails (concurrency cap, daily cap, kill switch) and hands each submission to a coding agent in an isolated sandbox with dev-only credentials.
Implement. The agent works in a branch named after the submission, runs the test suite, and opens a pull request.
Close the loop. When a human merges the PR, the user gets a notification linking back to their original report with one sentence describing what changed. If the PR is rejected, the submission stays open and an admin replies through the queue.

That is the entire shape. The rest of this piece is what happens at each boundary.

The trust boundary: redaction first, autonomy second

The single most important file in the system is the redactor. Everything downstream gets to be reckless because the input has already been sanitized.

The redactor is a deterministic, pure function. No network calls, no model inference, no surprises. It rewrites:

Emails into a placeholder token
Phone numbers, social security numbers, and card-shaped digit runs into placeholders
Identifier-shaped strings (UUIDs, internal keys, URL path IDs) into a generic ID token
Money amounts and percentages into amount and percent tokens
IP addresses and URLs containing PII into safe equivalents
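A minimal sketch of such a redactor in Python, assuming regex matching is good enough for a first pass. The token strings (`[EMAIL]`, `[PHONE]`, `[ID]` and so on) and every pattern below are illustrative assumptions, not the production rules; real PII detection needs more patterns and locale awareness. What the sketch does preserve is the shape: an ordered list of rules, applied in sequence, with no network calls and no model.

```python
import re

# Ordered (pattern, token) rules. Specific shapes run first so the broad
# phone and digit-run rules cannot half-consume a UUID, IP, or card number.
_RULES = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "[ID]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    # Card-shaped: 13 to 19 digits, optionally separated by spaces or dashes.
    (re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), "[CARD]"),
    (re.compile(r"(?<!\w)(?:\+\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?"), "[AMOUNT]"),
    (re.compile(r"\b\d+(?:\.\d+)?\s?%"), "[PERCENT]"),
    # Numeric IDs in URL paths: /orders/84213 -> /orders/[ID]
    (re.compile(r"(?<=/)\d{3,}\b"), "[ID]"),
]


def redact(text: str) -> str:
    """Pure and deterministic: same input, same output, no I/O."""
    for pattern, token in _RULES:
        text = pattern.sub(token, text)
    return text


sample = ("Email jane.doe@example.com or call (555) 123-4567 about order "
          "/orders/84213; I was charged $1,200.50, 12.5% more.")
print(redact(sample))
# Email [EMAIL] or call [PHONE] about order /orders/[ID]; I was charged [AMOUNT], [PERCENT] more.
```

Rule order is the design choice that matters most here: if the phone rule ran first, it could eat part of a card number and leave the remaining digits in the clear.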

The bias is intentionally conservative. False positives are fine; false negatives are not. An order code that the redactor mistakes for a phone number and tokenizes anyway is a non-issue. A phone number that slips through into a public PR body is a real incident.

The output of the redactor is the only description string the system sends to:

The coding agent's prompt
The pull request body, which is public-by-default on GitHub
Admin queue tooltips, which render the safe form by default and require an explicit click to reveal raw text

The raw user submission lives only in the database, behind admin RBAC. Every other surface gets the redacted version.
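One way to make that rule hard to break by accident is to give redacted text its own type, so outbound surfaces refuse a plain string. A minimal sketch, assuming Python with a static type checker such as mypy running in CI; the `Redacted` wrapper and the `sanitize`, `agent_prompt` and `pr_body` helpers are hypothetical names, and the one-line email rule stands in for the full redactor.

```python
import re
from dataclasses import dataclass

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Redacted:
    """Text that has already crossed the trust boundary."""
    text: str


def sanitize(raw: str) -> Redacted:
    # Stand-in for the full redactor; only emails here, for brevity.
    return Redacted(_EMAIL.sub("[EMAIL]", raw))


def _safe(value: Redacted) -> str:
    # Runtime backstop for untyped call sites that slip past the type checker.
    if not isinstance(value, Redacted):
        raise TypeError("outbound surfaces accept Redacted text only")
    return value.text


def agent_prompt(description: Redacted, read_summary: str) -> str:
    # read_summary comes from the read pass, which itself only saw redacted text.
    return f"User report (redacted):\n{_safe(description)}\n\nRead-pass summary:\n{read_summary}"


def pr_body(description: Redacted, submission_id: int) -> str:
    return f"Addresses feedback #{submission_id}.\n\n> {_safe(description)}"
```

The type checker flags a raw `str` passed where `Redacted` is expected, and the runtime check catches dynamic call sites. It does not stop someone from writing `Redacted(raw)` by hand; it only makes the unsafe path explicit enough to catch in review.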

This is the architectural decision that lets you put autonomy downstream of an untrusted input. Without the boundary, none of the rest of the pipeline can exist safely. With it, every downstream component is free to optimize for performance and simplicity instead of security.

Two passes: read first, write second

Most pipelines try to do everything in one pass. This one deliberately splits read from write.

The read pass. A read-only agent run that produces a short, structured summary of what the submission means, which surfaces it likely touches, and what impact category it falls in. It populates the admin queue UI and seeds the implementation prompt. It is cheap, safe, and always runs, regardless of how the heuristic triaged the submission.

The write pass. A full write-mode agent in its own branch. It makes the change, runs the test suite, opens the PR. This is the expensive one. It runs only for triage-eligible categories and only under the safety rails below.

Both passes share the same job table and the same state machine. The discriminator is which kind of pass it is, not whether the row is special.
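A sketch of that shared table in Python, assuming a simple four-state machine. The state names, the `Kind` enum, and the `advance` and `jobs_for` helpers are hypothetical; the point is that read and write jobs are rows of the same shape, distinguished only by `kind`, and both obey one transition table.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    READ = "read"    # cheap summary pass, always runs
    WRITE = "write"  # branch, tests, PR; gated by triage and the safety rails


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# One transition table for both kinds; `kind` never changes the rules.
ALLOWED: dict[State, set[State]] = {
    State.QUEUED: {State.RUNNING},
    State.RUNNING: {State.SUCCEEDED, State.FAILED},
    State.FAILED: {State.QUEUED},  # retry
    State.SUCCEEDED: set(),
}


@dataclass
class Job:
    submission_id: int
    kind: Kind
    state: State = State.QUEUED


def advance(job: Job, target: State) -> None:
    if target not in ALLOWED[job.state]:
        raise ValueError(f"{job.kind.value}: {job.state.value} -> {target.value} not allowed")
    job.state = target


def jobs_for(submission_id: int, auto_eligible: bool) -> list[Job]:
    # The read pass is unconditional; the write pass depends on triage.
    jobs = [Job(submission_id, Kind.READ)]
    if auto_eligible:
        jobs.append(Job(submission_id, Kind.WRITE))
    return jobs
```

A submission that triage holds for review gets only the read job up front; in this sketch, an admin approval would later append the write job for the same submission.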

The reason for splitting them: read-mode work is so much cheaper than write-mode work that running it on every submission, regardless of triage outcome, is essentially free. And the read summary is what makes the admin queue useful. A human looking at twelve unprocessed submissions can read twelve one-paragraph summaries in a minute and react. Without the summary, they read twelve raw user paragraphs, lose context between them, and stall.

The triage classifier itself is a pure heuristic. Keyword scoring on the title plus description runs in microseconds with no network call, so the API stays fast and the verdict is reproducible. An optional LLM second opinion is gated behind a feature flag; if it errors or times out, the heuristic verdict stands. The user-selected submission type (bug, feature, general) is a soft prior, not a verdict, so a row tagged "bug" but dominated by opinion-shaped text still routes to a human.
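A minimal sketch of such a classifier in Python. The bucket names, keyword lists, the 0.5 prior weight, and the rule that an LLM disagreement downgrades the verdict to `unclear` are all assumptions for illustration, not the production tuning. The `second_opinion` callable stands in for the feature-flagged LLM call and is assumed to raise (for example `TimeoutError`) when it errors or times out.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword lists; tune them against your own feedback history.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "code_bug": ("error", "crash", "crashes", "broken", "fails", "exception", "not working"),
    "code_feature": ("add", "support", "export", "option", "allow", "would like"),
    "opinion": ("ugly", "hate", "love", "prefer", "confusing", "annoying", "i think"),
    "question": ("how do", "how can", "is there", "why does", "where is"),
}
# Whole-word matching so "add" does not fire on "address".
_PATTERNS = {b: [re.compile(rf"(?<!\w){re.escape(k)}(?!\w)") for k in ks]
             for b, ks in KEYWORDS.items()}
PRIOR = {"bug": "code_bug", "feature": "code_feature"}  # "general" adds no prior
AUTO_ELIGIBLE = {"code_bug", "code_feature"}


def classify(title: str, description: str, user_type: Optional[str] = None,
             second_opinion: Optional[Callable[[str], str]] = None) -> str:
    text = f"{title} {description}".lower()
    scores = {b: float(sum(len(p.findall(text)) for p in ps)) for b, ps in _PATTERNS.items()}
    scores["question"] += text.count("?")
    if user_type in PRIOR:
        scores[PRIOR[user_type]] += 0.5  # soft prior: worth half a keyword hit
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    # Require at least one real hit and a strict lead; otherwise a human decides.
    verdict = best if top >= 1 and top > runner_up else "unclear"
    if second_opinion is not None and verdict in AUTO_ELIGIBLE:
        try:
            if second_opinion(text) != verdict:
                verdict = "unclear"  # disagreement routes to a human
        except Exception:
            pass  # error or timeout: the heuristic verdict stands
    return verdict
```

Because the prior is worth half a keyword hit, it only breaks ties: a report tagged `bug` whose text scores higher on opinion keywords still lands in `opinion`, and a bare `bug` tag with no supporting text stays below the threshold and comes out `unclear`.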

Where humans stay in the loop

There are exactly two human checkpoints in the entire flow.

Checkpoint one: approving non-obvious dispatches. Auto-dispatch is for the cases the heuristic is confident about. Anything ambiguous (opinions, questions, unclear submissions, anything where the title says "bug" but the body sounds like an opinion) is held in the queue with the read-pass summary attached. An admin reviews and either replies manually, edits the dispatch prompt, or sends it through to the agent.

Checkpoint two: merging the pull request. The agent never merges its own PR. Branch protection on main requires CI green and at least one human reviewer. This is the single safety valve at the end of the pipeline. Even if the heuristic, the redactor, the safety rails, and the agent all fail simultaneously, a human still has to click merge before any code reaches production.
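On GitHub, that rule lives in branch protection. A sketch in Python that only builds the request for the REST API's update-branch-protection endpoint (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`); the owner, repository and CI check name are placeholders, and sending it requires a token with admin rights on the repository. Check the field set against the current GitHub REST documentation before relying on it.

```python
import json


def branch_protection_request(owner: str, repo: str, ci_check: str,
                              branch: str = "main") -> tuple[str, str, str]:
    """Return (method, url, json_body) for GitHub's update-branch-protection call."""
    url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection"
    body = {
        # CI must be green, and the branch must be up to date before merging.
        "required_status_checks": {"strict": True, "contexts": [ci_check]},
        # Admins, including any admin-scoped bot token, cannot bypass the rules.
        "enforce_admins": True,
        # At least one approving review from a human other than the PR author.
        "required_pull_request_reviews": {
            "required_approving_review_count": 1,
            "dismiss_stale_reviews": True,
        },
        # No extra push restrictions in this sketch.
        "restrictions": None,
    }
    return "PUT", url, json.dumps(body)
```

`enforce_admins` matters more than it looks: without it, an admin-scoped token can merge around the review requirement. GitHub also does not let a pull request's author approve it, so a PR opened by the agent's account needs a review from someone else.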

Everything else (the user notification, the audit-log writes, the queue state transitions, the read-pass summaries) runs on cron and webhook handlers without human input.

The contract is simple and worth stating explicitly: humans approve non-obvious dispatches, and humans merge PRs. If a step in your pipeline does not belong to one of those two categories, it should not require a human.

The safety rails

Four independent levers govern the dispatcher. Any one of them stops the bleed; together they compose into defense in depth.

A master kill switch. Flip it and no new submissions get auto-queued. Submissions an admin has already approved manually still drain through, because deliberate human actions are never gated by this flag. The intent is to let an operator pause auto-dispatch instantly without breaking the manual workflow underneath.

A daily dispatch cap. The worker counts dispatches since UTC midnight against a configurable ceiling. Hit the cap and pending submissions wait for tomorrow. This is the primary cost ceiling; agent runs cost real money, and an agent loop without a daily cap is an agent loop with an unbounded credit-card bill.

A concurrency cap. No more than N agent runs in flight at once. Prevents a flood of submissions from spinning up fifty parallel agents on a quiet morning when the cron worker first picks up the backlog.

A per-user submission quota. A rolling 24-hour limit on how many submissions a single user can post. Enforced at the capture API, before anything else runs. This is the throttle that keeps a runaway script or a frustrated tester from flooding everything downstream.
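A minimal in-memory sketch of the rolling window in Python, assuming a single process and a hypothetical `SubmissionQuota` class; a real capture API would count the user's rows in the database over the last 24 hours instead. Unlike the daily dispatch cap, this window slides: a slot frees up exactly 24 hours after the submission that used it.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW = timedelta(hours=24)


class SubmissionQuota:
    """Rolling 24-hour per-user limit, checked before anything else runs."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._seen: dict[str, deque] = defaultdict(deque)

    def try_accept(self, user_id: str, now: Optional[datetime] = None) -> bool:
        if now is None:
            now = datetime.now(timezone.utc)
        seen = self._seen[user_id]
        # Drop submissions that have aged out of the rolling window.
        while seen and now - seen[0] >= WINDOW:
            seen.popleft()
        if len(seen) >= self.limit:
            return False  # reject at the capture API; nothing downstream runs
        seen.append(now)
        return True
```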

These knobs are independent. Auto-dispatch can be paused while the caps are wide, and the caps can be tight while auto-dispatch runs freely. The intent is that you can adjust pipeline behavior in real time, without restarting any service or redeploying code. When something looks off in the queue, reach for one of these four levers before you reach for an incident channel.
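A sketch of how the dispatcher might consult the three dispatch-side rails on each cron tick, in Python. The `Rails` record and `may_dispatch` function are hypothetical, and two policies are assumptions not stated above: manually approved submissions bypass only the kill switch, and they still count against the daily and concurrency caps. Reading `Rails` fresh from configuration on every tick is what lets an operator change them without a redeploy.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Rails:
    auto_dispatch_enabled: bool  # the master kill switch
    daily_cap: int               # max dispatches per UTC day
    concurrency_cap: int         # max agent runs in flight


def utc_midnight(now: datetime) -> datetime:
    return now.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def may_dispatch(rails: Rails, *, manually_approved: bool,
                 dispatched_at: list[datetime], in_flight: int,
                 now: datetime) -> tuple[bool, str]:
    """Return (allowed, reason). All datetimes must be timezone-aware."""
    # Kill switch: pauses auto-queued work only; human approvals still drain.
    if not rails.auto_dispatch_enabled and not manually_approved:
        return False, "auto-dispatch paused"
    # Daily cap: count dispatches since UTC midnight; the rest wait for tomorrow.
    since_midnight = sum(1 for t in dispatched_at if t >= utc_midnight(now))
    if since_midnight >= rails.daily_cap:
        return False, "daily cap reached"
    # Concurrency cap: leave the submission pending for the next cron tick.
    if in_flight >= rails.concurrency_cap:
        return False, "concurrency cap reached"
    return True, "ok"
```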

Idempotent everything

Every state transition checks the current state before writing.

Webhooks deliver at-least-once. The cron worker re-runs every few minutes. Agents retry on transient failure. If you write any of these handlers as "set state to X" without checking the current state first, you will eventually fight a bug where the same merged PR fires the user notification three times, the audit log gets duplicate entries, and the dashboard count drifts.

The pattern across every transition: read current state, compare to expected, write only if appropriate, log the no-op otherwise. Same pattern for the user notification, the audit-log write, the queue refresh. Once you adopt the pattern everywhere, the pipeline tolerates duplicate events as a matter of course.
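In SQL, the read, compare and write can collapse into a single conditional `UPDATE`, which also closes the race between two workers that both read the old state. A sketch with Python's built-in `sqlite3`; the table layout, the `pr_open` and `merged` state names, and the `notify` callback are hypothetical.

```python
import logging
import sqlite3
from typing import Callable

log = logging.getLogger("feedback.transitions")


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    return conn


def transition(conn: sqlite3.Connection, submission_id: int, expected: str, target: str) -> bool:
    """Compare-and-set: move to `target` only if the row is still in `expected`."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE submissions SET state = ? WHERE id = ? AND state = ?",
            (target, submission_id, expected),
        )
    if cur.rowcount == 0:
        # Duplicate delivery, retry, or cron re-run: log the no-op and move on.
        log.info("no-op: submission %s not in state %r", submission_id, expected)
        return False
    return True


def on_pr_merged(conn: sqlite3.Connection, submission_id: int,
                 notify: Callable[[int], None]) -> None:
    # Webhooks deliver at-least-once; only the delivery that wins the update notifies.
    if transition(conn, submission_id, "pr_open", "merged"):
        notify(submission_id)
```

Only the delivery whose `UPDATE` actually changes a row sends the notification; every duplicate becomes a logged no-op.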

Audit logging is its own discipline. Every state transition (submission created, triaged, dispatched, PR opened, merged, rejected, manually overridden) writes to a separate audit log table with redacted payloads. The audit data goes through the same sanitizer as the agent prompts, so even the internal trail does not preserve raw PII. This matters for two reasons: regulatory posture, and the ability to retroactively analyze how the pipeline is performing.
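A sketch of the audit write in Python, assuming the payload is JSON-shaped (dicts, lists, strings, numbers). The `audit_log` table, the `scrub` and `audit` helpers, and the one-rule `redact` stand-in are hypothetical; in practice `redact` would be the same function the agent prompts go through, so the audit trail and the prompts cannot drift apart.

```python
import json
import re
import sqlite3
from datetime import datetime, timezone
from typing import Any

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    # Stand-in for the shared redactor; only emails here, for brevity.
    return _EMAIL.sub("[EMAIL]", text)


def scrub(value: Any) -> Any:
    """Walk a JSON-shaped payload and redact every string, dict keys included."""
    if isinstance(value, str):
        return redact(value)
    if isinstance(value, dict):
        return {scrub(k): scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged


def audit(conn: sqlite3.Connection, submission_id: int, event: str, payload: dict) -> None:
    conn.execute(
        "INSERT INTO audit_log (submission_id, event, payload, at) VALUES (?, ?, ?, ?)",
        (submission_id, event, json.dumps(scrub(payload), sort_keys=True),
         datetime.now(timezone.utc).isoformat()),
    )
```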

Where this is not the answer

The pattern is not universal. Skip it if any of these are true.

You do not have a meaningful test suite. The agent's PR is only as trustworthy as the CI it has to pass. Without solid coverage on the surfaces the agent touches, you are merging the model's output on faith. That is not autonomy; that is gambling. Build the test suite first; the agent loop is the second project.

Your domain has no sandbox. If your app cannot run with synthetic data in a dev environment, the agent has no place to make changes safely. Spend the engineering time on the sandbox before you spend it on this loop. The agent should never have credentials that can touch production data, payment systems, or admin APIs.

Your bug volume is low. This pipeline carries fixed operational overhead. If you receive two bug reports a month, a human spending fifteen minutes triaging each one (thirty minutes a month in total) is faster and cheaper than building and maintaining an agent loop. The loop starts paying for itself only once you have a steady weekly stream of small, well-scoped reports.

Your reports are mostly opinions, not bugs. If your feedback channel is a request box for new features and product opinions, this pattern is mostly busywork. Triage routes them to humans anyway, and the agent never runs.

The pipeline is for shops with a steady stream of small, well-scoped bugs and feature requests, where the cost of triage is the bottleneck and the cost of a sandboxed model run is acceptable.

Bottom line

Most product feedback dies in a backlog because the path from "user reports it" to "engineer ships it" is too long, too lossy, and too dependent on a single human staying on top of a queue.

You can shorten the path. The redactor at the trust boundary is what makes the rest of the pipeline possible. The two-pass split (cheap read always, expensive write conditionally) is what makes it economical. The two-checkpoint human contract (approve non-obvious dispatches, merge the PR) is what makes it safe. The four safety rails are what make it operationally sane. The idempotent state transitions are what keep it from bleeding noise back at you.

What you get on the other side: most product feedback either closes within hours or arrives in the admin queue with enough triage and context that the human reply is a one-paragraph job. The backlog stops being a graveyard. It becomes a flow.

If you want the underlying stack this runs on, the enterprise app vibecode recipe is the architecture. If you want the security posture for the apps that handle user data like this in the first place, the M365 tenant lockdown runbook is the companion piece on identity and device hardening.

Related reading

How I Vibe-Coded Two Enterprise Apps in 8 Months for $13,945

The SMB Business Stack We Run: Microsoft-Centric, Remote-First, No Lock-In

The Microsoft 365 Security Floor Every SMB Should Hit in 2026