[Agents] Add guardrails to PR triage agent #2656
- \`gh pr view ${PR_NUMBER} --repo ${REPO} --json title,author,body,files,additions,deletions,baseRefName,headRefName\`
- \`gh pr diff ${PR_NUMBER} --repo ${REPO}\`
Can we run these commands beforehand, write the output to files, and tell the LLM where to find them? Then it no longer needs a GH_TOKEN and does not need permission to run the gh CLI at all.
Can we then also apply an allowlist to opencode, so that only the tools the skills need to complete the review are available?
You mean the GitHub Actions workflow runs these commands and writes the output to a file, rather than the LLM agent?
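A minimal sketch of the suggestion above, assuming the workflow runs the two `gh` commands itself before the agent starts and the agent only reads the resulting files. The function name `prefetch_pr_context` and the file names `pr_metadata.json` and `pr.diff` are hypothetical; only the `gh` invocations are taken from the prompt.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: fetch PR metadata and diff ahead of time so the agent
# never needs GH_TOKEN or access to the gh CLI.
# Arguments: PR number, repository (owner/name), output directory.
prefetch_pr_context() {
  local pr_number="$1" repo="$2" out_dir="$3"
  mkdir -p "$out_dir"

  # Same gh invocations as in the prompt, redirected to files.
  # The file names are assumptions made for this sketch.
  gh pr view "$pr_number" --repo "$repo" \
    --json title,author,body,files,additions,deletions,baseRefName,headRefName \
    > "$out_dir/pr_metadata.json"
  gh pr diff "$pr_number" --repo "$repo" > "$out_dir/pr.diff"
}
```

In a workflow step this might be called as `prefetch_pr_context "$PR_NUMBER" "$REPO" "$RUNNER_TEMP/pr-context"`, with `GH_TOKEN` exposed to that step only. The agent step would then receive just the directory path.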
Hi! We just realized that we haven't looked into this PR in a while. We're labeling this PR as If there is no activity on this PR within the next 2 weeks, it will be Thank you for your contribution!
Add prompt-injection guardrails to PR triage agent
Summary
The PR triage workflow fetches PR title, body, and diff at runtime via `gh` and passes them to the model with no adversarial framing, no structural separation of untrusted data, and no guidance that PR content is attacker-controlled. This PR adds prompt-level guardrails to harden the agent against injection attacks.

Changes

1. Adversarial framing in the system prompt (`.github/workflows/pr-triage.yml`)

Added a `## Security — prompt-injection guardrails` section to the prompt, immediately after `## Repository and PR`. It explicitly tells the model that PR content is untrusted, attacker-controlled data and instructs it to:

2. Envelope framing for untrusted tool output (`.github/workflows/pr-triage.yml`)

Rewrote the `## Tools` section to instruct the model to treat all `gh` output as untrusted data, mentally framing it within `<pr_metadata>` / `<pr_diff>` boundaries and reiterating that content within those boundaries may be adversarial.

3. Injection-awareness guidance in the skill (`.agents/skills/ecs-pr-triage/SKILL.md`)

Added a `## Prompt-injection awareness` section so the skill itself carries the guidance regardless of whether it's invoked from the CI workflow or interactively in Cursor.

4. Injection signal in the report template (`.agents/skills/ecs-pr-triage/report-template.md`)

Added a `**Prompt-injection signals:**` bullet under `### Risk notes` so the agent has a structured place to report any suspicious directives found in PR content.
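The PR asks the model to frame `gh` output mentally. If the PR content were instead available as a file, the same envelope could be made literal. The sketch below is one way that might look, not what this PR implements: `wrap_untrusted` is a hypothetical helper, and rewriting embedded closing tags to a bracketed form is an added assumption so that PR content cannot close the envelope early.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print a file's content inside a named envelope,
# e.g. <pr_diff> ... </pr_diff>.
# Arguments: tag name (letters and underscores only), path to the file.
wrap_untrusted() {
  local tag="$1" file="$2"
  printf '<%s>\n' "$tag"
  # Rewrite any closing tag inside the untrusted content to [/tag], so the
  # only real closing tag is the one printed below. awk also guarantees a
  # trailing newline after the last line of content.
  awk -v needle="</${tag}>" -v repl="[/${tag}]" \
    '{ gsub(needle, repl); print }' "$file"
  printf '</%s>\n' "$tag"
}
```

For example, `wrap_untrusted pr_diff pr.diff` would emit the diff between `<pr_diff>` and `</pr_diff>` lines. The prompt would still need to state that everything inside the envelope is untrusted data.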