PromptGuard: Practical framework for adversarial testing and runtime guardrails to secure LLM-powered applications against prompt injections
Build PromptGuard: adversarial testing and runtime guardrails to protect LLM apps from prompt injection attacks with clear patterns and code.
PromptGuard: Practical framework for adversarial testing and runtime guardrails to secure LLM-powered applications against prompt injections
Prompt injection is the real-world attack vector that breaks assumptions about how language models interpret and follow instructions. If your application combines user content, system prompts, and tools, a single malicious input can subvert the model and cause data leakage, unauthorized actions, or policy violations. PromptGuard is a practical, developer-focused framework you can implement today: combine adversarial testing (to find weaknesses) and runtime guardrails (to contain them).
This post is a compact, actionable blueprint: threat model, design principles, component map, an integration pattern with a code example, and a checklist you can use to ship safely.
Threat model: what prompt injections look like in production
- User-provided content containing instructions: “Ignore previous instructions and do X”.
- Output-based attacks: model responds with hidden control sequences or tool-invocation markers.
- Context poisoning: appended user prompts manipulate later requests by changing system instruction meaning.
- Tool / API escalation: model crafts inputs that trick tool clients into performing unsafe operations.
Adversaries can be noisy users or sophisticated red teams. PromptGuard assumes attackers can submit arbitrary text and may try to hide instructions in code blocks, data, or conversational history.
Design principles for PromptGuard
- Adversarial-first: build your defense by actively attacking your own system. If a technique can’t be found by testing, it will surface in production.
- Layered defenses: no single classifier solves everything. Combine detection, containment, and human review.
- Least privilege & capability gating: restrict which requests can call sensitive tools or access secrets.
- Explainable decisions: log why you blocked something; make policies auditable.
- Fail-safe human-in-the-loop: when confidence is low, degrade to safe responses or human review.
Components of PromptGuard
-
Adversarial testing suite
- Corpus of injection patterns and transformations (obfuscation, encoding, stegography-like variants).
- Mutation engine to generate variants (punctuation swaps, Unicode homoglyphs, nested quotes).
- Scoring harness that runs candidate attacks against your real prompt stack and evaluates outcomes.
-
Runtime guardrails
- Input validation and normalization: canonicalize Unicode, strip invisible characters, detect encodings.
- Instruction integrity checks: ensure system prompts remain authoritative and unchanged across hops.
- Adversarial classifier: lightweight model or heuristic that scores user inputs and intermediate outputs.
- Enforcement layer: reject, sanitize, constrain, or escalate based on policy.
- Capability gating: token-level or intent-level checks before allowing tool access.
-
Observability & feedback
- Logging of suspicious inputs, model decisions, and enforcement actions.
- Continuous learning: feed escaped vectors back to the test suite.
Adversarial testing: practical patterns
- Collect examples: real queries, known injection templates, OSINT. Start with a seed set of 100–200 examples.
- Mutate aggressively: automated transformations find evasions humans miss.
- Execute against the full prompt stack: system prompt + user input + tool instructions. Record whether the model followed the malicious directive or the normal flow.
- Score by impact, not just detection. A low-confidence probe that leads to secret exfiltration scores higher than one that just triggers an error.
Example attack patterns to include: direct instruction override, chained instructions inside data sections, obfuscated commands in code blocks, and attempts to coerce tool invocation.
Runtime guardrails: enforcement strategies
Enforcement should map classifier scores to actions. Common strategies:
- reject: refuse to process and return a safe error.
- sanitize: remove or neutralize suspicious fragments.
- constrain: run the request in a read-only or restricted context.
- human_review: escalate high-risk cases.
Policy examples can be small JSON objects. Use inline backticks and escaped braces for configs, for example { "action": "reject", "threshold": 0.8 }.
Integration pattern: middleware around your LLM client
Implement PromptGuard as a middleware or sidecar service that sits between clients and the model. Responsibilities:
- Normalize input and remove weird encodings.
- Run adversarial classifier on user text and intermediate outputs.
- Enforce capability gates (tool calls, secret access).
- Keep an immutable audit trail with hashed prompts and enforcement decisions.
Code example: minimal PromptGuard middleware (Python)
Below is a compact example of a middleware function you can adapt. It assumes you have two helper primitives: classify_adversarial(text) returning a score 0..1 and call_llm(prompt) that interacts with your LLM.
import hashlib
THRESHOLD = 0.75
def hash_prompt(text):
h = hashlib.sha256()
h.update(text.encode('utf-8'))
return h.hexdigest()
def promptguard_middleware(user_input, system_prompt, call_llm):
# 1) Normalize
cleaned = ' '.join(user_input.split())
# 2) Adversarial score for user input
score, category = classify_adversarial(cleaned)
# 3) Policy mapping
if score >= THRESHOLD:
# Reject or escalate depending on category
if category == 'high-risk':
return { 'status': 'blocked', 'reason': 'adversarial input detected' }
else:
# Sanitize and continue at a constrained level
cleaned = sanitize_text(cleaned)
# 4) Ensure system prompt integrity
system_hash = hash_prompt(system_prompt)
# store system_hash in audit log, assert unchanged in downstream calls
# 5) Execute the model call
response = call_llm(system_prompt + "\n" + cleaned)
# 6) Post-response check (model output can also be adversarial)
out_score, out_cat = classify_adversarial(response)
if out_score >= THRESHOLD:
# Soft-fail: suppress any tool invocation tokens
response = remove_tool_tokens(response)
log_enforcement('suppress_output', out_score, out_cat)
return { 'status': 'ok', 'response': response }
Notes: classify_adversarial can be a small fine-tuned model, a rule-based ensemble, or a call to a safety endpoint.
Policy enforcement patterns
- Reject by default for any request requesting to run tools or access secrets unless it has an allowlist. Keep an allowlist of trusted request patterns.
- Treat system prompts as immutable: compute and persist a hash as shown above and compare across service boundaries.
- Sanitize aggressively: remove embedded instructions like “ignore earlier” and strip out content after suspicious markers.
- Constrain model output: post-process to remove or neutralize tokens that your tool runner would interpret as commands.
Observability and feedback loop
Every enforcement decision must be logged with: hashed prompt, score, category, action taken, and a short human-readable rationale. Logs feed the adversarial test suite as new seeds. Over time your mutation engine should focus on evasions that bypassed earlier logic.
Example: capability gating for tools
- Annotate requests with required capabilities, e.g.,
call_tools: ['run_query']. - Before executing, validate the request’s provenance and the adversarial score.
- If the score gt;= threshold, deny tool access and return a safe message.
This ensures that even if the model emits a tool invocation in text, your orchestration layer will not execute it unless the request is approved.
Summary and deployment checklist
- Seed an adversarial corpus (100+ examples) and build a mutation engine.
- Implement a lightweight adversarial classifier; start rule-based then evolve to ML.
- Add a middleware/sidecar to validate inputs, enforce policies, and gate capability usage.
- Persist system prompt hashes and audit every enforcement decision.
- Log and feed bypasses back to the test harness; continuously red-team your production stack.
- Default to safe failures: reject or human-review rather than silently allowing risky operations.
Checklist (copyable):
- Run initial red-team suite against full prompt stack
- Normalize and canonicalize all user inputs
- Implement adversarial scoring and map scores to actions
- Gate tool access and secret retrieval behind explicit approvals
- Log immutable artifacts: prompt hashes, scores, actions
- Add human escalation for high-risk detections
PromptGuard is pragmatic: you don’t need perfect classifiers to reduce risk— you need repeatable tests, clear policy mappings, and a runtime enforcement layer that fails safely. Start small, automate red-teaming, and iterate.