Abstract shield protecting a neural network from malicious text prompts
PromptGuard: combining red-teaming and runtime policies to protect LLM-driven systems

PromptGuard: Practical framework for adversarial testing and runtime guardrails to secure LLM-powered applications against prompt injections

Build PromptGuard: adversarial testing and runtime guardrails to protect LLM apps from prompt injection attacks with clear patterns and code.

PromptGuard: Practical framework for adversarial testing and runtime guardrails to secure LLM-powered applications against prompt injections

Prompt injection is the real-world attack vector that breaks assumptions about how language models interpret and follow instructions. If your application combines user content, system prompts, and tools, a single malicious input can subvert the model and cause data leakage, unauthorized actions, or policy violations. PromptGuard is a practical, developer-focused framework you can implement today: combine adversarial testing (to find weaknesses) and runtime guardrails (to contain them).

This post is a compact, actionable blueprint: threat model, design principles, component map, an integration pattern with a code example, and a checklist you can use to ship safely.

Threat model: what prompt injections look like in production

Adversaries can be noisy users or sophisticated red teams. PromptGuard assumes attackers can submit arbitrary text and may try to hide instructions in code blocks, data, or conversational history.

Design principles for PromptGuard

  1. Adversarial-first: build your defense by actively attacking your own system. If a technique can’t be found by testing, it will surface in production.
  2. Layered defenses: no single classifier solves everything. Combine detection, containment, and human review.
  3. Least privilege & capability gating: restrict which requests can call sensitive tools or access secrets.
  4. Explainable decisions: log why you blocked something; make policies auditable.
  5. Fail-safe human-in-the-loop: when confidence is low, degrade to safe responses or human review.

Components of PromptGuard

Adversarial testing: practical patterns

Example attack patterns to include: direct instruction override, chained instructions inside data sections, obfuscated commands in code blocks, and attempts to coerce tool invocation.

Runtime guardrails: enforcement strategies

Enforcement should map classifier scores to actions. Common strategies:

Policy examples can be small JSON objects. Use inline backticks and escaped braces for configs, for example { "action": "reject", "threshold": 0.8 }.

Integration pattern: middleware around your LLM client

Implement PromptGuard as a middleware or sidecar service that sits between clients and the model. Responsibilities:

Code example: minimal PromptGuard middleware (Python)

Below is a compact example of a middleware function you can adapt. It assumes you have two helper primitives: classify_adversarial(text) returning a score 0..1 and call_llm(prompt) that interacts with your LLM.

import hashlib

THRESHOLD = 0.75

def hash_prompt(text):
    h = hashlib.sha256()
    h.update(text.encode('utf-8'))
    return h.hexdigest()

def promptguard_middleware(user_input, system_prompt, call_llm):
    # 1) Normalize
    cleaned = ' '.join(user_input.split())

    # 2) Adversarial score for user input
    score, category = classify_adversarial(cleaned)

    # 3) Policy mapping
    if score >= THRESHOLD:
        # Reject or escalate depending on category
        if category == 'high-risk':
            return { 'status': 'blocked', 'reason': 'adversarial input detected' }
        else:
            # Sanitize and continue at a constrained level
            cleaned = sanitize_text(cleaned)

    # 4) Ensure system prompt integrity
    system_hash = hash_prompt(system_prompt)
    # store system_hash in audit log, assert unchanged in downstream calls

    # 5) Execute the model call
    response = call_llm(system_prompt + "\n" + cleaned)

    # 6) Post-response check (model output can also be adversarial)
    out_score, out_cat = classify_adversarial(response)
    if out_score >= THRESHOLD:
        # Soft-fail: suppress any tool invocation tokens
        response = remove_tool_tokens(response)
        log_enforcement('suppress_output', out_score, out_cat)

    return { 'status': 'ok', 'response': response }

Notes: classify_adversarial can be a small fine-tuned model, a rule-based ensemble, or a call to a safety endpoint.

Policy enforcement patterns

Observability and feedback loop

Every enforcement decision must be logged with: hashed prompt, score, category, action taken, and a short human-readable rationale. Logs feed the adversarial test suite as new seeds. Over time your mutation engine should focus on evasions that bypassed earlier logic.

Example: capability gating for tools

This ensures that even if the model emits a tool invocation in text, your orchestration layer will not execute it unless the request is approved.

Summary and deployment checklist

Checklist (copyable):

PromptGuard is pragmatic: you don’t need perfect classifiers to reduce risk— you need repeatable tests, clear policy mappings, and a runtime enforcement layer that fails safely. Start small, automate red-teaming, and iterate.

Related

Get sharp weekly insights