Updated, operational prompt-engineering patterns for system prompts, chain-of-thought, structured output, and secure agents; includes GPT-5.4, IH‑Challenge, Anthropic XML guidance, and Gemini Flex/Priority notes.
Craft system prompts, few-shot examples, chain-of-thought strategies, and structured output schemas for production AI systems on OpenAI, Anthropic, and Google Gemini.
When to use
Writing or refining system prompts for chat applications
Designing few-shot examples that steer model behavior
Implementing chain-of-thought reasoning for complex tasks
Extracting structured data from unstructured inputs
Building evaluation datasets and regression tests for prompts
When NOT to use
The task is simple enough that default model behavior works (no prompt needed)
You need deterministic, rule-based logic — use code instead of prompts
The "prompt engineering" is really just API configuration (temperature, max_tokens, inference tier)
You're trying to make the model do something it fundamentally can't (real-time sensor feeds, external side-effects without an orchestrator)
The problem is better solved by fine-tuning or a custom model than prompt design
Core concepts
System prompt anatomy
┌─────────────────────────────────────────────┐
│ SYSTEM PROMPT │
├─────────────────────────────────────────────┤
│ 1. Role definition (who the model is) │
│ 2. Task description (what it should do) │
│ 3. Output format (how to structure results) │
│ 4. Constraints (what to avoid) │
│ 5. Examples (few-shot demonstrations) │
│ 6. Edge case handling (ambiguity rules) │
└─────────────────────────────────────────────┘
Important: modern LLMs are trained with an instruction hierarchy. OpenAI’s IH‑Challenge (Mar 10, 2026) and the Model Spec emphasize a clear priority ordering: System > Developer > User > Tool. Place safety‑critical and policy constraints in the system or developer layer so they remain highest priority and are resilient to lower‑priority inputs (including tool outputs and web content). See OpenAI IH‑Challenge for details: https://openai.com/index/instruction-hierarchy-challenge/ (Mar 10, 2026).
Model updates (note)
OpenAI released the GPT-5.4 family (including lower-latency mini/nano variants) used in agent deployments and enterprise runtimes in early 2026. When choosing a model, test both the full and compact variants for accuracy/latency tradeoffs. Source: OpenAI News (Apr 2026) and product announcements.
Few-shot: consistent format or tricky edge-cases; include diverse examples and order them intentionally (the last example often has outsized influence)
Chain-of-thought: multi-step reasoning or proofs — isolate thinking traces when possible and avoid leaking sensitive intermediate content
Self-consistency: sample multiple reasoning traces and take the consensus for high-stakes tasks
ReAct / tool-guided: when the model must propose actions (tool calls); implement an orchestrator to execute, verify, and sanitize tool outputs
Structured output: use schema validation (e.g., zod, JSON Schema) rather than relying on free-form parsing
Temperature guide
0: deterministic-style behavior for classification/extraction (note: greedy decoding still permits small variance)
0.3–0.5: balanced
0.7–1.0: creative
Workflow
Step 1: Define the task contract
Before writing any prompt, answer these questions:
type PromptContract = {
input: string; // What does the model receive?
output: string; // What should it produce?
format: string; // JSON, markdown, plain text?
constraints: string[]; // What must it avoid?
edgeCases: string[]; // How should it handle ambiguity?
examples: Array<{ input: string; output: string }>;
};
Step 2: Write the system prompt
Focus system-level text on what must always hold (safety, privacy, legal constraints). Put product-level preferences in developer instructions. Remember: models trained with instruction-hierarchy data are more likely to honor these distinctions.
(Example omitted here for brevity — keep a compact system prompt that lists role, task, output format, rules, and examples.)
Step 3: Implement structured output with OpenAI (Responses API)
Prefer the Responses API when you need agentic primitives (tools) or an orchestrator; OpenAI’s engineering posts describe using the Responses API together with a shell tool and hosted containers to execute sandboxed actions and keep execution auditable.
Treat tool outputs as lower-priority content in the instruction hierarchy: do not place tool-provided text into system messages without sanitization and verification. Validate and re-check tool outputs server-side before acting on them.
Use schema helpers (like zodResponseFormat) to produce machine-parseable responses and validate them server-side to guard against format drift.
Example (Responses + zod):
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
const FindingSchema = z.object({
severity: z.enum(["critical", "warning", "info"]),
line: z.number(),
message: z.string(),
suggestion: z.string(),
});
const ReviewSchema = z.object({
findings: z.array(FindingSchema),
});
const client = new OpenAI();
async function reviewCode(diff: string) {
const response = await client.responses.create({
// Use the current-generation family you validated for this task (example: gpt-5.4)
model: "gpt-5.4",
instructions: systemPrompt,
input: diff,
text: {
format: zodResponseFormat(ReviewSchema, "code_review"),
},
});
// Validate and parse on the server to guard against format drift
return ReviewSchema.parse(JSON.parse(response.output_text));
}
Notes:
Replace the model name with the specific variant you tested (gpt-5.4, gpt-5.4-mini, etc.). Compact variants reduce latency/cost but can change output fidelity.
Always run server-side validation and a fallback retry/clarification when format validation fails.
Prompt versioning & rollout (production pattern)
Adopt explicit versioning and staged rollout for system/developer prompts:
Keep prompts in version control alongside integration tests (prompts/ directory in the repo).
Use semantic versioning for prompts (e.g., prompt v1.2.0) and include a short changelog entry describing behavioral intent and test coverage.
Automate regression tests: unit-style prompts that assert final output, format compliance, and performance budgets (tokens, latency).
Canary and rollback: roll a new prompt to a small % of traffic, monitor key metrics (accuracy, format compliance, error rates), then promote or rollback automatically.
Runtime feature flags: allow switching between prompt versions at runtime and maintain per-version telemetry so you can compare behavior.
Rationale: treat prompts like code — the OpenAI Model Spec and iterative-deployment guidance encourage incremental rollout and monitoring of model/agent behavior.
Step 4: Implement with Anthropic
Anthropic’s official prompting guidance recommends explicit roles, clear structure, and — when helpful — XML-like tags to reduce ambiguity in multi-component prompts. Wrap examples and sections in tags such as <instructions>, <examples>, <example>, <context>, and <input> to improve format fidelity for Claude models. Source: Anthropic Prompting Best Practices (platform.claude.com).
Include 3–5 diverse examples wrapped in <examples> / <example> tags to improve format fidelity and edge-case handling for Claude models.
Step 5: Agents and orchestrators
When building agents, the model should only propose tool calls — an orchestrator must execute them in a sandbox and return sanitized outputs for the next step. OpenAI’s agent runtime examples demonstrate combining Responses API, a shell tool, and hosted containers to limit blast radius and keep execution auditable.
Design a tight execution loop: propose → execute → return result → propose next step. This reduces hallucination and gives you a place to enforce policy, rate limits, retries, and identity checks.
Treat tool outputs as untrusted: validate, sanitize, and re-check them against higher-priority system/developer instructions before using them to make decisions.
Prompt safety checklist (operational)
Minimize authority: give the agent the least privilege needed (allowlist endpoints, limit API scopes, and avoid broad filesystem or network write permissions).
Verification steps: require explicit verification for sensitive or irreversible actions (human approval, re-authentication, or second-factor confirmation).
Sanitize external content: canonicalize and filter inputs from web pages, emails, or tools; avoid passing raw HTML or scraped text into the model without normalization.
Constrain impact: default to read-only or simulated-run modes and require an explicit intent switch for irreversible operations.
Audit & short-lived credentials: log proposals and tool outputs, and use time-limited credentials for downstream calls so compromised outputs cannot be reused.
Chain-of-thought remains a powerful technique for multi-step reasoning, but it increases token costs and can reveal intermediate reasoning that might be sensitive.
OpenAI’s internal monitoring work shows chain-of-thought traces can be useful for detecting misalignment and failure modes. Make thinking traces auditable and optionally separable from final answers so you can store, inspect, or purge them according to privacy policy. (See OpenAI monitoring posts, Mar 2026.)
Prefer concise, auditable reasoning blocks and consider separating the "thinking" trace from the answer (e.g., <thinking> vs <answer> blocks) so you can store or purge traces depending on privacy needs.
Example template:
<thinking>
1. Restate the problem
2. Break into subproblems
3. Solve each subproblem with short steps
4. Verify constraints
</thinking>
<answer>
[Concise final answer here]
</answer>
(unchanged — use structured JSON schema, return null for missing fields, parse relative dates against provided reference date)
Example 3: Multi-turn agent instructions
(unchanged — agent system prompt that lists available tools, behavior loop, citation rules, and constraints; ensure orchestrator validates tool output)
Decision tree
(unchanged — retain mapping from problem type to prompt pattern: classification, extraction, reasoning, generation, agent/tool use)
Edge cases and gotchas (UPDATED)
Prompt injection: instruction-hierarchy training (IH‑Challenge) reduces some classes of prompt-injection attacks, but never assume immunity. Modern attacks increasingly resemble social engineering; architectural mitigations (least privilege, verification, sandboxing, short-lived tokens) remain vital. See OpenAI guidance: https://openai.com/index/designing-agents-to-resist-prompt-injection/ (Mar 11, 2026).
Tool outputs are lower-priority: do not treat tool-proposed text as authoritative commands. Validate before using.
Token budget & context: system prompts count against the context window — measure and compact context using retrieval, summarization, or the Responses API’s hosted workspace when you need file access.
Model drift and portability: prompts that work for one provider can fail on another. Test across providers and include provider-specific tags or examples when needed.
Few-shot ordering: the last example often has the most influence — order examples intentionally.
Negative instructions vs positive rules: prefer "Always do Y" over "Don't do X" where possible.
Output format compliance: models may drift — always validate parsed outputs server-side and fall back to a retry or clarification flow.
Temperature 0 is not true determinism: small variance can remain. For strict determinism use programmatic checks and validation.
Long prompts degrade: distill system prompts and use external context stores for large documents.
Anthropic XML tips: official Claude guidance recommends consistent descriptive tags (<instructions>, <examples>, <input>) and nesting when content has a natural hierarchy to improve parsing and format fidelity.
Gemini inference tiers (Flex vs Priority): use Flex for cost-sensitive, latency-tolerant workloads and Priority for interactive, user-facing workloads. Implement graceful-downgrade logic that retries or queues work to Flex/standard when Priority is unavailable and monitor rate limits and cost tradeoffs. Source: Google (Apr 2, 2026).
Evaluation criteria
Accuracy: % of outputs matching ground-truth labels
Format compliance: % of outputs parseable as the requested format
Consistency: variance across multiple identical requests (temperature 0)
Cost efficiency: tokens consumed per successful completion (include inference tier cost in estimates)
Latency: time-to-first-token and total generation time
Robustness: accuracy on adversarial / edge-case inputs
Provider portability: cross-provider test coverage and provider-specific validation
Concrete agent prompt-injection mitigations and architectural controls (OpenAI, Mar 11, 2026): link to OpenAI guidance on designing agents to resist prompt injection.
Recommendations for agent runtimes that combine the Responses API with sandboxed tools and hosted containers (OpenAI engineering posts, Mar 2026).
Anthropic official prompting guidance added: XML-like tags and example-structuring recommendations (platform.claude.com prompt engineering docs).
Scan OpenAI and Anthropic changelogs for model behavior changes that affect prompting (system prompt handling, structured-output schemas, reasoning-token limits). Check Google AI blog for Gemini prompting guidance. Update chain-of-thought templates, few-shot examples, and production prompt-versioning patterns.
Latest refresh trace
Reasoning steps, source results, and the diff that landed.
Apr 18, 2026 · 9:29 AM
triggerAutomation
editoropenai/gpt-5-mini
duration152.7s
statussuccess
sources discovered+1
Revision: v11
This update adds concrete guidance for new agent runtimes (model-native harnesses and native sandbox execution), chain-of-thought monitoring best practices, and operational rollout/testing patterns for prompt versions and compact model variants. It also includes Anthropic XML tagging recommendations and Gemini Flex/Priority inference-tier guidance.
Added: agent runtime and sandbox guidance (Agents SDK, Apr 15, 2026), chain-of-thought monitoring note (Mar 19, 2026), explicit test guidance for compact model variants; Updated: Edge cases and gotchas, Model updates block; Preserved: core workflow, examples, and structured-output recommendations.
Agent steps
Step 1Started scanning 12 sources.
Step 2OpenAI News: 12 fresh signals captured.
Step 3OpenAI Platform Changelog: No fresh signals found.
Step 4Anthropic News: 12 fresh signals captured.
Step 5Anthropic Docs Index: No fresh signals found.
Step 6Google AI Blog: 12 fresh signals captured.
Step 7Google AI Dev: 3 fresh signals captured.
Step 8Hugging Face Blog: 12 fresh signals captured.
Step 9OpenAI Model Spec: 12 fresh signals captured.
Step 10OpenAI Research: No fresh signals found.
Step 11Gemini API docs: 12 fresh signals captured.
Step 12Anthropic Prompting Best Practices: 12 fresh signals captured.
Step 13OpenAI Model Spec: No fresh signals found.
Step 14Agent is rewriting the skill body from the fetched source deltas.
Step 15Agent discovered 1 new source(s): OpenAI News (official blog).
Important: modern LLMs are trained with an instruction hierarchy. OpenAI’s IH‑Challenge (Mar 10, 2026) and the Model Spec emphasize a clear priority ordering: System > Developer > User > Tool. Place safety‑critical and policy constraints in the system or developer layer so they remain highest priority and are resilient to lower‑priority inputs (including tool outputs and web content). See OpenAI IH‑Challenge for details: https://openai.com/index/instruction-hierarchy-challenge/ (Mar 10, 2026).
### Model updates (note)
+
+- OpenAI released the GPT-5.4 family (including lower-latency mini/nano variants) used in agent deployments and enterprise runtimes in early 2026. When choosing a model, test both the full and compact variants for accuracy/latency/cost tradeoffs and include per-variant regression tests. Source: OpenAI News (Apr 2026): https://openai.com/index/gradient-labs and related announcements.
+
+- Agents SDK and model-native harnesses (Apr 15, 2026): OpenAI introduced a next-generation Agents SDK that includes native sandbox execution and a model-native harness. If you run agents, validate that your runtime supports sandboxing, short-lived credentials, and auditable execution traces. Do not assume behavior parity between an agent runtime and direct API calls—test both paths. Source: OpenAI News (Apr 15, 2026): https://openai.com/index/the-next-evolution-of-the-agents-sdk.
+
+- Chain-of-thought monitoring and misalignment detection (Mar 19, 2026): OpenAI published internal monitoring findings showing chain-of-thought traces can surface misalignment in coding agents. Where you collect thinking traces, make them auditable, subject to retention policies, and optionally detachable from user-facing answers for privacy and compliance. Source: OpenAI News (Mar 19, 2026): https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment.
−- OpenAIreleasedtheGPT-5.4family(includinglower-latencymini/nanovariants)usedinagentdeploymentsandenterpriseruntimesinearly2026.Whenchoosingamodel,testboththefull and compactvariantsforaccuracy/latencytradeoffs. Source: OpenAINews (Apr 2026)andproduct announcements.
+- Gemini inference tiers (Apr 2, 2026): Google introduced Flex (cost-optimized,higher-latency)andPriority(high-reliability)inferencetiersfortheGeminiAPI.Routebackgroundorbatch-likeworktoFlexandinteractiveuser-facingrequeststoPriority;implementgracefuldowngradelogic and telemetry that recordswhichtierservedeachrequest. Source: GoogleAIBlog (Apr 2,2026):https://blog.google/innovation-and-ai/technology/developers-tools/introducing-flex-and-priority-inference/.
−- Google introduced Flex and Priority inference tiers for the Gemini API (Apr 2, 2026). Use Flex for cost-sensitive/background workloads and Priority for interactive, user-facing workloads; implement graceful-downgrade logic and monitor rate limits. Source: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-flex-and-priority-inference/ (Apr 2, 2026).
### Prompt strategies
Research engine
OpenAI Prompt Engineering now combines 7 tracked sources with 1 trusted upstream skill packs. Instead of waiting on a single fixed link, it tracks canonical feeds, discovers new docs from index-like surfaces, and folds those deltas into sandbox-usable guidance.
OpenAI Prompt Engineering has unusually strong source quality and broad utility, so it deserves prominent placement.
Discovery process
1. Track canonical signals
Monitor 3 feed-like sources for release notes, changelog entries, and durable upstream deltas.
2. Discover net-new docs and leads
Scan 4 discovery-oriented sources such as docs indexes and sitemaps, then rank extracted links against explicit query hints instead of trusting nav order.
3. Transplant from trusted upstreams
Fold implementation patterns from OpenAI Docs so the skill inherits a real operating model instead of boilerplate prose.
4. Keep the sandbox honest
Ship prompts, MCP recommendations, and automation language that can actually be executed in Loop's sandbox instead of abstract advice theater.
+Summary: OpenAI Prompt Engineering was reviewed by the editor agent but no revision was applied.
+What changed: The agent analyzed signals but did not call revise_skill.
−Generated:2026-04-14T09:52:39.414Z
+Body changed:no
−Summary: This update aligns the prompt-engineering skill with recent 2026 signals: OpenAI's IH‑Challenge and agent-security guidance, the GPT-5.4 family, Anthropic's prompting best practices (XML tags), and Google Gemini's Flex/Priority inference tiers. It clarifies model selection, Responses API examples, Anthropic tags, and operational guardrails for agents.
−What changed: Added explicit GPT-5.4 guidance and model-variant notes; updated Responses API code sample to recommend using validated model variants; added Anthropic official XML-tag guidance and tracked recommendation; added Gemini Flex/Priority operational guidance and citations; reinforced instruction-hierarchy citations (IH‑Challenge) and agent prompt-injection mitigations.
−Body changed: yes
Editor: openai/gpt-5-mini
−Changed sections: Model updates (note), System prompt anatomy, Step 3: Implement structured output with OpenAI (Responses API), Step 4: Implement with Anthropic, Edge cases and gotchas (UPDATED), Research-backed changes included in this update
Experiments:
+- Re-run after the issue is resolved.
+- Add a higher-signal source.
−- Measurepromptportabilityandbehaviordrift across GPT-5.4 (and mini/nano variants) and Claude 4.6 with a shared test-suite of 200 prompts.
+- Checkgatewaycreditsorratelimits.
−- Bench the Gemini Flex vs Priority tiers on representative production workloads to quantify cost/latency tradeoffs and implement automatic-downgrade logic.
−- Instrument and A/B chain-of-thought storage policies (store, redact, ephemeral) to measure safety telemetry value vs privacy risk.
Signals:
- News (Anthropic News)
- Research (Anthropic News)
Update history8▶
2d ago4 sources
OpenAI Prompt Engineering was reviewed by the editor agent but no revision was applied.
4d ago4 sources
This update aligns the prompt-engineering skill with recent 2026 signals: OpenAI's IH‑Challenge and agent-security guidance, the GPT-5.4 family, Anthropic's prompting best practices (XML tags), and Google Gemini's Flex/Priority inference tiers. It clarifies model selection, Responses API examples, Anthropic tags, and operational guardrails for agents.
5d ago4 sources
Minor update: reinforced instruction-hierarchy guidance, added direct links to OpenAI prompt-injection guidance and Google Gemini inference-tier announcement, and clarified model/agent notes for Apr 2026.
Apr 11, 20264 sources
This update incorporates March–April 2026 research and engineering signals: OpenAI’s IH‑Challenge (instruction-hierarchy training), agent security guidance on resisting prompt injection, Responses API agent runtime patterns, and Google’s Gemini inference-tier guidance (Flex/Priority). It adds a concrete production pattern for prompt versioning & rollout, an operational prompt-safety checklist, and direct links to primary docs.
Apr 9, 20264 sources
This update aligns the skill with recent vendor guidance (OpenAI IH‑Challenge, agent hardening posts, and Google Gemini inference tiers). It adds operational mitigations for prompt injection, explicit Responses API agent-runtime guidance, recommendations for chain-of-thought monitoring and trace handling, and concrete guidance on Gemini's Flex/Priority tiers and how to set service_tier in client code.
Apr 7, 20264 sources
This update incorporates recent vendor signals: OpenAI’s instruction-hierarchy research (IH-Challenge), the Responses API agent runtime (shell & container workspace), Anthropic’s XML-style prompt structuring, and Google Gemini’s new inference tiers. Edits clarify instruction priorities, agent orchestration best practices, tool output handling, and model-provider specifics for production prompt design.
Apr 5, 20264 sources
OpenAI Prompt Engineering agent run was interrupted: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Pur
Apr 3, 20264 sources
OpenAI Prompt Engineering agent run was interrupted: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Pur