Practical prompt-engineering patterns for system prompts, few-shot design, chain-of-thought, structured outputs, and agent orchestration across OpenAI, Anthropic, and Google Gemini.
Craft system prompts, few-shot examples, chain-of-thought strategies, and structured output schemas for production AI systems on OpenAI, Anthropic, and Google Gemini.
When to use
Writing or refining system prompts for chat applications
Designing few-shot examples that steer model behavior
Implementing chain-of-thought reasoning for complex tasks
Extracting structured data from unstructured inputs
Building evaluation datasets and regression tests for prompts
When NOT to use
The task is simple enough that default model behavior works (no prompt needed)
You need deterministic, rule-based logic — use code instead of prompts
The "prompt engineering" is really just API configuration (temperature, max_tokens, inference tier)
You're trying to make the model do something it fundamentally can't (real-time sensor feeds, external side-effects without an orchestrator)
The problem is better solved by fine-tuning or a custom model than prompt design
Core concepts
System prompt anatomy
┌─────────────────────────────────────────────┐
│ SYSTEM PROMPT │
├─────────────────────────────────────────────┤
│ 1. Role definition (who the model is) │
│ 2. Task description (what it should do) │
│ 3. Output format (how to structure results) │
│ 4. Constraints (what to avoid) │
│ 5. Examples (few-shot demonstrations) │
│ 6. Edge case handling (ambiguity rules) │
└─────────────────────────────────────────────┘
Important: modern LLMs are trained with an instruction hierarchy. OpenAI’s IH-Challenge and model spec make explicit the priority ordering: System > Developer > User > Tool. Design system and developer instructions accordingly—put safety-critical and policy constraints in the system or developer layer so they are highest priority and resilient to lower-priority inputs (including tool outputs).
Prompt strategies
Zero-shot: simple tasks, low token cost
Few-shot: consistent format or tricky edge-cases; include diverse examples
Chain-of-thought: multi-step reasoning or proofs — use sparingly in high-cost paths
Self-consistency: sample multiple reasoning traces and take the consensus for high-stakes tasks
ReAct / tool-guided: when the model must propose actions (tool calls); implement an orchestrator to execute and verify tool outputs
Structured output: use schema validation (e.g., zod, JSON Schema) rather than relying on free-form parsing
Temperature guide
0: deterministic-style behavior for classification/extraction (note: greedy decoding still permits small variance)
0.3–0.5: balanced
0.7–1.0: creative
Workflow
Step 1: Define the task contract
Before writing any prompt, answer these questions:
type PromptContract = {
input: string; // What does the model receive?
output: string; // What should it produce?
format: string; // JSON, markdown, plain text?
constraints: string[]; // What must it avoid?
edgeCases: string[]; // How should it handle ambiguity?
examples: Array<{ input: string; output: string }>;
};
Step 2: Write the system prompt
Focus system-level text on what must always hold (safety, privacy, legal constraints). Put product-level preferences in developer instructions. Remember: models trained on instruction-hierarchy datasets are more likely to honor these distinctions.
const systemPrompt = `You are a senior code reviewer specializing in TypeScript and React.
## Task
Review the provided code diff and return structured feedback.
## Output format
Return a JSON array of findings:
[
{
"severity": "critical" | "warning" | "info",
"line": <number>,
"message": "<concise description>",
"suggestion": "<specific fix>"
}
]
## Rules
- Focus on bugs, security issues, and performance problems
- Do not comment on style preferences unless they cause bugs
- If the code is correct and well-written, return an empty array []
- Never suggest changes that would break existing tests
- Limit findings to the top 5 most important issues
## Examples
Input: \`const x = data.map(d => d.name)\`
Output:
[
{
"severity": "warning",
"line": 1,
"message": "No null check on data before .map()",
"suggestion": "Use optional chaining: data?.map(d => d.name) ?? []"
}
]`;
Step 3: Implement structured output with OpenAI (Responses API)
Prefer the Responses API when you need agentic primitives (tools) or an orchestrator loop; the Responses API supports custom tools and the shell/container environment for workflows where the model proposes actions and the platform executes them.
Treat tool outputs as lower-priority content: do not place tool-provided text into system messages without sanitization and verification (OpenAI’s instruction hierarchy treats tools as lower priority than system/developer instructions).
Use schema helpers (like zodResponseFormat) to produce machine-parseable responses and validate them server-side.
Example (OpenAI Responses + zod):
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
const FindingSchema = z.object({
severity: z.enum(["critical", "warning", "info"]),
line: z.number(),
message: z.string(),
suggestion: z.string(),
});
const ReviewSchema = z.object({
findings: z.array(FindingSchema),
});
const client = new OpenAI();
async function reviewCode(diff: string) {
const response = await client.responses.create({
model: "gpt-4o",
instructions: systemPrompt,
input: diff,
text: {
format: zodResponseFormat(ReviewSchema, "code_review"),
},
});
// Validate and parse on the server to guard against format drift
return ReviewSchema.parse(JSON.parse(response.output_text));
}
Step 4: Implement with Anthropic
Anthropic’s docs recommend explicit roles and XML-like tags to structure prompts. Use tags like <instructions>, <examples>, and <input> to reduce ambiguity in multi-component prompts.
Include 3–5 diverse examples wrapped in <examples> / <example> tags to improve format fidelity and edge-case handling for Claude models.
Example (Anthropic pattern):
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const system = `
<instructions>
You are a senior code reviewer specializing in TypeScript and React.
</instructions>
<examples>
<example>
<input>const x = data.map(d => d.name)</input>
<output>[{"severity":"warning","line":1,"message":"No null check","suggestion":"Use optional chaining"}]</output>
</example>
</examples>
`;
async function reviewCodeClaude(diff: string) {
const response = await client.messages.create({
model: "claude-opus-4-6",
max_tokens: 2048,
system,
messages: [{ role: "user", content: diff }],
});
const text = response.content
.filter(block => block.type === "text")
.map(block => block.text)
.join("");
return JSON.parse(text);
}
Step 5: Agents and orchestrators
When building agents, do not rely on the model to execute actions — the model should only propose tool calls. Use an orchestrator that executes the calls in a sandbox (e.g., Responses API + shell tool + hosted container workspace) and returns sanitized outputs for the next step.
Design a tight execution loop: propose → execute → return result → propose next step. This reduces hallucination and gives you a place to enforce policy, rate limits, and retries.
Treat tool outputs as untrusted: validate, sanitize, and re-check them against higher-priority system/developer instructions before using them to make decisions.
Chain-of-thought and reasoning
Chain-of-thought remains a powerful technique for multi-step reasoning, but it increases token costs and can reveal intermediate reasoning that might be sensitive.
OpenAI and other vendors use chain-of-thought monitoring internally to detect misalignment. Prefer concise, auditable reasoning blocks and consider separating the "thinking" trace from the answer (e.g., <thinking> vs <answer> blocks) so you can store or purge traces depending on privacy needs.
Example template:
<thinking>
1. Restate the problem
2. Break into subproblems
3. Solve each subproblem with short steps
4. Verify constraints
</thinking>
<answer>
[Concise final answer here]
</answer>
Examples
Example 1: Classification with few-shot
const classificationPrompt = `Classify the support ticket into exactly one category.
Categories: billing, technical, account, feature-request, other
## Examples
Ticket: "I was charged twice for my subscription this month"
Category: billing
Ticket: "The API returns 500 errors when I send more than 10 requests"
Category: technical
Ticket: "Can you add dark mode to the dashboard?"
Category: feature-request
Ticket: "I can't log in after resetting my password"
Category: account
## Rules
- Return only the category name, nothing else
- If genuinely ambiguous, choose the most actionable category
- "other" is the last resort — use it only when no category fits`;
Example 2: Data extraction with structured output
const extractionPrompt = `Extract structured event information from the text.
Return JSON matching this schema exactly:
{
"event_name": "string",
"date": "YYYY-MM-DD or null",
"time": "HH:MM or null",
"location": "string or null",
"attendees": ["string"],
"confidence": 0.0-1.0
}
## Rules
- If a field is not mentioned, use null (not a guess)
- Parse relative dates against today's date provided in the user message
- List only explicitly named attendees, not implied ones
- Confidence reflects how clearly the information was stated`;
Example 3: Multi-turn agent instructions
const agentSystemPrompt = `You are a research assistant that helps users
find and synthesize information.
## Available tools
- web_search(query: string) — search the web for information
- read_url(url: string) — read the content of a web page
- save_note(title: string, content: string) — save a research note
## Behavior
1. When the user asks a question, search for relevant sources first
2. Read the top 2-3 results to gather information
3. Synthesize findings into a concise answer with citations
4. Save important findings as notes for future reference
## Citation format
Use inline citations: "The API supports 100 req/s [1]"
List sources at the end:
[1] https://docs.example.com/rate-limits
## Constraints
- Never fabricate information — if you can't find it, say so
- Always cite sources for factual claims
- Prefer official documentation over blog posts
- If search returns no results, suggest alternative queries`;
Decision tree
What type of prompt do you need?
├── Classification / Routing
│ ├── < 5 categories → Zero-shot with category list
│ └── > 5 categories or subtle distinctions → Few-shot with examples
├── Data extraction
│ ├── Fixed schema → Structured output (JSON mode / zodResponseFormat)
│ └── Variable schema → Describe output format in prompt
├── Reasoning / Analysis
│ ├── Single-step → Zero-shot with clear instructions
│ └── Multi-step → Chain-of-thought with <thinking> blocks
├── Generation / Writing
│ ├── Consistent style → Few-shot with 3+ examples
│ └── Creative → Higher temperature, fewer constraints
└── Agent / Tool use
├── Simple tool routing → ReAct pattern in system prompt
└── Complex orchestration → Agent orchestration skill
Edge cases and gotchas (UPDATED)
Prompt injection: instruction-hierarchy training reduces some classes of prompt-injection attacks, but never assume immunity — keep safety-critical rules in system/developer messages and sanitize tool outputs.
Tool outputs are lower-priority: do not treat tool-proposed text as authoritative commands. Validate before using.
Token budget & context: system prompts count against the context window — measure and compact context using retrieval, summarization, or the Responses API’s hosted workspace when you need file access.
Model drift and portability: prompts that work for one provider can fail on another. Test across providers and include provider-specific tags or examples when needed.
Few-shot ordering: the last example often has the most influence — order examples intentionally.
Negative instructions vs positive rules: prefer "Always do Y" over "Don't do X" where possible.
Output format compliance: models may drift — always validate parsed outputs server-side and fall back to a retry or clarification flow.
Temperature 0 is not true determinism: small variance can remain. For strict determinism use programmatic checks and validation.
Long prompts degrade: distill system prompts and use external context stores for large documents.
Anthropic XML tips: wrap examples in <examples>/<example> and use descriptive tags (<instructions>, <input>, <output>) for more reliable parsing.
Gemini inference tiers: for high-throughput background tasks use cost-optimized tiers (Flex); for interactive user-facing actions use higher-reliability tiers (Priority). Choose an inference tier as part of your cost/latency planning.
Evaluation criteria
Accuracy: % of outputs matching ground-truth labels
Format compliance: % of outputs parseable as the requested format
Consistency: variance across multiple identical requests (temperature 0)
Cost efficiency: tokens consumed per successful completion (include inference tier cost in estimates)
Latency: time-to-first-token and total generation time
Robustness: accuracy on adversarial / edge-case inputs
Provider portability: cross-provider test coverage and provider-specific validation
Research-backed changes included in this update
Instruction hierarchy priority (System > Developer > User > Tool) and its practical implications (OpenAI IH-Challenge)
Guidance on building agents using Responses API, shell tool, and hosted container workspace (OpenAI engineering post)
Anthropic prompt structuring and XML-tag recommendations (Anthropic docs)
Gemini API inference tiers (Flex and Priority) and how to treat them in production design (Google AI blog)
Activity
ActiveDaily · 9:00 AM7 sources
Automation & run history
Automation status and run history. Only the owner can trigger runs or edit the schedule.
Scan OpenAI and Anthropic changelogs for model behavior changes that affect prompting (system prompt handling, structured-output schemas, reasoning-token limits). Check Google AI blog for Gemini prompting guidance. Update chain-of-thought templates, few-shot examples, and production prompt-versioning patterns.
Latest refresh trace
Reasoning steps, source results, and the diff that landed.
Apr 18, 2026 · 9:29 AM
triggerAutomation
editoropenai/gpt-5-mini
duration152.7s
statussuccess
sources discovered+1
Revision: v11
This update adds concrete guidance for new agent runtimes (model-native harnesses and native sandbox execution), chain-of-thought monitoring best practices, and operational rollout/testing patterns for prompt versions and compact model variants. It also includes Anthropic XML tagging recommendations and Gemini Flex/Priority inference-tier guidance.
Added: agent runtime and sandbox guidance (Agents SDK, Apr 15, 2026), chain-of-thought monitoring note (Mar 19, 2026), explicit test guidance for compact model variants; Updated: Edge cases and gotchas, Model updates block; Preserved: core workflow, examples, and structured-output recommendations.
Agent steps
Step 1Started scanning 12 sources.
Step 2OpenAI News: 12 fresh signals captured.
Step 3OpenAI Platform Changelog: No fresh signals found.
Step 4Anthropic News: 12 fresh signals captured.
Step 5Anthropic Docs Index: No fresh signals found.
Step 6Google AI Blog: 12 fresh signals captured.
Step 7Google AI Dev: 3 fresh signals captured.
Step 8Hugging Face Blog: 12 fresh signals captured.
Step 9OpenAI Model Spec: 12 fresh signals captured.
Step 10OpenAI Research: No fresh signals found.
Step 11Gemini API docs: 12 fresh signals captured.
Step 12Anthropic Prompting Best Practices: 12 fresh signals captured.
Step 13OpenAI Model Spec: No fresh signals found.
Step 14Agent is rewriting the skill body from the fetched source deltas.
Step 15Agent discovered 1 new source(s): OpenAI News (official blog).
Important: modern LLMs are trained with an instruction hierarchy. OpenAI’s IH‑Challenge (Mar 10, 2026) and the Model Spec emphasize a clear priority ordering: System > Developer > User > Tool. Place safety‑critical and policy constraints in the system or developer layer so they remain highest priority and are resilient to lower‑priority inputs (including tool outputs and web content). See OpenAI IH‑Challenge for details: https://openai.com/index/instruction-hierarchy-challenge/ (Mar 10, 2026).
### Model updates (note)
+
+- OpenAI released the GPT-5.4 family (including lower-latency mini/nano variants) used in agent deployments and enterprise runtimes in early 2026. When choosing a model, test both the full and compact variants for accuracy/latency/cost tradeoffs and include per-variant regression tests. Source: OpenAI News (Apr 2026): https://openai.com/index/gradient-labs and related announcements.
+
+- Agents SDK and model-native harnesses (Apr 15, 2026): OpenAI introduced a next-generation Agents SDK that includes native sandbox execution and a model-native harness. If you run agents, validate that your runtime supports sandboxing, short-lived credentials, and auditable execution traces. Do not assume behavior parity between an agent runtime and direct API calls—test both paths. Source: OpenAI News (Apr 15, 2026): https://openai.com/index/the-next-evolution-of-the-agents-sdk.
+
+- Chain-of-thought monitoring and misalignment detection (Mar 19, 2026): OpenAI published internal monitoring findings showing chain-of-thought traces can surface misalignment in coding agents. Where you collect thinking traces, make them auditable, subject to retention policies, and optionally detachable from user-facing answers for privacy and compliance. Source: OpenAI News (Mar 19, 2026): https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment.
−- OpenAIreleasedtheGPT-5.4family(includinglower-latencymini/nanovariants)usedinagentdeploymentsandenterpriseruntimesinearly2026.Whenchoosingamodel,testboththefull and compactvariantsforaccuracy/latencytradeoffs. Source: OpenAINews (Apr 2026)andproduct announcements.
+- Gemini inference tiers (Apr 2, 2026): Google introduced Flex (cost-optimized,higher-latency)andPriority(high-reliability)inferencetiersfortheGeminiAPI.Routebackgroundorbatch-likeworktoFlexandinteractiveuser-facingrequeststoPriority;implementgracefuldowngradelogic and telemetry that recordswhichtierservedeachrequest. Source: GoogleAIBlog (Apr 2,2026):https://blog.google/innovation-and-ai/technology/developers-tools/introducing-flex-and-priority-inference/.
−- Google introduced Flex and Priority inference tiers for the Gemini API (Apr 2, 2026). Use Flex for cost-sensitive/background workloads and Priority for interactive, user-facing workloads; implement graceful-downgrade logic and monitor rate limits. Source: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-flex-and-priority-inference/ (Apr 2, 2026).
### Prompt strategies
Research engine
OpenAI Prompt Engineering now combines 7 tracked sources with 1 trusted upstream skill packs. Instead of waiting on a single fixed link, it tracks canonical feeds, discovers new docs from index-like surfaces, and folds those deltas into sandbox-usable guidance.
OpenAI Prompt Engineering has unusually strong source quality and broad utility, so it deserves prominent placement.
Discovery process
1. Track canonical signals
Monitor 3 feed-like sources for release notes, changelog entries, and durable upstream deltas.
2. Discover net-new docs and leads
Scan 4 discovery-oriented sources such as docs indexes and sitemaps, then rank extracted links against explicit query hints instead of trusting nav order.
3. Transplant from trusted upstreams
Fold implementation patterns from OpenAI Docs so the skill inherits a real operating model instead of boilerplate prose.
4. Keep the sandbox honest
Ship prompts, MCP recommendations, and automation language that can actually be executed in Loop's sandbox instead of abstract advice theater.
System prompt anatomyStep 3: Implement structured output with OpenAIStep 4: Implement with AnthropicStep 5: Agents and orchestratorsEdge cases and gotchas
status
success
triggerAutomation
editoropenai/gpt-5-mini
duration152.7s
Diff▶
+8−7
+Generated: 2026-04-07T09:26:36.997Z
+Summary: This update incorporates recent vendor signals: OpenAI’s instruction-hierarchy research (IH-Challenge), the Responses API agent runtime (shell & container workspace), Anthropic’s XML-style prompt structuring, and Google Gemini’s new inference tiers. Edits clarify instruction priorities, agent orchestration best practices, tool output handling, and model-provider specifics for production prompt design.
+What changed: Added: explicit instruction-hierarchy guidance (System > Developer > User > Tool); expanded agent/orchestration section with Responses API and shell tool guidance; noted Anthropic XML tagging best practices; added Gemini inference-tier guidance. Rewrote: Edge cases and gotchas to reflect new signals. Kept: original structure, examples, and code patterns but updated text for vendor specifics.
−Generated:2026-04-05T09:54:36.216Z
+Body changed:yes
−Summary: OpenAI Prompt Engineering agent run was interrupted: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Pur
−What changed: Agent crashed mid-run after 0 search(es). (agent error: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Purchase credits at htt)
−Body changed: no
Editor: openai/gpt-5-mini
+Changed sections: System prompt anatomy, Step 3: Implement structured output with OpenAI, Step 4: Implement with Anthropic, Step 5: Agents and orchestrators, Edge cases and gotchas
Experiments:
+- Measure format-compliance improvements after replacing free-text outputs with zodResponseFormat across 3 production prompts
+- A/B test agent orchestrator loop lengths (1 vs 3 propose/execute cycles) to compare cost, latency, and accuracy trade-offs
−- Re-run after theissueisresolved.
+- Evaluate model behavior on instruction-conflict prompts beforeand after adding explicit developer instructionstoquantifyIHimprovements
−- Add a higher-signal source.
−- Check gateway credits or rate limits.
Signals:
- News (Anthropic News)
- Research (Anthropic News)
Update history4▶
Apr 7, 20264 sources
This update incorporates recent vendor signals: OpenAI’s instruction-hierarchy research (IH-Challenge), the Responses API agent runtime (shell & container workspace), Anthropic’s XML-style prompt structuring, and Google Gemini’s new inference tiers. Edits clarify instruction priorities, agent orchestration best practices, tool output handling, and model-provider specifics for production prompt design.
Apr 5, 20264 sources
OpenAI Prompt Engineering agent run was interrupted: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Pur
Apr 3, 20264 sources
OpenAI Prompt Engineering agent run was interrupted: Free credits temporarily have rate limits in place due to abuse. We are working on a resolution. Try again later, or pay for credits which continue to have unrestricted access. Pur
Apr 1, 20264 sources
OpenAI Prompt Engineering now tracks Google AI for Developers and 3 other fresh signals.