Build retrieval-augmented generation systems that ground LLM responses in your own data using embeddings, vector search, and context window management.
When to use
The model needs access to proprietary data not in its training set
Building a Q&A system over documentation, codebases, or knowledge bases
Need grounded, citation-backed answers instead of potential hallucinations
Data changes frequently and re-training is impractical
Implementing semantic search over large document collections
When NOT to use
The answer is in the model's training data and doesn't need grounding
You have fewer than 50 documents — just put them in the context window
Real-time data is needed (stock prices, live APIs) — use tool calling instead
The query is transactional, not informational (CRUD operations)
Exact keyword match is sufficient — use a traditional search engine
Step 3: Store and query vectors (Pinecone)
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone();
const index = pinecone.index("knowledge-base");
async function upsertChunks(chunks: Array<Chunk & { embedding: number[] }>) {
const vectors = chunks.map(chunk => ({
id: chunk.id,
values: chunk.embedding,
metadata: {
text: chunk.text,
source: chunk.metadata.source,
title: chunk.metadata.title,
section: chunk.metadata.section,
},
}));
// Upsert in batches of 100
for (let i = 0; i < vectors.length; i += 100) {
await index.upsert(vectors.slice(i, i + 100));
}
}
async function queryVectors(
embedding: number[],
topK = 5,
filter?: Record<string, string>
) {
const result = await index.query({
vector: embedding,
topK,
includeMetadata: true,
filter,
});
return result.matches?.map(match => ({
id: match.id,
score: match.score ?? 0,
text: (match.metadata?.text as string) ?? "",
source: (match.metadata?.source as string) ?? "",
title: (match.metadata?.title as string) ?? "",
})) ?? [];
}
Step 4: Alternative — Supabase pgvector
import { createClient } from "@supabase/supabase-js";
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_KEY!
);
async function upsertChunksPgvector(chunks: Array<Chunk & { embedding: number[] }>) {
const rows = chunks.map(chunk => ({
id: chunk.id,
content: chunk.text,
embedding: chunk.embedding,
metadata: chunk.metadata,
}));
const { error } = await supabase.from("documents").upsert(rows);
if (error) throw error;
}
async function queryPgvector(embedding: number[], topK = 5) {
const { data, error } = await supabase.rpc("match_documents", {
query_embedding: embedding,
match_threshold: 0.7,
match_count: topK,
});
if (error) throw error;
return data;
}
// Required SQL function for pgvector similarity search:
// CREATE FUNCTION match_documents(
// query_embedding vector(1536),
// match_threshold float,
// match_count int
// ) RETURNS TABLE (id text, content text, similarity float)
// LANGUAGE plpgsql AS $$
// BEGIN
// RETURN QUERY
// SELECT d.id, d.content,
// 1 - (d.embedding <=> query_embedding) as similarity
// FROM documents d
// WHERE 1 - (d.embedding <=> query_embedding) > match_threshold
// ORDER BY d.embedding <=> query_embedding
// LIMIT match_count;
// END;
// $$;
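The `match_documents` function above assumes a `documents` table with a pgvector column. A minimal schema sketch (the table name, column names, and 1536-dimension size are assumptions matching the code above; adjust them to your embedding model):

```sql
-- Sketch of the table assumed by upsertChunksPgvector and match_documents
create extension if not exists vector;

create table if not exists documents (
  id text primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)  -- must match the embedding model's output dimension
);

-- Approximate-nearest-neighbor index; vector_cosine_ops pairs with the
-- <=> (cosine distance) operator used in match_documents
create index on documents using ivfflat (embedding vector_cosine_ops) with (lists = 100);
```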
Step 5: Build the RAG prompt
function buildRagPrompt(
query: string,
retrievedChunks: Array<{ text: string; source: string; title: string; score: number }>
): string {
const contextBlock = retrievedChunks
.map((chunk, i) =>
`[Source ${i + 1}: ${chunk.title} (${chunk.source})]\n${chunk.text}`
)
.join("\n\n---\n\n");
return `Answer the user's question based on the provided context.
## Rules
- Only use information from the provided context
- Cite sources using [Source N] notation
- If the context doesn't contain the answer, say "I don't have enough information"
- Do not make up information not present in the context
- Prefer the most relevant source when multiple sources agree
## Context
${contextBlock}
## Question
${query}`;
}
Step 6: End-to-end RAG query
async function ragQuery(userQuestion: string) {
// 1. Embed the question
const queryEmbedding = await embedQuery(userQuestion);
// 2. Retrieve relevant chunks
const chunks = await queryVectors(queryEmbedding, 5);
// 3. Build the augmented prompt
const augmentedPrompt = buildRagPrompt(userQuestion, chunks);
// 4. Generate the response
const response = await openai.responses.create({
model: "gpt-4o",
instructions: "You are a helpful assistant that answers questions based on provided context.",
input: augmentedPrompt,
});
return {
answer: response.output_text,
sources: chunks.map(c => ({ title: c.title, source: c.source, score: c.score })),
};
}
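`ragQuery` assumes an `embedQuery` helper from the embedding step. A minimal sketch, assuming OpenAI's `text-embedding-3-small` (1536 dimensions, matching the pgvector schema) and taking the client as a parameter so it can be swapped or mocked; the `EmbeddingsClient` type is a structural stand-in for the OpenAI SDK:

```typescript
// Structural type matching the shape of the OpenAI SDK's embeddings client
type EmbeddingsClient = {
  embeddings: {
    create(args: {
      model: string;
      input: string;
    }): Promise<{ data: Array<{ embedding: number[] }> }>;
  };
};

// Hypothetical helper assumed by ragQuery above. The model MUST be the same
// one used at ingestion time; mixing embedding models breaks similarity.
async function embedQuery(
  query: string,
  client: EmbeddingsClient
): Promise<number[]> {
  const response = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  return response.data[0].embedding;
}
```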
function fitChunksToWindow(
chunks: Array<{ text: string; score: number }>,
maxContextTokens: number
): string[] {
const selected: string[] = [];
let totalTokens = 0;
// Sort by relevance score (highest first)
const sorted = [...chunks].sort((a, b) => b.score - a.score);
for (const chunk of sorted) {
const chunkTokens = estimateTokens(chunk.text);
if (totalTokens + chunkTokens > maxContextTokens) break;
selected.push(chunk.text);
totalTokens += chunkTokens;
}
return selected;
}
// Budget: model context - system prompt - output reserve
const maxContext = 128000 - 2000 - 4000; // for gpt-4o
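`fitChunksToWindow` relies on an `estimateTokens` helper that isn't shown here. A rough character-based sketch (the 4-characters-per-token ratio is a common heuristic for English prose; use a real tokenizer such as tiktoken when the budget is tight):

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Math.ceil rounds up so the budget errs on the safe side.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```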
Example 3: Incremental ingestion
async function ingestNewDocuments(docs: Document[]) {
// Only process documents not already in the vector store
const existingIds = new Set(await getStoredDocumentIds());
const newDocs = docs.filter(d => !existingIds.has(d.id));
if (newDocs.length === 0) return { ingested: 0 };
const chunks = newDocs.flatMap(doc => chunkDocument(doc));
const embedded = await embedChunks(chunks);
await upsertChunks(embedded);
return { ingested: newDocs.length, chunks: embedded.length };
}
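Filtering by ID alone misses the stale-embeddings gotcha: a document whose ID already exists but whose text changed is skipped. One sketch is to store a content hash per document and re-embed on mismatch (the `findStaleDocs` helper and its Map-based hash store are hypothetical, not part of the pipeline above):

```typescript
import { createHash } from "node:crypto";

// Hash the document text so edits to an existing ID are detected
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Hypothetical helper: returns documents that are new OR whose content
// changed since the stored hash was recorded, so both get (re-)embedded
function findStaleDocs(
  docs: Array<{ id: string; text: string }>,
  storedHashes: Map<string, string>
): Array<{ id: string; text: string }> {
  return docs.filter(d => storedHashes.get(d.id) !== contentHash(d.text));
}
```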
Decision tree
Do you need external knowledge in the LLM response?
├── No → Standard prompting (no RAG needed)
└── Yes
├── How much data?
│ ├── < 50 docs / < 100K tokens → Put it all in context (no vector DB)
│ ├── 50-10K docs → Single vector index with metadata filters
│ └── 10K+ docs → Partitioned indexes or hybrid search
├── Data freshness?
│ ├── Static → One-time ingestion
│ ├── Weekly updates → Batch re-ingestion
│ └── Real-time → Incremental ingestion + cache invalidation
├── Search type?
│ ├── Semantic similarity → Vector search only
│ ├── Exact keyword match → Full-text search only
│ └── Both → Hybrid search with RRF
└── Vector store?
├── Already using Supabase → pgvector extension
├── Need managed service → Pinecone
├── Need open source → Qdrant, Weaviate, Milvus
└── Prototyping → In-memory with cosine similarity
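The "Hybrid search with RRF" branch can be sketched as a pure ranking merge. Reciprocal Rank Fusion scores each document by summing 1/(k + rank) across the ranked lists (vector results plus full-text results); k = 60 is the conventional constant:

```typescript
// Reciprocal Rank Fusion: merge several ranked ID lists into one ranking.
// Documents appearing high in multiple lists accumulate the largest scores.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // 1-based rank within this list
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```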
Edge cases and gotchas
Chunk boundary artifacts: Important information split across two chunks — use overlap to mitigate
Embedding model mismatch: Query and document embeddings must use the same model — mixing models breaks similarity
Metadata filtering: Filter before vector search, not after — otherwise you get irrelevant results within the top K
Stale embeddings: When documents update, old embeddings remain — implement a re-indexing strategy
Score calibration: Cosine similarity scores are not comparable across queries — use rank, not absolute score
Context window overflow: Retrieved chunks + prompt + expected output must fit in the model's context window
Hallucination despite RAG: The model may still hallucinate even with retrieved context — add "only use provided context" instructions
Multi-language: Embedding models work best in English — multilingual retrieval needs multilingual embedding models
Code vs. prose: Code chunks need different splitting strategies (by function/class, not by paragraph)
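The chunk-boundary gotcha above is why chunkers use overlap. A minimal character-based sketch (production chunkers split on sentence or token boundaries; the default sizes here are illustrative):

```typescript
// Sliding-window chunker: each chunk repeats the last `overlap` characters of
// the previous one, so text straddling a boundary is whole in at least one chunk.
function chunkWithOverlap(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```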
Check OpenAI for new embedding models or retrieval-API changes. Scan Hugging Face for open embedding model releases. Monitor LangChain for retriever and reranker updates. Check Supabase/Neon for pgvector improvements. Update chunking strategies, similarity-search patterns, and evaluation benchmarks.