A2AUserv3FreePublic

OpenAI RAG Pipelines

OpenAI RAG Pipelines now tracks News and 3 other fresh signals.

LoopVerified7 sources · Updated Apr 3, 2026

Content

OpenAI RAG Pipelines

Build retrieval-augmented generation systems that ground LLM responses in your own data using embeddings, vector search, and context window management.

When to use

The model needs access to proprietary data not in its training set
Building a Q&A system over documentation, codebases, or knowledge bases
Need grounded, citation-backed answers instead of potential hallucinations
Data changes frequently and re-training is impractical
Implementing semantic search over large document collections

When NOT to use

The answer is in the model's training data and doesn't need grounding
You have fewer than 50 documents — just put them in the context window
Real-time data is needed (stock prices, live APIs) — use tool calling instead
The query is transactional, not informational (CRUD operations)
Exact keyword match is sufficient — use a traditional search engine

Core concepts

RAG pipeline architecture

┌────────────────────────────────────────────────────┐
│                   Ingestion Pipeline                │
│                                                      │
│  Documents → Chunking → Embedding → Vector Store     │
│    (PDF,      (split     (OpenAI      (Pinecone,     │
│     MD,       into       text-         Supabase       │
│     HTML)     chunks)    embedding-3)  pgvector)      │
└────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────┐
│                   Query Pipeline                     │
│                                                      │
│  User Query → Embed → Vector Search → Rerank →       │
│                                                      │
│  → Build Prompt (query + retrieved chunks) →          │
│                                                      │
│  → LLM Generation → Response with citations           │
└────────────────────────────────────────────────────┘

When to use

The model needs access to proprietary data not in its training set

Building a Q&A system over documentation, codebases, or knowledge bases

Need grounded, citation-backed answers instead of potential hallucinations

Data changes frequently and re-training is impractical

Implementing semantic search over large document collections

When NOT to use

The answer is in the model's training data and doesn't need grounding

You have fewer than 50 documents — just put them in the context window

Real-time data is needed (stock prices, live APIs) — use tool calling instead

The query is transactional, not informational (CRUD operations)

Exact keyword match is sufficient — use a traditional search engine

Core concepts

RAG pipeline architecture

┌────────────────────────────────────────────────────┐ │ Ingestion Pipeline │ │ │ │ Documents → Chunking → Embedding → Vector Store │ │ (PDF, (split (OpenAI (Pinecone, │ │ MD, into text- Supabase │ │ HTML) chunks) embedding-3) pgvector) │ └────────────────────────────────────────────────────┘ ┌────────────────────────────────────────────────────┐ │ Query Pipeline │ │ │ │ User Query → Embed → Vector Search → Rerank → │ │ │ │ → Build Prompt (query + retrieved chunks) → │ │ │ │ → LLM Generation → Response with citations │ └────────────────────────────────────────────────────┘

Strategy	When to use	Chunk size
Fixed-size	Simple, uniform documents	500-1000 tokens
Sentence-based	Prose, articles, documentation	3-5 sentences
Paragraph-based	Well-structured documents	1-3 paragraphs
Semantic	Mixed-format, variable structure	Dynamic (by topic)
Header-based	Documentation with clear headings	Per section
Code-aware	Source code repositories	Per function/class

Model	Dimensions	Max tokens	Use case
text-embedding-3-small	1536	8191	Cost-effective general use
text-embedding-3-large	3072	8191	High-accuracy retrieval
text-embedding-ada-002	1536	8191	Legacy (use 3-small instead)

Criterion	How to measure
Retrieval precision@K	% of top-K results that are relevant to the query
Retrieval recall	% of relevant documents found in top-K
Answer accuracy	% of answers that match ground-truth (human eval or LLM judge)
Citation accuracy	% of citations that point to the actual source of the claim
Hallucination rate	% of claims not supported by any retrieved chunk
Latency	End-to-end time from query to response (target: < 3s)
Ingestion throughput	Documents per second during bulk ingestion
Cost per query	Embedding cost + vector search cost + LLM generation cost

Su	Mo	Tu	We	Th	Fr	Sa

OpenAI RAG Pipelines

Content

OpenAI RAG Pipelines

When to use

When NOT to use

Core concepts

RAG pipeline architecture

OpenAI RAG Pipelines

Content

OpenAI RAG Pipelines

When to use

When NOT to use

Core concepts

RAG pipeline architecture

Chunking strategies

Embedding models

Workflow

Step 1: Document chunking

Step 2: Generate embeddings

Step 3: Store in vector database (Pinecone)

Step 4: Alternative — Supabase pgvector

Step 5: Build the RAG prompt

Step 6: End-to-end RAG query

Examples

Example 1: Hybrid search (semantic + keyword)

Example 2: Context window management

Example 3: Incremental ingestion

Decision tree

Edge cases and gotchas

Evaluation criteria

Research-backed changes

Fresh signals

Next edits

Activity

Automation & run history

Latest refresh trace

Research engine

Sources