2. CDN routes to the nearest function region (or the configured region)
3. If no warm instance exists, a cold start provisions a new isolate/container
4. Function executes, returns a response
5. Instance stays warm for subsequent requests (Fluid compute reduces cold-start frequency and tail latency)
6. After idle timeout, instance is recycled
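Steps 4–5 (warm reuse) can be observed from inside a function: module scope is initialized once per instance and then shared by every warm invocation that lands on it. A minimal sketch; the route shape assumes a Next.js-style handler and the counter is purely illustrative:

```typescript
// Module scope runs once per instance (during the cold start) and is then
// reused by every warm invocation routed to the same instance.
let invocations = 0;

export async function GET(): Promise<Response> {
  invocations += 1;
  // invocations === 1 means this request paid the cold start;
  // larger values mean the warm instance was reused.
  return Response.json({ invocations });
}
```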
### Fluid compute
Fluid compute is Vercel's hybrid execution model that reduces cold starts by keeping higher-performance instances warm while preserving serverless scaling and billing. Key documented facts:
- Fluid compute is enabled by default for new projects (as of April 23, 2025). (Vercel Docs — Fluid compute, Jan 29, 2026)
- Supported runtimes: Node.js, Python, Edge, Bun, Rust (see the Fluid compute docs for the full list). (Vercel Docs — Fluid compute, Jan 29, 2026)
- Capabilities: optimized in-function concurrency (multiple invocations per instance), automatic bytecode optimization/caching, pre-warming of instances, background post-response work via waitUntil, and zone/region failover. These reduce container provisioning and tail latencies but do not eliminate heavy module initialization costs. (Vercel Docs — Fluid compute, Jan 29, 2026)
Operational guidance:
- Prefer Fluid compute for APIs, streaming endpoints, or AI inference paths where cold-start tail latency matters.
- Keep application/module initialization minimal; Fluid compute reduces container provisioning, but expensive top-level initialization still impacts first-invocation performance for a process.
- Use waitUntil for short background work that can complete after the response (telemetry, logs, non-critical tasks).
- Enable Fluid compute per project or per deployment via the dashboard or vercel.json when you want a controlled rollout. (Vercel Docs — Fluid compute, Jan 29, 2026)
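The waitUntil guidance can be sketched as follows. This is a self-contained illustration: in a real Vercel function, waitUntil comes from the @vercel/functions package; the local stand-in and the recordEvent helper exist only so the snippet runs anywhere:

```typescript
// Stand-in for waitUntil from '@vercel/functions': it registers a promise
// the platform keeps alive after the response has been sent to the client.
const pending: Promise<unknown>[] = [];
const waitUntil = (p: Promise<unknown>): void => { pending.push(p); };

const telemetryLog: string[] = [];
async function recordEvent(event: string): Promise<void> {
  telemetryLog.push(event); // stand-in for a telemetry/logging POST
}

// Hypothetical webhook handler: acknowledge immediately, then let
// non-critical work (telemetry, logs) finish after the response.
async function handleWebhook(body: { event: string }): Promise<{ status: number }> {
  waitUntil(recordEvent(body.event));
  return { status: 200 };
}
```

The handler returns as soon as the acknowledgment is ready; the platform keeps the instance alive until the registered promise settles.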
### Cold start anatomy and mitigation
Phases and mitigations:
- Container provision: mitigated by Fluid compute and pre-warming
- Runtime init: minimal for Edge runtimes (V8 isolates); prefer Edge for small, latency-sensitive middleware
- Code/module init: minimize heavy top-level work; lazy-load DB clients and SDKs
- Handler execution: keep hot paths efficient; external calls determine much of observed latency
Additional mitigations:
- Acknowledge webhooks quickly and offload heavy work to Vercel Queues
- Use Edge functions for auth/geo-routing checks to avoid a full container path
- Instrument P95/P99 cold-start telemetry and optimize the highest-impact functions first
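The "minimize code/module init" mitigation is usually implemented with a lazy, module-scope cache. A sketch; the client here is a hypothetical stand-in for a heavy SDK such as a DB driver:

```typescript
// lazy(): run the expensive factory once per instance, on first use,
// instead of at module load time (which would lengthen every cold start).
function lazy<T>(factory: () => T): () => T {
  let value: T | undefined;
  let initialized = false;
  return () => {
    if (!initialized) {
      value = factory();
      initialized = true;
    }
    return value as T;
  };
}

let initCount = 0; // counts how many times the heavy init actually ran
const getDbClient = lazy(() => {
  initCount += 1; // imagine opening a connection pool here
  return { query: (sql: string) => `rows for: ${sql}` };
});
```

Handlers call getDbClient() on the hot path; the factory runs only on the first invocation of a given instance, and warm invocations reuse the cached client.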
### Function types and durations
- Configure per-function maximum duration using the recommended patterns: for modern Next.js and Node.js-based routes, export a numeric maxDuration from the function file (for example, `export const maxDuration = 5`). For other runtimes or older frameworks, set functions.maxDuration in vercel.json or update the project default in the dashboard. (Vercel Docs — Configuring Maximum Duration, Feb 27, 2026)
- Duration is wall-clock time: streamed responses, waits, and I/O all count toward the limit. Plan accordingly for streaming or long-polling endpoints. (Vercel Docs — Configuring Maximum Duration, Feb 27, 2026)
- Defaults and limits: Hobby default/maximum is 300s (5 minutes); Pro and Enterprise projects may configure higher limits depending on plan and region. Always verify plan-specific limits in the dashboard. (Vercel Docs — Configuring Maximum Duration, Feb 27, 2026)
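The per-function pattern looks like this in a Next.js App Router route file. The file path and report builder are hypothetical; only the numeric maxDuration export is the documented mechanism:

```typescript
// app/api/report/route.ts (hypothetical path)
// Documented pattern: a numeric maxDuration export raises this route's
// wall-clock limit to 30 seconds (subject to plan limits).
export const maxDuration = 30;

export async function GET(): Promise<Response> {
  // Streaming, waits, and upstream I/O all count toward the 30s budget.
  const report = await buildReport();
  return Response.json(report);
}

async function buildReport(): Promise<{ ok: boolean; rows: number }> {
  // Stand-in for slow upstream calls (DB queries, third-party APIs).
  return { ok: true, rows: 3 };
}
```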
## Workflow
### Step 4 — Streaming responses for AI
- Streaming endpoints are supported but count toward maxDuration. If you need longer-lived streaming, evaluate Vercel Workflow or dedicated streaming services.
- For AI streaming, Fluid compute is the preferred option for reducing tail latency, thanks to pre-warming and in-function concurrency; still minimize module init costs. (Vercel Docs — Fluid compute, Jan 29, 2026)
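A minimal streaming route sketch. The token array is a stand-in for a model's output stream; remember that the entire stream duration is wall-clock time against maxDuration:

```typescript
// Streams chunks to the client as they become available. On Vercel, time
// spent streaming counts toward the function's maxDuration.
export async function GET(): Promise<Response> {
  const encoder = new TextEncoder();
  const tokens = ['Hello', ', ', 'stream']; // stand-in for model tokens
  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      for (const token of tokens) {
        controller.enqueue(encoder.encode(token)); // flushed incrementally
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}
```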
## Optimize cold starts (practical patterns)
- Lazy-load heavy dependencies only when needed (Stripe, DB SDKs, ML models).
- Use shared/global variables for pooled clients (DB, HTTP client) — they persist across warm invocations but must not hold request-scoped state.
- For latency-sensitive endpoints, prefer the Edge runtime for authentication/validation paths.
- For webhooks, respond immediately then enqueue to Queues; for critical synchronous work, ensure your maxDuration is sufficient and allocate extra time.
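The shared/global-client bullet in practice: cache connection-like objects at module scope, keyed if needed, and never store request data there. A self-contained sketch; the agent object is a stand-in for a real keep-alive agent or DB pool:

```typescript
// Module-scope cache shared by all warm invocations on this instance.
// Safe contents: connection pools, HTTP agents, parsed config.
// NEVER store request-scoped state (user IDs, auth tokens) here.
const agentCache = new Map<string, { host: string; createdAt: number }>();

function getAgent(host: string): { host: string; createdAt: number } {
  let agent = agentCache.get(host);
  if (agent === undefined) {
    agent = { host, createdAt: Date.now() }; // stand-in for a keep-alive agent
    agentCache.set(host, agent);
  }
  return agent;
}
```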
## Vercel Queues