Streaming
Set stream: true on a chat completion and Recovea returns the response as server-sent events (SSE), in the same wire format as OpenAI, byte for byte. Your SDK's streaming iterator works unchanged. Recovea flushes each chunk as soon as it's produced, with no buffering, so the gateway adds no buffering delay to the stream: first-token latency is essentially your provider's, plus Recovea's per-request overhead. We budget single-digit to low-double-digit milliseconds for that overhead; it is a target, not a guaranteed SLO.
Turn it on
from openai import OpenAI

client = OpenAI(
    base_url="https://api.recovea.ai/v1",
    api_key="rcv_live_…",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

for chunk in stream:
    # delta.content is None on the role-only first chunk and on the terminal chunk
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Each iteration yields one chunk. The text is on choices[0].delta.content: on delta, not message as in a non-streamed response. The first chunk carries only the role and the terminal chunk carries an empty delta, so delta.content is None on both; guard for it (the or "" above).
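To check the latency claim above on your own traffic, you can time the arrival of the first non-empty content fragment. A minimal sketch, assuming the chunk shape shown in this section; time_to_first_token is a hypothetical helper, not part of the SDK or of Recovea:

```python
import time
from typing import Iterable, List, Optional, Tuple


def time_to_first_token(
    stream: Iterable, start: Optional[float] = None
) -> Tuple[Optional[float], str]:
    """Consume a streaming chat completion and return
    (seconds until the first non-empty content fragment, full text).

    Hypothetical helper. `start` should be a time.perf_counter() reading taken
    just before create() was called; if omitted, timing starts here.
    The latency is None if no content ever arrived.
    """
    t0 = time.perf_counter() if start is None else start
    first: Optional[float] = None
    parts: List[str] = []
    for chunk in stream:
        if not chunk.choices:  # a usage-only chunk has an empty choices list
            continue
        content = chunk.choices[0].delta.content
        if content:  # None on the role-only first chunk and the terminal chunk
            if first is None:
                first = time.perf_counter() - t0
            parts.append(content)
    return first, "".join(parts)
```

Take t0 = time.perf_counter() immediately before calling client.chat.completions.create(..., stream=True) and pass it as start, so the measurement covers connection setup as well as the provider's time to first token.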
The SSE framing
The response is Content-Type: text/event-stream. Each event is a single line:
data: {json}
followed by a blank line. There are no event: lines for chat completions; the stream is data-only. Every chunk has object: "chat.completion.chunk" and shares the same id, created, and model across the whole response:
{
  "id": "chatcmpl-abc",
  "object": "chat.completion.chunk",
  "created": 1717000000,
  "model": "gpt-4o",
  "choices": [{ "index": 0, "delta": { "content": "Hel" }, "finish_reason": null }]
}
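If you need a test fixture or a mock endpoint that speaks this framing, the body can be produced as below. This is a sketch of the documented format only, not Recovea's server code; encode_sse is a hypothetical name:

```python
import json
from typing import Iterable


def encode_sse(chunks: Iterable[dict]) -> bytes:
    """Serialize chunk dicts into a data-only SSE body.

    Each event is one `data: {json}` line followed by a blank line, and the
    body ends with `data: [DONE]`. Hypothetical helper for fixtures/mocks.
    """
    out = []
    for chunk in chunks:
        # separators=(",", ":") produces compact JSON like the raw samples
        out.append("data: " + json.dumps(chunk, separators=(",", ":")) + "\n\n")
    out.append("data: [DONE]\n\n")  # terminator expected by SDK iterators
    return "".join(out).encode("utf-8")
```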
The sequence is:
- First chunk: delta carries role: "assistant", no content yet.
- Middle chunks: delta.content carries successive text fragments.
- Terminal chunk: an empty delta: {} with finish_reason set (stop, length, tool_calls, content_filter).
- Terminator: exactly data: [DONE], then the stream closes.
finish_reason is null on every chunk except the terminal one. The SDK loops until it sees [DONE], then stops the iterator.
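If you parse the stream yourself rather than through the SDK, the sequence above reduces to a small fold over parsed chunk dicts. A sketch that tracks choice 0 only and treats a second finish_reason as an error; assemble is a hypothetical helper:

```python
from typing import Iterable, List, Optional, Tuple


def assemble(chunks: Iterable[dict]) -> Tuple[str, str, Optional[str]]:
    """Fold parsed chunk dicts into (role, text, finish_reason) for choice 0.

    Hypothetical helper mirroring the documented sequence: role on the first
    delta, content fragments in the middle, finish_reason on the terminal
    chunk only. Chunks with an empty choices list (usage) are skipped.
    """
    role = ""
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue  # this sketch only tracks the first choice
            delta = choice.get("delta") or {}
            if "role" in delta:
                role = delta["role"]
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                if finish_reason is not None:
                    raise ValueError("finish_reason appeared on more than one chunk")
                finish_reason = choice["finish_reason"]
    return role, "".join(parts), finish_reason
```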
Raw SSE over curl
Pass "stream": true and read the wire directly:
curl -N https://api.recovea.ai/v1/chat/completions \
-H "Authorization: Bearer rcv_live_…" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Count to three."}],
"stream": true
}'
-N disables curl's output buffering so you see chunks as they arrive. The body looks like this (abbreviated, with some middle chunks and the blank line after each event omitted):
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1717000000,"model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1717000000,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1717000000,"model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
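The same wire format can be read without an SDK. A minimal decoder, assuming one data: line per event as documented above (it does not join multi-line data fields, which general SSE allows); iter_sse_chunks is a hypothetical name. It accepts any iterable of byte lines, such as a file or the response object returned by urllib.request.urlopen, which iterates line by line:

```python
import io
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield each chunk dict from a data-only SSE body; stop at data: [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separator lines (and anything unexpected) are skipped
        payload = line[len("data:"):].lstrip(" ")
        if payload == "[DONE]":
            return  # terminator: the stream is complete
        yield json.loads(payload)


# Usage against a trimmed copy of the sample body above (id/model fields dropped).
sample = io.BytesIO(
    b'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}\n\n'
    b'data: {"choices":[{"index":0,"delta":{"content":"One"},"finish_reason":null}]}\n\n'
    b'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}\n\n'
    b"data: [DONE]\n\n"
)
chunks = list(iter_sse_chunks(sample))
text = "".join(c["choices"][0]["delta"].get("content") or "" for c in chunks)
```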
Streaming tool calls
When the model invokes a tool, the call streams as delta.tool_calls instead of delta.content. The id and function.name arrive on the first fragment; function.arguments arrives in pieces that you concatenate per call, keyed by the tool call's index (tool_calls[].index):
{
  "choices": [{
    "index": 0,
    "delta": {
      "tool_calls": [{
        "index": 0,
        "id": "call_xyz",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"ci" }
      }]
    },
    "finish_reason": null
  }]
}
Subsequent chunks carry only function.arguments fragments ("ty\": \"Paris\"}"). The terminal chunk sets finish_reason: "tool_calls". Accumulate the arguments string across fragments before parsing it as JSON.
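A sketch of that accumulation over parsed chunk dicts (choice 0 only), grouping fragments by tool_calls[].index and parsing arguments only after the stream ends; collect_tool_calls and its flattened output shape are hypothetical:

```python
import json
from typing import Dict, Iterable, List


def collect_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed delta.tool_calls fragments into complete calls.

    Hypothetical helper. id, type and function.name are taken from whichever
    fragment carries them (normally the first); function.arguments strings
    are concatenated in arrival order and parsed once at the end.
    """
    calls: Dict[int, dict] = {}
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue  # this sketch only tracks the first choice
            for frag in (choice.get("delta") or {}).get("tool_calls") or []:
                call = calls.setdefault(
                    frag["index"],
                    {"id": None, "type": None, "name": None, "arguments": ""},
                )
                if frag.get("id"):
                    call["id"] = frag["id"]
                if frag.get("type"):
                    call["type"] = frag["type"]
                fn = frag.get("function") or {}
                if fn.get("name"):
                    call["name"] = fn["name"]
                call["arguments"] += fn.get("arguments") or ""
    # Parse only after all fragments are in: a partial arguments string is not valid JSON.
    return [
        {**call, "arguments": json.loads(call["arguments"] or "{}")}
        for _, call in sorted(calls.items())
    ]
```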
Token usage
By default streamed responses don't include a usage object. To get one, set stream_options:
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
    stream_options={"include_usage": True},
)
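Your provider then emits one extra chunk before [DONE] with an empty choices list ([]) and a populated usage object; Recovea passes it through byte for byte. Every earlier chunk has usage: null. Because that last chunk has no choices, a loop that reads chunk.choices[0] unconditionally raises IndexError on it, so check chunk.choices first.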
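A sketch of a loop that collects both the text and the trailing usage object from SDK chunks, checking choices before indexing it; read_stream_with_usage is a hypothetical helper name:

```python
from typing import Iterable, List, Optional, Tuple


def read_stream_with_usage(stream: Iterable) -> Tuple[str, Optional[object]]:
    """Return (text, usage) from a stream created with
    stream_options={"include_usage": True}.

    Hypothetical helper. Content chunks carry usage=None; the final usage
    chunk has an empty choices list, so choices is checked before [0].
    """
    parts: List[str] = []
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if chunk.choices:
            parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts), usage
```

Called on the stream created above, usage.prompt_tokens, usage.completion_tokens and usage.total_tokens are available once the loop finishes; usage stays None if include_usage was not set.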
Headers and fail-open
The stream response carries x-request-id plus Recovea's own x-recovea-trace-id header, so every streamed call can be traced back to your ledger. Recovea meters usage asynchronously as the response passes through; it never holds back the first byte to do so.
Streaming is fail-open, same as the rest of the API. If anything in Recovea's optimization layer fails mid-setup, the request flows straight through to your provider on your own key. You get a correct, well-formed SSE stream either way.
Next
- Chat Completions: the full request and response reference
- Errors: status codes and the error envelope