FlexInference: deadline-aware, OpenAI-compatible inference routing at about 47% lower cost. 3ms on a cold start fast.No fallback reliable.Outcome based affordable.
We find you cheaper inference in the SLA you provide.
Cost
−47%
cheaper
FlexInference
FlexInference
Latency
+4%
the trade
FlexInference
FlexInference
// pooled p50 across 10,000 requests
Built to not be annoying to debug at 2am
3 ms routing across 300 cities.
You reach the nearest edge in tens of milliseconds, because we run on Cloudflare Workers in 300 or more cities. Our own routing adds single-digit milliseconds, about 3 ms on a cold start. The wire does the waiting, not us.
Your key, locked with AES-256-GCM.
Your key stays yours. We encrypt your OpenAI, Gemini, or Anthropic key at rest with AES-256-GCM, proxy your request, and never store or read your prompts or replies. Because you bring your own key, the provider bills your account and you keep your own credits and discounts.
You get a real error, not a fake 200.
If you send wrong parameters, we do not strip them to force a win. We fail first and fail fast, and the provider's real status code and body pass straight through to you. You debug against the truth, not our guess.
Typed with pydantic and zod.
The Python SDK is typed with pydantic and the TypeScript SDK is typed with zod, for the parameters we can be sure about. So you get real types without us gating a provider feature that just launched, and your coding agents work great against it.
Errors your agent can fix itself.
Every error we return is built for a coding agent to fix. It keeps the OpenAI error shape and a machine-readable code, spells out what broke and the exact fix with an example, and links straight to the reference with a doc_url. And we go further. Our MCP server lets Claude, Cursor, and other agents search our docs, look up any error code, and manage your keys over OAuth.
Pay 47% less for 4% more latency.
We route your request through the cheaper flex tier first across providers to find you cheaper inference. If we hit a 429, a 5xx, or the request would miss your SLA, we upgrade it to standard on our own, so your systems never fail on our routing. Across a pooled p50 sample of about 18,000 requests, this nets about 47% lower cost for about 4% more latency.
No new SDK semantics to learn.
Point the base URL at FlexInference, send your FlexInference key, and set start_within on every request to start saving.
curl https://api.flexinference.com/v1/responses \
-H "Authorization: Bearer flex_live_..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.2",
"input": "Summarize this thread.",
"start_within": "00h-00m-30s"
}'First-party SDKs for Python and TypeScript. The OpenAI SDK above is the whole integration.
Only pay when we save you money.
Routing to inference providers through FlexInference is free. If you ask us to hunt for cheaper inference within the SLA you set via start_within, we charge 20% of what we save you - and nothing when we come up empty. So your everyday routing stays free, and you only pay when we actually make a request cheaper.
For example, a $10 request we bring down to $5 is charged an extra $1 as our fee - $6 total, a 40% discount overall.
See full pricing ->How to get lower inference costs?
- 1Sign in
- 2Add your OpenAI, Gemini, or Anthropic key
- 3Provision a FlexInference API key
- 4Setup SDK
- 5Enjoy lower inference bills