Question 1

What is start_within?

Accepted Answer

A required field on every request. Set it to default, priority, auto, or a duration HHh-MMm-SSs (5 seconds to 10 minutes). In the duration you provide we search for cheaper inference through different providers that support the model you selected. If we can't find cheaper inference in that time, we escalate your request to a default request. This way you get a response, but whenever possible, for cheaper.

Question 2

How much does FlexInference save?

Accepted Answer

On average across 10,000 requests, flex routing cut cost about 47% (pooled p50 of $0.153 down to $0.081 per request) for about 4% higher latency (1312 ms to 1360 ms). Actual savings depend on your models, providers, and the deadline you set. Anthropic is proxy-only and does not get the flex savings.

Question 3

Which providers and models do you support?

Accepted Answer

All OpenAI, Gemini, and Anthropic models. Only those that support flex are available for cheaper inference search.

Question 4

Do you have an MCP server?

Accepted Answer

Yes. Point any MCP client such as Claude or Cursor at https://mcp.flexinference.com/mcp. Two docs tools work with no login. They search the docs and look up any error code. After a one-time OAuth sign-in, an agent can also read your usage and savings, list, create, or revoke API keys, and check billing, each behind its own scope. We never take a raw provider key over MCP, so add those on the dashboard. See the MCP docs for setup.

Question 5

Are your errors built for coding agents?

Accepted Answer

Yes. Every error we return keeps the OpenAI error shape, so your client parses it unchanged, and adds a machine-readable code, the offending field, and a doc_url that links straight to the reference. The message states what broke, the rule you hit, and the exact fix with an example, so branching on the code and showing the message is usually enough for you or your agent to fix it on the first try.

Question 6

Do Anthropic models get the cheaper inference search?

Accepted Answer

No. Anthropic is proxy-only: no flex race and no cost savings. You get the same unified API, your key encrypted at rest, and service-tier control (default, auto), and you draw down your own Anthropic credits. Call any claude-* model on /v1/responses, or use the Anthropic Messages format directly on /v1/messages.

Question 7

Do you support caching?

Accepted Answer

Yes lexical caching, semantic caching coming soon.

Question 8

Do you support fallback providers?

Accepted Answer

Not yet, coming soon.

Question 9

Do you support upstream retries (to the LLM providers)?

Accepted Answer

Not yet, coming soon.

Question 10

Do you support automatic model selection?

Accepted Answer

Not yet, coming soon. (The start_within value auto is OpenAI's service tier, not this.)

Question 11

Do you charge a fee per request?

Accepted Answer

There is no flat per-request fee. A flex request that beats the standard price carries a commission of 20% of what it saves you, so you only ever pay when you actually save; if a request saves nothing, it is free. Standard, priority, and auto routing and all Anthropic models have no fee at all.

Question 12

How do you make money?

Accepted Answer

We charge a commission of 20% of the money a flex request actually saves you versus the standard price. If a request saves you nothing, it is free, and standard, priority, and auto routing and all Anthropic models are always free. Commission accrues per request and is billed monthly.

Question 13

When and how am I charged?

Accepted Answer

Commission accrues per request and is billed monthly, but only once your accrued balance reaches $20; a smaller balance carries forward to the next month. Add a card on file in the dashboard Billing page and we charge it automatically when a bill is due. If a payment is past due, billable flex is paused and priced flex requests return a 402, while free routing (default, priority, auto) and all Anthropic models keep working; update your card to resume.

Question 14

Can I still save on inference if I already have provider credits?

Accepted Answer

Yes. Bring your own API key, which we encrypt at rest, and your requests draw down your existing credits.

Question 15

Do you see my requests or responses?

Accepted Answer

No we don't see your information.

FlexInference: deadline-aware, OpenAI-compatible inference routing at about 47% lower cost. 3ms on a cold start fast.No fallback reliable.Outcome based affordable.

Built to not be annoying to debug at 2am

3 ms routing across 300 cities.

Your key, locked with AES-256-GCM.

You get a real error, not a fake 200.

Typed with pydantic and zod.

Errors your agent can fix itself.

Pay 47% less for 4% more latency.

No new SDK semantics to learn.

Only pay when we save you money.

How to get lower inference costs?

Frequently Asked Questions