First POV · worked example

Self-hosted inference for a Singapore retail bank

TCO vs public API on Xinference Enterprise's published $15k/node pricing, plus a 2-week evaluation plan with SMART success criteria.

The deliverable I'd hand a prospect in week one. Hadi Al-Hazim, Singapore.

The number

A MAS-regulated bank can't send production data to any public API, so the question isn't price against the API; it's whether sovereignty costs a premium. Barely: self-hosting on Xinference's $15k/node lands about 38% under the production-API tier (Claude Sonnet class) and within ~2% of the cheapest, pays back in ~16 months against the quality tier, and keeps every token in the bank's VPC. The 38% is their own "40% lower infra cost" claim, reconstructed on real pricing.

Monthly figures at the workload below; the table prices the production rollout (the 2-week POV itself runs on one node). The bank can't send production data to either API; they're the cost yardstick a board benchmarks against.
	Public API GPT-4.1	Public API Claude Sonnet 4.6	Xinference self-hosted 2 nodes · 8× H100 · on-prem, N+1
Monthly run-rate	$15,840	$26,136	$16,200
Cost / 1,000 queries	$10.00	$16.50	$10.24
Upfront	$0	$0	~$260k
Self-host break-even vs this API	n/a · sovereignty call	~month 16	—
3-year TCO	$570k	$941k	$584k
Production data egress	leaves VPC + jurisdiction	leaves VPC + jurisdiction	zero; stays on-prem

Assumptions (every input public or labeled; tune with your actuals):

Workload: 6,000 internal users (RMs, contact-centre, ops) × 12 queries/working day × 22 days = 1.58M queries/mo; per query ≈ 3,000 input + 500 output tokens.
API price: GPT-4.1 $2 / $8 per M tokens¹; Claude Sonnet 4.6 $3 / $15². List, June 2026.
Sizing: output load ≈ 792M tokens/mo ≈ 1,000 tok/s across business hours, ~3,000 at peak. A 4× H100 node sustains ~3,245 output tok/s on a 70B-class model (public vLLM benchmark⁵); one node carries peak on a thin ~8% margin, so a second is N+1 failover that also absorbs p95 bursts while both are active, degrading to single-node capacity on failure. Production = 2 nodes.
Self-hosted cost: ~$260k upfront (8× H100 @ ~$30k⁴ + integration; capex over 36 mo); then ~$9,000/mo (~$6,500 power, cooling, 0.5 FTE ops + Xinference Enterprise $15k/node/yr³ for 2 nodes, amortized). Fully loaded ≈ $16,200/mo.
Crossover moves with volume and utilization: below this workload the cheapest API wins on price, so self-hosting here is a sovereignty decision that also beats the quality tier; fill idle GPU time with overnight batch and the per-token cost drops further.

The wedge

A MAS-regulated bank can't route customer records, transaction data or KYC documents to a hosted LLM API; the data leaves its VPC and its jurisdiction. That rules out OpenAI and Anthropic endpoints for any workload touching production data. Self-hosting open models on the bank's own infrastructure is the option that keeps inference inside the perimeter, which is why the table above prices sovereignty instead of asking whether to pursue it.

2-week POV: success criteria the bank co-signs

Fully-loaded cost/1k queries ≤ $11, below the production-API tier ($16.50 on Claude Sonnet) at the same token shape, measured on the bank's own hardware bill.
p95 latency ≤ 2.5s at 50 concurrent users on a 70B-class open model served through Xinference's OpenAI-compatible endpoint.
Zero data egress: default-deny network policy on the inference node; no outbound calls, evidenced in flow logs.
Functional parity ≥ 95% on a 100-prompt regression set drawn from the bank's top intents, graded against the incumbent API's answers.

Scope & environment

One node (4× H100) for the eval; production sizing is 2 nodes.
One 70B-class open model; one OpenAI-compatible endpoint swap.
Week 1 anonymized data; week 2 controlled production intents.
Out of scope (staged for rollout): multi-model, fine-tuning, HA.

Who does what

Bank infra: provisions the node and network policy.
Bank business owner: supplies top-20 intents and the 100-prompt set.
Solutions Engineer: stands up the cluster, instruments latency and cost, runs the regression, delivers the readout.

Deployment, verified locally

Local docker run of xprobe/xinference launching a model and returning an OpenAI-compatible completion via curl. — Stood up on a CPU host in ~30 min; proves stand-up and the API contract. Not a throughput test.

Independent worked example by Hadi Al-Hazim. All pricing public; workload and ops figures labeled and tunable. No bank data used. June 2026.
Sources: ¹ OpenAI API pricing · ² Anthropic pricing · ³ Xinference Enterprise pricing · ⁴ H100 street price · ⁵ vLLM throughput