TCO vs public API on Xinference Enterprise's published $15k/node pricing, plus a 2-week evaluation plan with SMART success criteria.
The deliverable I'd hand a prospect in week one. Hadi Al-Hazim, Singapore.
The number
A MAS-regulated bank can't send production data to any public API, so the question isn't price against the API; it's whether sovereignty costs a premium. Barely: self-hosting on Xinference's $15k/node lands about 38% under the production-API tier (Claude Sonnet class) and within ~2% of the cheapest, pays back in ~16 months against the quality tier, and keeps every token in the bank's VPC. The 38% is their own "40% lower infra cost" claim, reconstructed on real pricing.
Monthly figures at the workload below; the table prices the production rollout (the 2-week POV itself runs on one node). The bank can't send production data to either API; they're the cost yardstick a board benchmarks against.
Assumptions (every input public or labeled; tune with your actuals):
Workload: 6,000 internal users (RMs, contact-centre, ops) × 12 queries/working day × 22 days = 1.58M queries/mo; per query ≈ 3,000 input + 500 output tokens.
API price: GPT-4.1 $2 / $8 per M tokens1; Claude Sonnet 4.6 $3 / $152. List, June 2026.
Sizing: output load ≈ 792M tokens/mo ≈ 1,000 tok/s across business hours, ~3,000 at peak. A 4× H100 node sustains ~3,245 output tok/s on a 70B-class model (public vLLM benchmark5); one node carries peak on a thin ~8% margin, so a second is N+1 failover that also absorbs p95 bursts while both are active, degrading to single-node capacity on failure. Production = 2 nodes.
Crossover moves with volume and utilization: below this workload the cheapest API wins on price, so self-hosting here is a sovereignty decision that also beats the quality tier; fill idle GPU time with overnight batch and the per-token cost drops further.
The wedge
A MAS-regulated bank can't route customer records, transaction data or KYC documents to a hosted LLM API; the data leaves its VPC and its jurisdiction. That rules out OpenAI and Anthropic endpoints for any workload touching production data. Self-hosting open models on the bank's own infrastructure is the option that keeps inference inside the perimeter, which is why the table above prices sovereignty instead of asking whether to pursue it.
2-week POV: success criteria the bank co-signs
Fully-loaded cost/1k queries ≤ $11, below the production-API tier ($16.50 on Claude Sonnet) at the same token shape, measured on the bank's own hardware bill.
p95 latency ≤ 2.5s at 50 concurrent users on a 70B-class open model served through Xinference's OpenAI-compatible endpoint.
Zero data egress: default-deny network policy on the inference node; no outbound calls, evidenced in flow logs.
Functional parity ≥ 95% on a 100-prompt regression set drawn from the bank's top intents, graded against the incumbent API's answers.
Scope & environment
One node (4× H100) for the eval; production sizing is 2 nodes.
One 70B-class open model; one OpenAI-compatible endpoint swap.
Week 1 anonymized data; week 2 controlled production intents.
Out of scope (staged for rollout): multi-model, fine-tuning, HA.
Who does what
Bank infra: provisions the node and network policy.
Bank business owner: supplies top-20 intents and the 100-prompt set.
Solutions Engineer: stands up the cluster, instruments latency and cost, runs the regression, delivers the readout.