First POV · worked example

Self-hosted inference for a Singapore retail bank

TCO vs public API on Xinference Enterprise's published $15k/node pricing, plus a 2-week evaluation plan with SMART success criteria.

The deliverable I'd hand a prospect in week one. Hadi Al-Hazim, Singapore.

The number

A MAS-regulated bank can't send production data to any public API, so the question isn't price against the API; it's whether sovereignty costs a premium. Barely: self-hosting on Xinference's $15k/node lands about 38% under the production-API tier (Claude Sonnet class) and within ~2% of the cheapest, pays back in ~16 months against the quality tier, and keeps every token in the bank's VPC. The 38% is their own "40% lower infra cost" claim, reconstructed on real pricing.
Monthly figures at the workload below; the table prices the production rollout (the 2-week POV itself runs on one node). The bank can't send production data to either API; they're the cost yardstick a board benchmarks against.
Public API
GPT-4.1
Public API
Claude Sonnet 4.6
Xinference self-hosted
2 nodes · 8× H100 · on-prem, N+1
Monthly run-rate $15,840 $26,136 $16,200
Cost / 1,000 queries $10.00 $16.50 $10.24
Upfront $0 $0 ~$260k
Self-host break-even vs this API n/a · sovereignty call ~month 16
3-year TCO $570k $941k $584k
Production data egress leaves VPC + jurisdiction leaves VPC + jurisdiction zero; stays on-prem

Assumptions (every input public or labeled; tune with your actuals):

The wedge

A MAS-regulated bank can't route customer records, transaction data or KYC documents to a hosted LLM API; the data leaves its VPC and its jurisdiction. That rules out OpenAI and Anthropic endpoints for any workload touching production data. Self-hosting open models on the bank's own infrastructure is the option that keeps inference inside the perimeter, which is why the table above prices sovereignty instead of asking whether to pursue it.

2-week POV: success criteria the bank co-signs

Scope & environment

  • One node (4× H100) for the eval; production sizing is 2 nodes.
  • One 70B-class open model; one OpenAI-compatible endpoint swap.
  • Week 1 anonymized data; week 2 controlled production intents.
  • Out of scope (staged for rollout): multi-model, fine-tuning, HA.

Who does what

  • Bank infra: provisions the node and network policy.
  • Bank business owner: supplies top-20 intents and the 100-prompt set.
  • Solutions Engineer: stands up the cluster, instruments latency and cost, runs the regression, delivers the readout.

Deployment, verified locally

# OpenAI-compatible inference, one container docker run -p 9997:9997 xprobe/xinference xinference-local --host 0.0.0.0 xinference launch --model-name qwen2.5-instruct --size-in-billions 3 curl http://localhost:9997/v1/chat/completions \ -d '{"model":"qwen2.5-instruct","messages":[{"role":"user","content":"ping"}]}'
Local docker run of xprobe/xinference launching a model and returning an OpenAI-compatible completion via curl.
Stood up on a CPU host in ~30 min; proves stand-up and the API contract. Not a throughput test.
Independent worked example by Hadi Al-Hazim. All pricing public; workload and ops figures labeled and tunable. No bank data used. June 2026.
Sources: 1 OpenAI API pricing · 2 Anthropic pricing · 3 Xinference Enterprise pricing · 4 H100 street price · 5 vLLM throughput