The short version
Cloud LLMs win on ease, access to the latest capabilities, and a zero-infrastructure start. Self-hosted LLMs win on data residency, vendor independence, and cost at high volume. The decision is rarely made on features; it is made on compliance, competitive sensitivity, and operational philosophy. In 2026, the quality gap is narrow enough that most enterprise workloads can ship on either.
Side-by-side
| Dimension | Cloud LLMs | Self-hosted LLMs |
|---|---|---|
| Data residency | Vendor-dependent (DPA, BAA, region) | Full client control |
| Vendor dependency | High (pricing, roadmap, deprecation) | Low (open weights, portable) |
| Startup effort | API key + prompt | GPU setup + Ollama/vLLM + model serving |
| Quality on frontier tasks | Highest available | Close on most workloads; narrower on novel reasoning |
| Latency floor | Network + provider | Local GPU limits |
| Cost at low volume | Cheapest | Fixed GPU cost |
| Cost at high volume | Scales linearly with tokens | Amortized GPU cost dominates |
| Operational burden | Near zero | Real (monitoring, model ops, scaling) |
| Auditability | Provider logs + client logs | Full client trace |
When cloud LLMs are the right choice
- Data can legitimately flow to the vendor under the applicable DPA/BAA and regional controls.
- The workload benefits from access to the latest capabilities (latest Claude, GPT, or Gemini).
- Volume is modest enough that per-token pricing beats amortized GPU cost.
- The team does not want the operational burden of running models.
- Speed to first production use is the dominant factor.
When self-hosted LLMs are the right choice
- Data cannot flow to a third party under any current terms (HIPAA, classified, competitive sensitivity).
- The workload runs at volume where amortized GPU cost beats API tokens.
- Vendor independence is a design principle — pricing, roadmap, or geopolitical risk concerns.
- Latency floor matters and the network round-trip is part of the problem.
- Compliance posture demands full, client-controlled tracing at the model layer rather than vendor-mediated logs.
The 2026 self-hosted stack
The stack is mature enough that a reference architecture has stabilized:
- Models. Llama 3.3 70B, Qwen 2.5 series, Mistral Medium, Gemma 27B. Domain-tuned variants for specific categories.
- Serving. Ollama for simplicity; vLLM or TGI for high-throughput production; TensorRT-LLM for the lowest latency.
- Hardware. H100, H200, or MI300 depending on preference and availability. Multi-GPU for larger models.
- Orchestration. FastAPI or a similar service layer for agent orchestration in front of the serving layer; MCP as the tool protocol.
- Observability. Standard APM plus model-specific metrics (token counts, latency distribution, evaluation scores).
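One practical consequence of this stack: both vLLM and Ollama expose an OpenAI-compatible HTTP endpoint, so the orchestration layer can talk to self-hosted serving with a plain HTTP client and stay portable across backends. A minimal sketch, assuming a local deployment (the base URL, model name, and prompt below are illustrative, not part of any specific setup):

```python
import json
from urllib import request


def build_chat_request(base_url: str, model: str, prompt: str,
                       temperature: float = 0.2) -> tuple[str, bytes]:
    """Build an OpenAI-compatible chat completion request.

    Works against the /v1 endpoints that vLLM and Ollama expose;
    base_url and model are deployment-specific.
    """
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }).encode("utf-8")
    return url, body


def send(url: str, body: bytes) -> dict:
    """POST the request and decode the JSON reply (requires a live server)."""
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)


# Building the request needs no network; send() does.
url, body = build_chat_request(
    "http://localhost:8000", "llama-3.3-70b", "Summarize the Q3 risk memo.")
```

Keeping request construction separate from transport makes it easy to swap serving backends, or to point the same orchestration code at a cloud endpoint in a hybrid posture.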
The hybrid posture
Many enterprises end up hybrid:
- Self-hosted for sensitive workloads (regulated content, proprietary research, customer data under strict contracts).
- Cloud for general-purpose productivity workloads, code assistance, and non-sensitive analytics.
- Router in front that classifies each request and sends it to the appropriate model based on content sensitivity.
Hybrid captures the cost and ease of cloud where appropriate while keeping sensitive data on self-hosted infrastructure. The router adds complexity but solves a real problem.
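The router itself can start as a rule-based classifier over request metadata. A minimal sketch, where the sensitivity tags, thresholds, and model names are illustrative assumptions (production routers typically replace the rules with a trained classifier):

```python
from dataclasses import dataclass

# Tags that force a request onto self-hosted infrastructure (illustrative).
SENSITIVE_TAGS = {"phi", "pii", "regulated", "proprietary-research"}


@dataclass
class Route:
    backend: str   # "self-hosted" or "cloud"
    model: str     # deployment-specific model identifier


def route_request(tags: set[str], est_tokens: int,
                  high_volume_threshold: int = 50_000) -> Route:
    """Send sensitive content to self-hosted models; everything else to cloud.

    Very large requests also stay self-hosted, where marginal token cost
    is near zero once the GPUs are paid for.
    """
    if tags & SENSITIVE_TAGS or est_tokens > high_volume_threshold:
        return Route("self-hosted", "llama-3.3-70b")
    return Route("cloud", "claude-sonnet")


print(route_request({"pii"}, 1_000).backend)        # self-hosted
print(route_request({"marketing"}, 1_000).backend)  # cloud
```

The important design property is that routing happens before any content leaves the client boundary, so a misclassified-but-sensitive request fails safe only if the tags are applied upstream; that is where most of the real engineering effort goes.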
The cost math worth doing
For a specific workload:
- Estimate monthly token volume (both input and output).
- Multiply by the cloud vendor's rate for the chosen model. That is monthly cloud cost.
- Size the GPU required to serve the same workload at peak. Multiply by monthly amortized GPU cost plus ops.
- The break-even point is where the two numbers meet. Below that volume, cloud wins on pure cost. Above, self-hosted wins.
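The arithmetic above can be sketched directly. The rates and GPU costs below are placeholders, not vendor quotes; substitute the chosen model's actual per-million-token pricing and your own amortized hardware and ops numbers:

```python
def monthly_cloud_cost(tokens_in: float, tokens_out: float,
                       rate_in_per_m: float, rate_out_per_m: float) -> float:
    """Cloud cost scales linearly with token volume (rates per 1M tokens)."""
    return tokens_in / 1e6 * rate_in_per_m + tokens_out / 1e6 * rate_out_per_m


def monthly_self_hosted_cost(num_gpus: int, gpu_amortized: float,
                             ops_overhead: float) -> float:
    """Self-hosted cost is roughly fixed: amortized hardware plus ops."""
    return num_gpus * gpu_amortized + ops_overhead


# Illustrative numbers only: 500M tokens/month, placeholder rates and costs.
cloud = monthly_cloud_cost(tokens_in=400e6, tokens_out=100e6,
                           rate_in_per_m=3.0, rate_out_per_m=15.0)
self_hosted = monthly_self_hosted_cost(num_gpus=2, gpu_amortized=4_000,
                                       ops_overhead=6_000)
print(f"cloud ${cloud:,.0f}/mo vs self-hosted ${self_hosted:,.0f}/mo")
```

With these placeholder inputs the cloud number comes out lower, which is the expected result for a single moderate-volume application; rerunning the same function over an enterprise's aggregate token volume is what flips the comparison at the portfolio level.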
For most enterprise workloads in 2026, the break-even is at a volume well above what any single application produces but well below what an enterprise's total AI spend produces. That is why portfolio-level decisions often favor self-hosting even when individual workloads would not.
How Thoughtwave approaches this
We are model-neutral. For regulated clients where self-hosting is the required answer, see our TWSS Commercial Credit AI case study — a fully self-hosted 3-model ensemble in production. For cloud-forward clients, we run OpenAI, Anthropic, and Google deployments at scale.
For deeper context, see the AI & Generative AI service and the broader accelerators portfolio.