The short version
- An LLM is a neural network trained to predict the next token given prior context.
- Modern LLMs have billions to hundreds of billions of parameters and emerged from scaling the transformer architecture.
- Enterprises consume LLMs via API or self-host; the choice is driven by data residency, cost, and governance.
The longer explanation
The training objective
LLMs train on the simplest imaginable objective: given a sequence of tokens, predict the next one. Scale that objective across trillions of tokens of training data and millions of GPU-hours, and the resulting model has internalized grammar, facts, reasoning patterns, and stylistic conventions well enough to respond coherently to prompts it has never seen.
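The objective itself is easy to demonstrate at toy scale. The sketch below is a word-level bigram counter, not a transformer — real LLMs learn the same "predict the next token" task with billions of parameters over subword tokens — but the training signal is the same shape:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which token follows it -- a crude
    stand-in for the next-token-prediction objective."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequently observed next token, or None if unseen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = [
    "the model predicts the next token",
    "the model learns from data",
]
counts = train_bigram(corpus)
print(predict_next(counts, "the"))  # "model" follows "the" most often here
```

Scaling replaces the count table with a neural network and the word split with a learned tokenizer, but the supervised signal never changes.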
Most enterprise LLMs come from a single architectural family: the decoder-only transformer. The architectural details have evolved (sparse attention, mixture-of-experts, long-context extensions) but the core objective has not.
How enterprises consume LLMs
Two paths dominate:
API-hosted: OpenAI, Anthropic, Google, and others expose LLMs as cloud APIs. The upside is zero infrastructure work and instant access to the latest models. The downside is that data leaves the client environment, and the vendor's pricing, rate limits, and roadmap are outside the client's control. For most enterprise generative AI today, this is the starting point.
Self-hosted: Open-weight models (Llama from Meta, Mistral, Qwen, Gemma from Google DeepMind) run on client-owned infrastructure, typically via Ollama, vLLM, or TGI. The trade-off is more operational work in exchange for full data control. Regulated industries — banking, healthcare, government — often require this path for production workloads.
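One practical consequence: the two paths often share a wire format. Self-hosted servers such as vLLM (and Ollama's compatibility layer) expose an OpenAI-compatible `/v1/chat/completions` endpoint, so switching between hosted and self-hosted can be a matter of changing the base URL and model name. A minimal sketch of the shared request body — the model name is a placeholder, not a recommendation:

```python
import json

def chat_request(model, user_message, max_tokens=256):
    """Build an OpenAI-style chat completion request body. The same
    shape is accepted by hosted APIs and OpenAI-compatible
    self-hosted servers; only the endpoint URL and model differ."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }

payload = chat_request("placeholder-model", "Summarize this credit memo.")
print(json.dumps(payload, indent=2))
```

Keeping client code against this shared shape is one way to preserve the option to move a workload between vendors or on-premise without rewriting the integration layer.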
The capabilities that matter in 2026
- Long context. Top models handle 200K-2M token contexts, enabling document-in, analysis-out workflows without chunking.
- Tool use and function calling. LLMs can call external APIs as part of their response, which is the foundation for agentic workflows.
- Structured output. Models can return JSON or other structured formats reliably, enabling integration with downstream systems.
- Multimodal. Top models handle text plus image, and increasingly audio and video.
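Structured output is only as reliable as the validation behind it: production integrations typically parse and type-check model JSON before passing it downstream. A minimal sketch — the field names and schema here are hypothetical, chosen for illustration:

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}

def parse_model_output(raw):
    """Parse JSON returned by a model and fail fast on a missing
    field or wrong type, rather than passing bad data downstream."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data

raw = '{"sentiment": "positive", "confidence": 0.92}'
print(parse_model_output(raw))
```

In practice teams often layer a schema library (or the provider's native structured-output mode) on top, but the fail-fast principle is the same.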
Choosing a model
The engineering decision is not "which is best overall" — it is "which fits this workload". The evaluation axes we use in client engagements:
- Quality on the actual task. Run a scoped eval on the client's real data.
- Latency. Interactive workflows need sub-second response; batch workflows have more slack.
- Cost at projected volume. Frontier model pricing can swing total cost 10x; a right-sized smaller model often wins.
- Data residency and governance. If data cannot leave the environment, the answer is self-hosted.
- Vendor stability. Roadmap, pricing history, deprecation posture.
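The cost axis is worth making concrete with back-of-envelope arithmetic. The prices below are illustrative placeholders, not actual vendor rates, but the shape of the calculation is what matters: per-token pricing multiplied by projected volume.

```python
def monthly_cost(requests_per_month, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m):
    """Back-of-envelope monthly spend given per-million-token prices.
    Prices are illustrative, not actual vendor rates."""
    total_in = requests_per_month * in_tokens
    total_out = requests_per_month * out_tokens
    return (total_in / 1e6) * price_in_per_m + (total_out / 1e6) * price_out_per_m

# Hypothetical workload: 1M requests/month, 1,000 input / 500 output tokens each.
frontier = monthly_cost(1_000_000, 1000, 500, 10.0, 30.0)  # $10 in / $30 out per 1M
small = monthly_cost(1_000_000, 1000, 500, 0.5, 1.5)       # $0.50 in / $1.50 out per 1M
print(f"frontier: ${frontier:,.0f}/mo  smaller model: ${small:,.0f}/mo")
```

At these illustrative rates the gap is 20x per month, which is why "quality on the actual task" and "cost at projected volume" have to be evaluated together: a smaller model that passes the eval can change the economics of the whole workload.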
How Thoughtwave approaches this
We are model-neutral across OpenAI, Anthropic, Google, Meta, Mistral, and Qwen. Our engagements typically run a 2-3 model evaluation on the client's workload in the first two weeks and make the selection with the client based on the axes above. For production workloads with data-residency constraints, we deploy on client infrastructure via Ollama or vLLM — the pattern behind our self-hosted TWSS Commercial Credit AI platform.
For the broader context, see our AI & Generative AI service and the accelerators portfolio.