TL;DR
- Model tokens are 10-20% of total cost in most enterprise AI programs.
- Integration, data and retrieval infrastructure, evaluation, and governance dominate.
- The "cheap per token" framing misleads leaders who use it as the primary budget driver.
- Self-hosting changes the shape — capex instead of opex — but rarely changes the ranking.
The framing that misleads
Enterprise AI conversations too often open with "tokens are cheap now," and budgets get set accordingly. The token line item for a frontier model in 2026 is a fraction of what it was in 2023. That is true, and it is not the relevant fact. Tokens are not where the money goes.
Where the money actually goes
Based on a representative breakdown of the enterprise AI programs we have delivered or observed in production:
1. Integration engineering (25-35% of TCO)
Getting the model into the workflow — CRM integration, inbox integration, downstream system posting, data retrieval pipelines, UI work, permissions, SSO, logging, the boring plumbing. Unlike model inference, this cost does not decrease with time; it is human engineering. For most clients, this is the single largest line item.
2. Data and retrieval infrastructure (15-25%)
Vector database or equivalent, embedding-pipeline infrastructure, knowledge source ingestion and update pipelines, content processing (chunking, OCR, parsing), domain-specific data preparation. On sophisticated RAG deployments this line item rivals integration.
3. Governance and audit (10-20%)
Audit log pipeline, PII and content-safety scanning, approval-gate infrastructure, evaluation pipeline, compliance review. This is the line item that leaders most often underbudget until the first incident or audit, after which it doubles.
4. Evaluation and ongoing tuning (5-15%)
Test sets, regression suites, grounding evaluation, drift monitoring, prompt and model updates. Rarely zero, rarely more than 15%, but always present if the system operates for more than a quarter.
5. Model inference (10-20%)
Tokens. At enterprise volume this is a real number but usually not the top line item. The variance is wide — high-volume, long-context workloads can push this higher; classification-heavy workloads stay lower.
6. Operations and observability (5-10%)
SRE work, cost monitoring, incident response, ongoing optimization. Also the line item that absorbs the cost-optimization savings over time.
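The ranking above can be sketched as a toy cost model. The dollar figures below are purely illustrative assumptions chosen to land inside the stated ranges; they are not client data.

```python
# Toy annual TCO model for an enterprise AI program.
# Every figure is an assumed illustrative cost in USD, not a real budget.
line_items = {
    "integration_engineering": 600_000,
    "data_and_retrieval": 400_000,
    "governance_and_audit": 300_000,
    "evaluation_and_tuning": 200_000,
    "model_inference": 300_000,   # the token line item
    "operations_observability": 150_000,
}

total = sum(line_items.values())
shares = {name: cost / total for name, cost in line_items.items()}

# Print line items largest-first to show the ranking, not the absolutes.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {share:6.1%}")
```

Even with generous token volume, inference lands mid-pack; moving any single assumption within its stated range does not change which line item is on top.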
Self-hosting changes the shape
For self-hosted deployments, model inference moves from opex to capex — GPU hardware, operations team, model weights management. The ranking stays similar: integration and governance still dominate. The break-even math shifts (see the self-hosted AI case), but so do the expectations: self-hosted deployments usually carry a heavier governance weight because the regulatory motivation for self-hosting is the same motivation that demands a strict audit posture.
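The break-even math can be sketched in a few lines. All of the inputs below are placeholder assumptions (hardware price, amortization window, blended token price, volume), not vendor pricing or benchmark data.

```python
# Break-even sketch: API tokens (opex) vs self-hosted GPUs (capex + ops).
# All figures are illustrative assumptions, not quotes or benchmarks.
api_price_per_million_tokens = 5.00      # assumed blended USD price
monthly_volume_millions = 20_000         # assumed 20B tokens/month workload

gpu_capex = 1_200_000                    # assumed cluster purchase cost
amortization_months = 36                 # assumed depreciation window
monthly_ops_cost = 40_000                # assumed SRE + power + hosting

api_monthly = api_price_per_million_tokens * monthly_volume_millions
self_hosted_monthly = gpu_capex / amortization_months + monthly_ops_cost

print(f"API:         ${api_monthly:,.0f}/month")
print(f"Self-hosted: ${self_hosted_monthly:,.0f}/month")

if api_monthly > monthly_ops_cost:
    # Months of avoided API spend needed to recover the capex outlay.
    breakeven_months = gpu_capex / (api_monthly - monthly_ops_cost)
    print(f"Capex recovered after ~{breakeven_months:.1f} months")
```

Note what the sketch does not include: the governance, integration, and evaluation line items are identical on both sides of the comparison, which is exactly why the ranking survives the hosting decision.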
The budgeting pattern that avoids surprises
The pattern we see working:
- Size integration first, model second. Scope the integration work before committing to a model choice. Integration cost is workload-specific; model cost is commodity.
- Budget governance as a first-class cost. Do not treat audit, evaluation, and safety as "we will add later." The "later" cost is 3-5x the "from-the-start" cost.
- Include a model-swap allowance. Plan for at least one major model swap over the life of the system. Vendor landscape changes fast; systems that cannot swap models lose value quickly.
- Track evaluation cost explicitly. Give evaluation its own budget line; an invisible evaluation cost becomes technical debt that eventually surfaces as a trust problem.
- Scenario-plan volume, not tokens. The question "what if volume doubles" is more predictive than "what if token pricing drops." Plan for the volume change; let the pricing trend be tailwind.
What this means for client engagements
Engagements scoped on token cost alone miss the expensive parts. Our proposals allocate explicit line items for integration, data infrastructure, governance, and evaluation, each sized to the specific workload. When clients push back on those lines as "overhead," we ask how they would respond to an audit finding in six months. Usually the conversation finds its way back to the numbers.
For the production patterns behind this, see the TWSS Custom Agents platform case study and the accelerators portfolio.