Understanding Token Pricing Mechanics
A token is roughly equivalent to 0.75 English words, or about four characters. Most AI models process language as tokens rather than words or characters, which makes token count the natural unit of consumption measurement. For generative models, token pricing has two components: input tokens (the text you send to the model, including your prompt, context, and any retrieved documents) and output tokens (the text the model generates in response). Input tokens are typically cheaper; output tokens cost 3–5× more, reflecting the computational cost of generation.
Understanding this distinction is critical for cost modelling. A use case that sends large documents for summarisation (high input, low output) has a very different cost profile from a use case that generates long-form content (moderate input, high output). The ratio of input to output tokens in your workload is one of the most important variables in cost forecasting. This article is part of our broader enterprise AI procurement guide.
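The input/output cost asymmetry can be made concrete with a short calculation. The sketch below uses illustrative per-token rates (not any vendor's actual pricing, and assuming a 3× output premium) to compare the two workload profiles described above:

```python
# Sketch: comparing the cost profile of two workloads under
# illustrative per-token rates (not any vendor's actual pricing).
INPUT_RATE = 5.00 / 1_000_000    # $ per input token (assumed)
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token (assumed 3x input)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single query in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Summarisation: high input, low output
summarise = query_cost(input_tokens=8_000, output_tokens=500)
# Long-form generation: moderate input, high output
generate = query_cost(input_tokens=1_000, output_tokens=4_000)

print(f"summarise: ${summarise:.4f}, generate: ${generate:.4f}")
```

Note that the generation workload costs more per query despite sending fewer total tokens — the output premium dominates, which is why the input/output ratio matters so much in forecasting.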
Consumption Modelling: How to Forecast AI Costs Accurately
The most common cause of AI budget overruns is inaccurate consumption modelling at the procurement stage. Most organisations either do not model consumption at all (accepting vendor estimates) or model it with optimistic assumptions that do not survive contact with production workloads. A rigorous consumption model has four components.
Step 1: Identify and Categorise Use Cases
List every AI use case you intend to run in production, categorised by: expected query volume per day, average input token count per query, average output token count per query, model tier required (large/small/reasoning), and growth trajectory over 12 months. This sounds obvious, but most procurement processes conflate multiple very different workloads into a single consumption estimate. A classification use case running 100,000 queries per day at 200 input / 50 output tokens has completely different economics from a document generation use case running 1,000 queries per day at 4,000 input / 2,000 output tokens.
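The two example workloads above illustrate why per-use-case modelling matters. This sketch prices each one using the same illustrative rates as before (the per-token dollar figures are assumptions, not vendor pricing):

```python
# Sketch: per-use-case consumption model, using the two example
# workloads from the text and illustrative per-token rates.
from dataclasses import dataclass

INPUT_RATE = 5.00 / 1_000_000    # $ per input token (assumed)
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token (assumed)

@dataclass
class UseCase:
    name: str
    queries_per_day: int
    input_tokens: int   # average per query
    output_tokens: int  # average per query

    def daily_cost(self) -> float:
        per_query = (self.input_tokens * INPUT_RATE
                     + self.output_tokens * OUTPUT_RATE)
        return self.queries_per_day * per_query

classification = UseCase("classification", 100_000, 200, 50)
doc_generation = UseCase("doc generation", 1_000, 4_000, 2_000)

for uc in (classification, doc_generation):
    print(f"{uc.name}: ${uc.daily_cost():,.2f}/day")
```

Despite running 100× more queries, the classification workload's daily cost is only a few times that of the generation workload — a single blended estimate would misprice both.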
Step 2: Run Representative Workloads
Do not accept vendor-provided consumption estimates or use synthetic benchmarks. Run representative samples of your actual production data through the API and measure actual token consumption. Prompt engineering choices, document preprocessing methods, and output format constraints all affect token counts significantly. The only reliable consumption data is data from your actual workloads.
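A minimal measurement harness for this step might look like the sketch below. The `count_tokens` function here is a crude stand-in using the ~4-characters-per-token rule of thumb; in practice you would substitute the vendor's actual tokenizer (such as OpenAI's tiktoken library) or read the usage fields the API returns with each response:

```python
# Sketch: aggregating token statistics over real (prompt, response)
# pairs. count_tokens is a placeholder heuristic — replace it with
# the vendor's tokenizer for accurate numbers.
from statistics import mean

def count_tokens(text: str) -> int:
    # Crude ~4-characters-per-token approximation; a stand-in only.
    return max(1, len(text) // 4)

def measure_workload(samples: list[tuple[str, str]]) -> dict:
    """samples: (prompt_sent, response_received) pairs from real runs."""
    inputs = [count_tokens(p) for p, _ in samples]
    outputs = [count_tokens(r) for _, r in samples]
    return {
        "avg_input_tokens": mean(inputs),
        "avg_output_tokens": mean(outputs),
        "io_ratio": mean(inputs) / mean(outputs),
    }

stats = measure_workload([
    ("Summarise this 2,000-word report ..." * 50, "Short summary."),
    ("Summarise this contract ..." * 60, "Brief abstract."),
])
print(stats)
```

Run this over a representative sample of production data, per use case, and feed the averages back into the Step 1 categorisation.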
Step 3: Apply Growth Assumptions Conservatively
AI adoption typically follows an S-curve — slow initial uptake, rapid growth once early deployments prove successful, then a plateau. Most enterprises over-forecast the rapid growth phase when initially procuring AI capacity. Apply conservative growth assumptions (20–30% annual growth for a mature workload, not 100%+) and reserve the ability to increase commitments at pre-agreed rates if growth exceeds forecast. Do not commit to growth-adjusted capacity in Year 1 of an AI deployment.
Step 4: Add Operational Buffers
Production workloads experience variance that laboratory testing does not capture — traffic spikes, retries on failures, context window expansions, and agentic workflows that generate more tokens than expected. Add a 20–25% operational buffer to your base forecast before setting committed spend levels. This buffer is your protection against overage charges, which in AI contracts are often billed at full list rate — eliminating the value of the commitment discount entirely.
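Steps 3 and 4 combine into a single forecast calculation. The sketch below compounds a conservative growth assumption month by month and then applies the operational buffer; the $50K/month baseline is an illustrative figure:

```python
# Sketch: baseline -> annual forecast, using the conservative growth
# (25%) and operational buffer (20%) figures suggested in the text.
# The baseline dollar figure is illustrative.
def annual_forecast(baseline_monthly: float,
                    annual_growth: float = 0.25,
                    ops_buffer: float = 0.20) -> float:
    # Convert annual growth to an equivalent compounding monthly rate.
    monthly_growth = (1 + annual_growth) ** (1 / 12) - 1
    total = sum(baseline_monthly * (1 + monthly_growth) ** m
                for m in range(12))
    return total * (1 + ops_buffer)

print(f"${annual_forecast(50_000):,.0f}")  # from a $50K/month baseline
```

The buffered figure is your forecast; your *committed* spend should then sit below it, per the commitment-structuring guidance later in this article.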
Model Selection: The Biggest Cost Lever
Model selection is the single most impactful cost lever in enterprise AI procurement — typically offering greater savings potential than any commercial negotiation tactic. The price difference between the largest and smallest models from the same vendor is 10–20×. For many enterprise use cases, the smaller, cheaper model is entirely sufficient.
| Use Case Type | Suitable Model Tier | Cost vs Flagship | Quality Trade-off |
|---|---|---|---|
| Classification / routing | Small (GPT-4o-mini, Claude Haiku) | ~5% of flagship cost | Minimal for well-defined categories |
| Extraction / structured output | Small–Medium | 5–20% of flagship cost | Minor quality variance with good prompting |
| Summarisation | Medium (GPT-4o, Claude Sonnet) | 20–40% of flagship cost | Moderate — large models outperform on nuance |
| Content generation | Large (GPT-4o, Claude Sonnet/Opus) | 50–100% of flagship cost | Significant — quality matters for customer-facing output |
| Complex reasoning / analysis | Reasoning (o3, Claude 3.7) | 200–400% of standard cost | Necessary — only reasoning models handle multi-step logic |
Implement Model Routing
Model routing — automatically selecting the appropriate model tier for each query based on complexity — can reduce AI infrastructure costs by 30–50% without degrading user experience. A routing layer classifies incoming queries by complexity and routes simple requests to cheaper models, reserving large or reasoning models for queries that genuinely require them. This is a technical architecture decision that should be made alongside (not after) commercial negotiations, because it significantly affects the volume commitment you need at each tier.
Negotiating Volume Discounts on Token Pricing
Token pricing is negotiable at meaningful volume levels. The negotiation dynamics differ by vendor and by commitment structure, but the general framework is consistent across the market.
Commit-for-Discount: The Basic Structure
All major AI API providers offer volume discounts in exchange for committed annual spend. The typical discount schedule runs from 10–15% for $100K annual commitments to 30–40% for $1M+ annual commitments. Beyond $5M annually, custom pricing is available. The discount is applied to per-token rates — so a 30% discount on GPT-4o reduces your effective input token rate from $5/M to $3.50/M. At scale, this represents tens or hundreds of thousands of dollars in annual savings.
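The worked figure from the text — a 30% discount on a $5/M input rate — and the savings it implies at scale look like this (the annual token volume is an assumed example):

```python
# The discount arithmetic from the text: 30% off an illustrative
# $5/M input-token rate, and annual savings at an assumed volume.
list_rate = 5.00            # $ per million input tokens
discount = 0.30
committed_rate = list_rate * (1 - discount)
print(f"${committed_rate:.2f}/M")  # effective committed rate

annual_tokens_m = 100_000   # 100B input tokens/year (assumed volume)
savings = (list_rate - committed_rate) * annual_tokens_m
print(f"${savings:,.0f} annual savings")
```

At 100B input tokens a year, the 30% discount alone is worth $150K — before any equivalent discount on the (more expensive) output tokens.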
Structure the Commitment Correctly
The commitment level should be your base forecast minus a 15–20% buffer (representing the gap between "realistic" and "conservative" forecast). You want to be virtually certain of meeting the commitment to avoid wasting pre-paid credits. Structure the commercial arrangement with: a committed annual minimum (for discount), pre-agreed rates for consumption above the committed level (at a modest premium to the committed rate — not list), and quarterly rollover of up to one quarter's unused committed spend. Without rollover provisions, committed spend that is not consumed is simply lost.
Negotiating Insight: AI vendors often present tiered pricing as "the rates" — as if the tier you fall into automatically determines your rate. In reality, the volume tier thresholds themselves are negotiable for sufficiently large commitments. If your forecast puts you just below a tier threshold, negotiate to access the higher-tier rate at your actual commitment level. Vendors frequently accommodate this for accounts they want to win, particularly when competitive alternatives are credible.
Rate Lock: Protecting Against Price Increases
AI token prices have been declining across the market — but this trend will not continue indefinitely as the market matures. More importantly, individual model versions can see price increases if vendor economics require it. Negotiate rate lock provisions: a contractual guarantee that your per-token rates for named model versions will not increase during the contract term. This is the AI equivalent of the price escalation cap you would negotiate in any traditional software agreement. See our software price escalation cap guide for the broader framework.
Overage Protection
Define the overage rate explicitly in your contract. Overage at full list rate is commercially indefensible for a customer who has made a significant committed spend — it means that consumption above your commitment is billed at a higher rate than your baseline consumption. Negotiate overage at committed-rate + a modest premium (typically 10–15%). This is the same principle as cloud reservation overage management.
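The commercial difference is easy to quantify. Continuing the illustrative $5/M list rate and 30% discount from earlier, with an assumed overage volume:

```python
# Sketch: list-rate overage vs committed-rate-plus-premium overage.
# Rates continue the illustrative $5/M list, 30% discount example;
# the overage volume is an assumption.
list_rate = 5.00                  # $/M tokens
committed_rate = 3.50             # after 30% discount
overage_tokens_m = 20_000         # 20B tokens above commitment (assumed)

at_list = overage_tokens_m * list_rate
at_committed_plus = overage_tokens_m * committed_rate * 1.15  # +15% premium

print(f"list-rate overage:     ${at_list:,.0f}")
print(f"committed+15% overage: ${at_committed_plus:,.0f}")
```

In this scenario the list-rate clause costs roughly $19,500 more — and, perversely, prices your marginal consumption above your baseline consumption.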
Prompt Engineering as a Cost Reduction Strategy
Beyond commercial negotiation, prompt engineering is a direct lever on AI costs. Prompts that are unnecessarily verbose, include excessive context, or fail to constrain output length waste tokens and inflate costs. Systematic prompt optimisation — reviewed by your AI engineering team before production deployment — typically reduces token consumption by 15–25% without affecting output quality.
Specific optimisation techniques include: system prompt compression (remove redundant instructions), context window management (retrieve only relevant document chunks, not entire documents), output format constraints (specify JSON or structured formats to reduce verbose prose), and few-shot example optimisation (use representative examples that require minimal tokens while demonstrating the required output format). These are engineering tasks, but they have direct commercial impact — and the savings are permanent, unlike a one-time discount negotiation.
Caching and Cost Reduction Architecture
Semantic caching — storing and reusing AI responses for identical or near-identical queries — can reduce billable token consumption by 30–60% for use cases with significant query repetition (customer service, FAQ answering, knowledge base retrieval). Several AI vendors now offer native prompt caching capabilities that bill repeated system prompts at reduced rates. OpenAI's prompt caching, for example, reduces repeated context costs significantly. Evaluate vendor caching capabilities as part of your platform selection and consumption modelling — the cost impact can be material enough to affect your vendor preference.
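The core mechanism can be sketched minimally. A true semantic cache matches *near*-identical queries via embedding similarity; the sketch below only normalises whitespace and case and matches exactly, which already captures verbatim repeats (the `fake_model` callable stands in for a billable API call):

```python
# Minimal sketch of a response cache for repeated queries. Real
# semantic caches match near-identical queries via embedding
# similarity; this version only normalises and matches exactly.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get_or_call(self, query: str, model_call) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1            # no billable tokens consumed
            return self._store[key]
        self.misses += 1
        response = model_call(query)  # billable API call
        self._store[key] = response
        return response

cache = ResponseCache()
fake_model = lambda q: f"answer to: {q}"
cache.get_or_call("What is your refund policy?", fake_model)
cache.get_or_call("what is  your refund policy?", fake_model)  # cache hit
print(cache.hits, cache.misses)  # 1 hit, 1 miss
```

The hit rate on your actual query logs is the number to measure: it translates directly into the 30–60% billable-token reduction cited above for repetition-heavy use cases.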