Token Counters Compared: GPT-5, Claude Opus 4.7, and Gemini 3.1 — How They Tokenize Differently
The first time I shipped a feature backed by an LLM, I budgeted for $40/month. The first invoice was $312. The bug wasn’t in my code — it was in my mental model. I had benchmarked the prompt with OpenAI’s tokenizer, then quietly switched the production model to Claude two days before launch because the answers were better. Same prompt, same users, same volume — but Claude’s tokenizer counted my system prompt at 18% more tokens, and the output rate was 5× the input rate. The math compounded fast.
Tokens are the unit your AI bill is denominated in, and every model family counts them differently. Same English sentence, three different numbers. If you’re calling these APIs in production, knowing how each tokenizer behaves is the difference between a $40 line item and a $300 surprise.
This post compares the three tokenizer families that matter in May 2026 — OpenAI’s o200k_base, Anthropic’s tokenizer, and Google’s SentencePiece — with real examples, edge cases, and the practical implications for cost and context windows. At the end I’ll point you at a free LLM Token Counter that does the side-by-side calculation for you.
What is a token, really?
Before we compare tokenizers, a quick refresher on what they actually do. LLMs don’t read characters or words. They read tokens, which are sub-word fragments produced by Byte-Pair Encoding (BPE) or a related algorithm such as WordPiece or the unigram model implemented in the SentencePiece library.
The algorithm starts with raw bytes and iteratively merges the most-frequent pairs in the training corpus until it has a vocabulary of typically 50k–250k tokens. The result is that:
- Common English words are usually one token: the, and, building.
- Rare words split into pieces: tokenization → token + ization.
- Code punctuation often becomes its own token: {, }, ;, =>.
- Whitespace is usually attached to the following word: " the" is one token.
- Non-Latin scripts split per-character or even per-byte.
This matters because tokens are not interchangeable units across vendors. Each model family trained its own tokenizer on its own data mix. The same text produces different counts and even different segmentation strategies.
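The merge loop described above can be sketched in a few lines of Python. This is a toy, character-level version for illustration only: production tokenizers start from bytes, apply pre-tokenization rules, and learn on the order of 100k or more merges. The function names (`train_bpe`, `most_frequent_pair`, `merge_pair`) are made up for this sketch and do not come from any library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first appearance in the corpus.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Toy assumption: start from characters, not bytes.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single symbol, while "lower" and "lowest" still carry leftover pieces. That is the same effect as common words costing one token and rare words splitting into several.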
If the BPE explanation made you want a refresher on encoding fundamentals, the Complete Guide to Encoding & Decoding on this site walks through related concepts.
The three tokenizers in 2026
OpenAI — cl100k_base and o200k_base
OpenAI’s current production tokenizer for GPT-4o and GPT-5 is o200k_base, a successor to cl100k_base (used by GPT-3.5 and GPT-4). The vocabulary roughly doubled from ~100k to ~200k tokens, which means better compression on multilingual and code inputs — fewer tokens to express the same content.
You can reproduce OpenAI’s exact counts locally using the open-source tiktoken library:
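import tiktoken

# Load the encoding the text above names for GPT-4o and GPT-5
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Hello, world! This is a test.")
print(len(tokens))  # token count for this string under o200k_base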
OpenAI publishes the encoding reference; for English prose the rough rule is 1 token ≈ 4 characters ≈ 0.75 words.
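As a rough sketch, the rule of thumb above turns into two one-line estimators. The function names are hypothetical, and the 4.0 and 0.75 defaults are the approximations quoted above, not values read from any tokenizer.

```python
import math


def estimate_tokens_from_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate tokens using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(text: str, words_per_token: float = 0.75) -> int:
    """Estimate tokens using the ~0.75 words per token rule of thumb."""
    return math.ceil(len(text.split()) / words_per_token)


sentence = "The quick brown fox jumps over the lazy dog twice."
print(estimate_tokens_from_chars(sentence))  # 13
print(estimate_tokens_from_words(sentence))  # 14
```

Both estimates overshoot a little on a short, common-word sentence like this one, which is the safe direction for budgeting. Treat them as ballpark figures and use the real tokenizer for anything billing-critical.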
Anthropic — claude-tokenizer
Anthropic uses its own custom tokenizer that, in our testing, produces ~5–10% more tokens than o200k_base on English prose and similar counts on code. As of 2026 Anthropic exposes token counting through the count_tokens endpoint on the Messages API rather than shipping an offline library — a deliberate choice that lets them update tokenization without breaking client code.
For Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5, you’d call:
curl https://api.anthropic.com/v1/messages/count_tokens \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-opus-4-7", "messages": [{"role":"user","content":"Hello"}]}'
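If you would rather make the same call from Python without adding a dependency, a minimal sketch with the standard library looks like this. The helper names are hypothetical, the model ID is the one used in the curl example, and the code assumes the endpoint returns a JSON body with an `input_tokens` field.

```python
import json
import urllib.request

COUNT_TOKENS_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(api_key: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a count_tokens request for a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": text}]}
    return urllib.request.Request(
        COUNT_TOKENS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def count_claude_tokens(api_key: str, model: str, text: str) -> int:
    """Send the request and return the reported input token count."""
    request = build_count_tokens_request(api_key, model, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    # Assumption: the response body contains an "input_tokens" field.
    return body["input_tokens"]
```

Keeping request construction separate from the network call makes the payload easy to unit test without an API key.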
Practical implication: Claude’s denser tokenization means the same prompt costs slightly more on Claude than on GPT at equivalent per-token rates. With Claude Opus 4.7 priced at $15/M input vs. GPT-5’s $5/M, the gap widens further.
Google — SentencePiece (Gemini)
Gemini 1.5 and Gemini 3.1 Pro use a SentencePiece tokenizer. SentencePiece treats the input as a raw stream of Unicode characters, whitespace included, with no language-specific pre-tokenization, which gives it more uniform behavior across scripts. The vocabulary is roughly 250k tokens, somewhat larger than o200k_base’s ~200k.
You can count Gemini tokens via the official SDK:
import google.generativeai as genai
model = genai.GenerativeModel('gemini-3.1-pro')
response = model.count_tokens("Hello, world!")
print(response.total_tokens)
In our experience Gemini’s counts land between OpenAI and Anthropic for English prose, slightly under OpenAI for code, and noticeably more efficient for Asian languages.
Same prompt, three different numbers
Here’s the same 50-character English sentence run through each tokenizer:
“The quick brown fox jumps over the lazy dog twice.”
| Tokenizer | Token count | Chars/token |
|---|---|---|
| OpenAI o200k_base (GPT-4o, GPT-5) | 11 | 4.5 |
| Anthropic Claude (Opus/Sonnet/Haiku) | 12 | 4.2 |
| Google SentencePiece (Gemini 3.1 Pro) | 11 | 4.5 |
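A one-token difference looks trivial until you scale it. One extra token in eleven is roughly 9% more tokens. On a 50k-token system prompt that is about 4,500 extra tokens per request, and at 100k requests per day it adds up to roughly 450M extra input tokens per day. At Claude Opus 4.7’s $15/M input rate, that is about $6,750 every single day in input tokens you didn’t budget for.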
Now look at what happens with code:
def fibonacci(n: int) -> int:
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
| Tokenizer | Token count |
|---|---|
| OpenAI o200k_base | 32 |
| Anthropic Claude | 35 |
| Google SentencePiece | 33 |
Code is denser in tokens because punctuation, indentation, and snake_case names don’t merge into common BPE pairs. Expect 25–35% more tokens per character in code compared to English prose.
For non-Latin scripts the ratios change again. The Japanese sentence “東京の天気は晴れです” (10 characters) tokenizes as:
| Tokenizer | Token count |
|---|---|
| OpenAI o200k_base | 8 |
| Anthropic Claude | 10 |
| Google SentencePiece | 6 |
Gemini’s SentencePiece — designed without language-specific pre-tokenization — wins comfortably on Asian text. If your product is multilingual, this can swing your bill by 30%+ before you change a single character of your prompts.
Why the differences matter
Tokenizer differences cascade into three real concerns:
1. Cost
You pay per token, not per character. A 5–15% tokenization difference between vendors shows up as 5–15% on your invoice — assuming everything else is equal. Combined with vendor pricing differences, switching from GPT-4o ($2.50/M input) to Claude Opus 4.7 ($15/M input) is a 6× input-cost jump on the same prompt, before tokenization adjustments.
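The price ratio and the tokenization overhead multiply rather than add, which is easy to miss when eyeballing a model swap. Here is a small sketch of that arithmetic; the function name is hypothetical, and the 10% inflation figure is just an example from the 5–15% range above.

```python
def relative_input_cost(old_rate_per_m: float, new_rate_per_m: float,
                        token_inflation: float = 0.0) -> float:
    """Factor by which input cost changes when switching models.

    token_inflation is the fractional increase in token count on the new
    tokenizer (0.10 means the same text costs 10% more tokens).
    """
    return (new_rate_per_m / old_rate_per_m) * (1.0 + token_inflation)


# GPT-4o ($2.50/M) to Claude Opus 4.7 ($15/M), with 10% more tokens on the new model
factor = relative_input_cost(2.50, 15.00, token_inflation=0.10)
print(f"{factor:.1f}x")  # 6.6x
```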
The LLM Token Counter tool on this site has a model dropdown that lets you flip between vendors and see the cost recompute live. It’s the fastest way to ballpark a model swap before you touch the API.
2. Context window utilization
Every model has a maximum context (input + output combined). In May 2026:
- GPT-5: 400k tokens
- GPT-4o: 128k tokens
- Claude Opus 4.7 / Sonnet 4.6: up to 1M tokens
- Gemini 3.1 Pro: up to 2M tokens
If you’re stuffing a 100k-token document into the prompt and the tokenizer counts 8% denser than expected, you’ll silently truncate or hit a 400 error. Always count against the actual production tokenizer before assuming a context fits.
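One way to guard against that is to pad your estimate before comparing it to the limit. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, the 8% default margin mirrors the example above, and the limits are the figures listed in this section.

```python
import math


def context_headroom(estimated_input_tokens: int, max_output_tokens: int,
                     context_limit: int, safety_margin: float = 0.08) -> int:
    """Tokens left over after padding the input estimate and reserving output."""
    padded_input = math.ceil(estimated_input_tokens * (1.0 + safety_margin))
    return context_limit - padded_input - max_output_tokens


def fits_context(estimated_input_tokens: int, max_output_tokens: int,
                 context_limit: int, safety_margin: float = 0.08) -> bool:
    """True if the padded input plus reserved output fits in the context window."""
    return context_headroom(estimated_input_tokens, max_output_tokens,
                            context_limit, safety_margin) >= 0


# A 100k-token document with 4k tokens reserved for output, against GPT-4o's 128k
print(fits_context(100_000, 4_000, 128_000))  # True
# A 120k-token document no longer fits once the margin is applied
print(fits_context(120_000, 4_000, 128_000))  # False
```

The margin only covers estimation error. Once you have the production tokenizer's real count, set `safety_margin=0.0` and compare directly.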
3. Output quality
Tokenization affects how the model “sees” the input. Two real cases I’ve hit:
- Number formatting: 1234567 may tokenize as 1234567 (one token) or 123 + 4567 (two tokens) depending on the tokenizer. Models trained on the former handle digit-grouping math better.
- JSON keys: Snake_case keys like "user_id" may tokenize as "user, _id", a four-token cost vs. ~two for userId. Beyond cost, if the model has seen userId in training more frequently, you’ll get more reliable JSON output by matching that style.
Edge cases that bite
Emoji and unicode
Emoji tokenize unpredictably. A single 🚀 might be one token in some tokenizers and split byte by byte into two to four tokens in others. o200k_base handles many modern emoji as single tokens; older tokenizers often don’t. If your product accepts user-submitted text with emoji, count tokens against actual user samples, not lorem ipsum.
URLs and base64
Long opaque strings like eyJhbGciOiJIUzI1NiIs... (a JWT) tokenize at roughly one token per 1.2 characters because near-random character sequences match very few of the merges the tokenizer learned. A 600-character JWT is ~500 tokens. If you need to debug or inspect JWTs without paying for them in prompts, the JWT Debugger lets you decode locally first.
Same story with base64 — if you have a workflow that’s stuffing image data into prompts, Image to Base64 followed by sending only the relevant part is far cheaper than dumping raw blobs to the model.
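To find out how much of a prompt is opaque payload before you send it, a quick scan like the following can help. It is a heuristic sketch: the regex, the 40-character threshold, and the function name are my own assumptions, and the 1.2 characters-per-token ratio is the rough figure quoted above rather than a measured constant.

```python
import re

# Assumption: a run of 40+ base64/base64url/JWT characters is an opaque blob.
OPAQUE_RUN = re.compile(r"[A-Za-z0-9+/_=.\-]{40,}")


def opaque_token_estimate(text: str, chars_per_token: float = 1.2):
    """Return (number of opaque runs, estimated tokens spent on them)."""
    runs = OPAQUE_RUN.findall(text)
    total_chars = sum(len(run) for run in runs)
    return len(runs), round(total_chars / chars_per_token)


prompt = "Authorization: Bearer " + "A" * 600 + " please debug this request"
print(opaque_token_estimate(prompt))  # (1, 500)
```

Long URLs without spaces will also match, which is fine here because they tokenize poorly for the same reason.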
Whitespace asymmetry
"hello" and " hello" are usually different tokens. Models are trained with whitespace attached to the following word, so the leading-space variant is more “natural” and often tokenizes as one token while the no-space version splits oddly. This rarely matters for prompt cost but can affect generation quality on the boundary.
Cost comparison: same prompt, eight models
Here’s a 2,000-token system prompt + 500-token user query + 800-token expected output, run once:
| Model | Input cost | Output cost | Total per call |
|---|---|---|---|
| GPT-5 | $0.0125 | $0.0120 | $0.0245 |
| GPT-4o | $0.0063 | $0.0080 | $0.0143 |
| GPT-4o mini | $0.0004 | $0.0005 | $0.0009 |
| Claude Opus 4.7 | $0.0375 | $0.0600 | $0.0975 |
| Claude Sonnet 4.6 | $0.0075 | $0.0120 | $0.0195 |
| Claude Haiku 4.5 | $0.0020 | $0.0032 | $0.0052 |
| Gemini 3.1 Pro | $0.0088 | $0.0084 | $0.0172 |
| Gemini 1.5 Flash | $0.0002 | $0.0002 | $0.0004 |
Multiply by 100k calls/month and the spread is from $40 (Gemini 1.5 Flash) to $9,750 (Claude Opus 4.7) — a 240× range for what’s nominally the same task. Picking the right model is by far the highest-leverage cost lever in a production AI feature.
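Each row in the table is the same two-term formula, so it is easy to reproduce for your own volumes. In the sketch below the function names are hypothetical, and the $75/M output rate for Claude Opus 4.7 is back-derived from the table ($0.0600 for 800 output tokens) rather than taken from a price sheet.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Dollar cost of one call, given per-million-token rates."""
    input_cost = input_tokens * input_rate_per_m / 1_000_000
    output_cost = output_tokens * output_rate_per_m / 1_000_000
    return input_cost + output_cost


def monthly_cost(per_call: float, calls_per_month: int) -> float:
    """Scale a per-call cost to a monthly volume."""
    return per_call * calls_per_month


# 2,000-token system prompt + 500-token query in, 800 tokens out, on Claude Opus 4.7
opus_call = cost_per_call(2_500, 800, input_rate_per_m=15.0, output_rate_per_m=75.0)
print(f"${opus_call:.4f} per call")                          # $0.0975 per call
print(f"${monthly_cost(opus_call, 100_000):,.0f} per month")  # $9,750 per month
```

Note that this uses the same token counts for every model, as the table does. For a tighter estimate, feed in per-vendor counts from each model's own tokenizer.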
How to count tokens in production
A few patterns that have served me well:
1. Estimate before you commit. For prototyping and rough budgets, a calibrated heuristic (chars-per-token adjusted by content type) typically lands within about 5% of the true token count for English prose, which is good enough to make decisions. The LLM Token Counter does exactly this across all major models.
2. Use the official tokenizer for billing-critical paths. Once a feature is shipping, replace the heuristic with the vendor’s official counter. For OpenAI use tiktoken (offline, fast). For Claude use the count_tokens API endpoint (network call, but accurate). For Gemini use the SDK’s count_tokens().
3. Cache token counts where possible. If your system prompt is static, count it once at deploy time and cache the number. You don’t need to retokenize every request.
4. Set hard limits early. Reject inputs above a threshold (e.g., 8k tokens) at your API boundary, not at the model boundary. Token counting is cheap; LLM calls are not.
5. Log per-request token counts. When billing surprises happen, you want to see which requests are expensive. Log input tokens, output tokens, and model per call and you’ll find the long-tail cost outliers in five minutes.
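Patterns 3 through 5 fit together in a small request guard. This is a sketch, not a drop-in implementation: every name here is hypothetical, and `count_tokens` is a placeholder heuristic that you would swap for the vendor's official counter on billing-critical paths.

```python
import logging
import math
from functools import lru_cache

logger = logging.getLogger("llm_usage")

MAX_INPUT_TOKENS = 8_000  # example threshold from the list above


class InputTooLarge(ValueError):
    """Raised when a request exceeds the token budget at the API boundary."""


def count_tokens(text: str) -> int:
    """Placeholder counter (~4 chars per token); replace with the official one."""
    return math.ceil(len(text) / 4)


@lru_cache(maxsize=32)
def cached_count(text: str) -> int:
    """Cache counts for static text such as the system prompt."""
    return count_tokens(text)


def check_request(system_prompt: str, user_input: str, model: str) -> int:
    """Reject oversized inputs before any LLM call and log the input size."""
    total = cached_count(system_prompt) + count_tokens(user_input)
    if total > MAX_INPUT_TOKENS:
        raise InputTooLarge(f"{total} input tokens exceeds limit of {MAX_INPUT_TOKENS}")
    logger.info("model=%s input_tokens=%d", model, total)
    return total


def log_usage(model: str, input_tokens: int, output_tokens: int) -> None:
    """Record per-call usage so expensive requests can be found later."""
    logger.info("model=%s input_tokens=%d output_tokens=%d",
                model, input_tokens, output_tokens)
```

Call `check_request` at your API boundary and `log_usage` once the model responds, passing the token counts the vendor reports back so the logs match the invoice.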
Quick reference
| Goal | Best choice |
|---|---|
| Cheapest with reasonable quality | GPT-4o mini or Gemini 1.5 Flash |
| Best quality regardless of cost | Claude Opus 4.7 or GPT-5 |
| Largest context window | Gemini 3.1 Pro (2M tokens) |
| Most accurate tokenization for non-Latin scripts | Gemini 3.1 Pro (SentencePiece) |
| Open-weights / self-hostable | Llama 3.3 70B (~$0.20/M input on hosted APIs) |
Try it yourself
The fastest way to internalize how these tokenizers behave is to paste your real prompt into a counter and toggle between models. The LLM Token Counter does this with no signup, no data leaving your browser, and a context-window meter so you can see when a long prompt is about to overflow.
Pair it with the Word Counter for character/word stats and the JSON Formatter if you’re building structured-output prompts — those three tools cover most of the prompt-engineering surface area.
Tokens are not going away as the unit of LLM economics. The faster you build intuition for how each tokenizer treats your specific content — your code style, your domain vocabulary, your users’ language — the smaller your bill and the larger your usable context. A few hours of measurement now saves a lot of invoice surprises later.