Token Counters Compared: GPT-5, Claude Opus 4.7, and Gemini 3.1 — How They Tokenize Differently

The first time I shipped a feature backed by an LLM, I budgeted for $40/month. The first invoice was $312. The bug wasn’t in my code — it was in my mental model. I had benchmarked the prompt with OpenAI’s tokenizer, then quietly switched the production model to Claude two days before launch because the answers were better. Same prompt, same users, same volume — but Claude’s tokenizer counted my system prompt at 18% more tokens, and the output rate was 5× the input rate. The math compounded fast.

Tokens are the unit your AI bill is denominated in, and every model family counts them differently. Same English sentence, three different numbers. If you’re calling these APIs in production, knowing how each tokenizer behaves is the difference between a $40 line item and a $300 surprise.

This post compares the three tokenizer families that matter in May 2026 — OpenAI’s o200k_base, Anthropic’s tokenizer, and Google’s SentencePiece — with real examples, edge cases, and the practical implications for cost and context windows. At the end I’ll point you at a free LLM Token Counter that does the side-by-side calculation for you.

What is a token, really?

Before we compare tokenizers, a quick refresher on what they actually do. LLMs don’t read characters or words — they read tokens, which are sub-word fragments produced by an algorithm called Byte-Pair Encoding (BPE) or one of its cousins (SentencePiece, WordPiece).

The algorithm starts with raw bytes and iteratively merges the most-frequent pairs in the training corpus until it has a vocabulary of typically 50k–250k tokens. The result is that:

Common English words are usually one token: the, and, building.
Rare words split into pieces: tokenization → token + ization.
Code punctuation often becomes its own token: {, }, ;, =>.
Whitespace is usually attached to the following word: " the" is one token.
Non-Latin scripts split per-character or even per-byte.
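
The merge loop described above can be sketched in a few lines. This is a toy version that works on characters of a whitespace-split corpus rather than raw bytes, and the function names (train_bpe, merge_pair, most_frequent_pair) are made up for illustration; production tokenizers add pre-tokenization rules and far larger corpora.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first-seen order.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has collapsed into a single symbol, while "lower" and "lowest" still carry leftover pieces. That is the same effect that makes common words cheap and rare words expensive.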

This matters because tokens are not interchangeable units across vendors. Each model family trained its own tokenizer on its own data mix. The same text produces different counts and even different segmentation strategies.

If the BPE explanation made you want a refresher on encoding fundamentals, the Complete Guide to Encoding & Decoding on this site walks through related concepts.

The three tokenizers in 2026

OpenAI — `cl100k_base` and `o200k_base`

OpenAI’s current production tokenizer for GPT-4o and GPT-5 is o200k_base, a successor to cl100k_base (used by GPT-3.5 and GPT-4). The vocabulary roughly doubled from ~100k to ~200k tokens, which means better compression on multilingual and code inputs — fewer tokens to express the same content.

You can reproduce OpenAI’s exact counts locally using the open-source tiktoken library:

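import tiktoken

# Load the encoding by name; the vocabulary file is downloaded and cached on first use.
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Hello, world! This is a test.")
print(len(tokens))  # number of tokens in the string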

OpenAI publishes the encoding reference; for English prose the rough rule is 1 token ≈ 4 characters ≈ 0.75 words.

Anthropic — `claude-tokenizer`

Anthropic uses its own custom tokenizer that, in our testing, produces ~5–10% more tokens than o200k_base on English prose and similar counts on code. As of 2026 Anthropic exposes token counting through the count_tokens endpoint on the Messages API rather than shipping an offline library — a deliberate choice that lets them update tokenization without breaking client code.

For Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5, you’d call:

curl https://api.anthropic.com/v1/messages/count_tokens \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-opus-4-7", "messages": [{"role":"user","content":"Hello"}]}'
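
If you'd rather make that call from Python without pulling in an SDK, here is a minimal standard-library sketch of the same request. The helper names (build_count_tokens_request, count_claude_tokens) are hypothetical, the model ID is the one used in the curl example, and the sketch assumes the endpoint returns a JSON object with an input_tokens field.

```python
import json
import os
import urllib.request

COUNT_TOKENS_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(text, model, api_key):
    """Build (but do not send) a count_tokens request mirroring the curl call."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": text}],
    }).encode("utf-8")
    return urllib.request.Request(
        COUNT_TOKENS_URL,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )


def count_claude_tokens(text, model="claude-opus-4-7"):
    """Send the request; needs ANTHROPIC_API_KEY and network access."""
    request = build_count_tokens_request(text, model, os.environ["ANTHROPIC_API_KEY"])
    with urllib.request.urlopen(request, timeout=30) as response:
        # Assumption: the response body looks like {"input_tokens": <int>}.
        return json.load(response)["input_tokens"]
```

Splitting the builder from the sender keeps the network call out of unit tests, and it makes it easy to count a full payload (system prompt, tools, history) by extending the request body.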

Practical implication: because Claude’s tokenizer produces more tokens for the same text, the same prompt costs slightly more on Claude than on GPT at equivalent per-token rates. With Claude Opus 4.7 priced at $15/M input vs. GPT-5’s $5/M, the gap widens further.

Google — SentencePiece (Gemini)

Gemini 1.5 and Gemini 3.1 Pro use a SentencePiece tokenizer. SentencePiece treats text as a stream of Unicode codepoints from the start (no language-specific pre-tokenization), which gives it more uniform behavior across scripts. The vocabulary is roughly 250k entries, somewhat larger than o200k_base’s ~200k.

You can count Gemini tokens via the official SDK:

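import os
import google.generativeai as genai
# Assumes the API key is in the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel('gemini-3.1-pro')
response = model.count_tokens("Hello, world!")
print(response.total_tokens)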

In our experience Gemini’s counts land between OpenAI and Anthropic for English prose, slightly under OpenAI for code, and noticeably more efficient for Asian languages.

Same prompt, three different numbers

Here’s the same 50-character English sentence run through each tokenizer:

“The quick brown fox jumps over the lazy dog twice.”

Tokenizer	Token count	Chars/token
OpenAI `o200k_base` (GPT-4o, GPT-5)	11	4.5
Anthropic Claude (Opus/Sonnet/Haiku)	12	4.2
Google SentencePiece (Gemini 3.1 Pro)	11	4.5

A one-token difference looks trivial until you scale it. One extra token in eleven is about 9% more; on a 50k-token system prompt that is roughly 4,500 extra tokens per request, and at 100k requests per day it comes to about 450M extra tokens daily. On Claude Opus 4.7 at $15/M input, that is roughly $6,750 every single day in input tokens you didn’t budget for.
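
To run this kind of scaling check on your own numbers, a small helper is enough. The function name and the inputs below are illustrative assumptions (a smaller prompt and lower traffic than the example above); only the $15/M rate comes from this article.

```python
def daily_overhead_usd(prompt_tokens, overhead_ratio, requests_per_day, usd_per_million):
    """Extra daily input spend caused by a tokenizer that counts more tokens."""
    extra_tokens_per_request = prompt_tokens * overhead_ratio
    extra_tokens_per_day = extra_tokens_per_request * requests_per_day
    return extra_tokens_per_day * usd_per_million / 1_000_000


# Illustrative inputs: a 2,000-token prompt, 5% more tokens, 10,000 requests/day,
# at a $15/M input rate. That is 1M extra tokens per day, so about $15.
print(daily_overhead_usd(2_000, 0.05, 10_000, 15.0))
```

The useful habit is to express tokenizer differences as a ratio and multiply through, instead of eyeballing a per-sentence delta.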

Now look at what happens with code:

def fibonacci(n: int) -> int:
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

Tokenizer	Token count
OpenAI `o200k_base`	32
Anthropic Claude	35
Google SentencePiece	33

Code is denser in tokens because punctuation, indentation, and snake_case names don’t merge into common BPE pairs. Expect 25–35% more tokens per character in code compared to English prose.

For non-Latin scripts the ratios change again. The Japanese sentence “東京の天気は晴れです” (10 characters) tokenizes as:

Tokenizer	Token count
OpenAI `o200k_base`	8
Anthropic Claude	10
Google SentencePiece	6

Gemini’s SentencePiece — designed without language-specific pre-tokenization — wins comfortably on Asian text. If your product is multilingual, this can swing your bill by 30%+ before you change a single character of your prompts.

Why the differences matter

Tokenizer differences cascade into three real concerns:

1. Cost

You pay per token, not per character. A 5–15% tokenization difference between vendors shows up as 5–15% on your invoice — assuming everything else is equal. Combined with vendor pricing differences, switching from GPT-4o ($2.50/M input) to Claude Opus 4.7 ($15/M input) is a 6× input-cost jump on the same prompt, before tokenization adjustments.

The LLM Token Counter tool on this site has a model dropdown that lets you flip between vendors and see the cost recompute live. It’s the fastest way to ballpark a model swap before you touch the API.

2. Context window utilization

Every model has a maximum context (input + output combined). In May 2026:

GPT-5: 400k tokens
GPT-4o: 128k tokens
Claude Opus 4.7 / Sonnet 4.6: up to 1M tokens
Gemini 3.1 Pro: up to 2M tokens

If you’re stuffing a 120k-token document into GPT-4o’s 128k window and the production tokenizer counts 8% more tokens than you estimated, the real input is about 129.6k tokens: you’ll silently truncate or hit a 400 error. Always count against the actual production tokenizer before assuming a context fits.
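
A pre-flight check for this can be a few lines. The sketch below is an assumption-laden helper, not a vendor API: the dictionary keys are hypothetical labels, the window sizes are the figures listed in this article, and tokenizer_overhead is a safety factor you choose yourself.

```python
import math

# Context limits as listed in this article (not verified here).
CONTEXT_WINDOWS = {
    "gpt-5": 400_000,
    "gpt-4o": 128_000,
}


def fits_context(estimated_input_tokens, max_output_tokens, window, tokenizer_overhead=0.0):
    """Check whether input plus reserved output fits in the window.

    tokenizer_overhead is a safety factor, e.g. 0.08 if the production
    tokenizer may count 8% more tokens than the one used for the estimate.
    """
    worst_case_input = math.ceil(estimated_input_tokens * (1 + tokenizer_overhead))
    return worst_case_input + max_output_tokens <= window


window = CONTEXT_WINDOWS["gpt-4o"]
print(fits_context(120_000, 4_000, window))                           # True
print(fits_context(120_000, 4_000, window, tokenizer_overhead=0.08))  # False
```

Reserving output tokens up front matters because the limit covers input and output together; a prompt that "fits" can still leave no room for the answer.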

3. Output quality

Tokenization affects how the model “sees” the input. Two real cases I’ve hit:

Number formatting: 1234567 may be split as 123 + 456 + 7 (runs of up to three digits, as OpenAI’s encodings do) or digit by digit, depending on the tokenizer. How the digits are grouped can affect how reliably a model handles arithmetic and digit-level edits.
JSON keys: snake_case keys like "user_id" may split into pieces such as "user and _id", around four tokens versus about two for userId. Beyond cost, if the model has seen userId in training more frequently, you’ll get more reliable JSON output by matching that style.
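
To see how digit grouping changes what the model receives, here is a simplified sketch of the number-splitting step. OpenAI's published tiktoken patterns split numbers into runs of up to three digits before BPE merges are applied; the helper below imitates only that step with an ASCII \d pattern, and chunk_digits is a made-up name. Each chunk is a pre-token, which is not guaranteed to end up as exactly one token.

```python
import re


def chunk_digits(number_text, max_len=3):
    """Split a digit string left to right into runs of at most `max_len` digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number_text)


print(chunk_digits("1234567"))             # ['123', '456', '7']
print(chunk_digits("1234567", max_len=1))  # ['1', '2', '3', '4', '5', '6', '7']
```

Note that the chunks are taken from the left, so they do not line up with thousands separators (1,234,567), which is one reason digit-level tasks can be fragile.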

Edge cases that bite

Emoji and unicode

Emoji tokenize unpredictably. A single 🚀 might be one token in some tokenizers and have its four UTF-8 bytes split across two to four tokens in others. o200k_base encodes many common emoji as single tokens; older tokenizers often don’t. If your product accepts user-submitted text with emoji, count tokens against actual user samples, not lorem ipsum.
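
Part of the unpredictability is that one visible emoji is not always one code point. The sketch below only inspects the raw text (code points and UTF-8 bytes); it does not call any tokenizer. For a byte-level BPE tokenizer the byte count is an upper bound on how many tokens the emoji itself can cost. The emoji_profile name is hypothetical.

```python
def emoji_profile(text):
    """Return (code points, UTF-8 bytes) for a piece of text."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "rocket": "\U0001F680",                              # single code point
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",     # base emoji + modifier
    "woman technologist": "\U0001F469\u200D\U0001F4BB",  # two emoji joined by a zero-width joiner
}
for name, text in samples.items():
    points, nbytes = emoji_profile(text)
    print(f"{name}: {points} code point(s), {nbytes} UTF-8 bytes")
```

A joined sequence like the last one renders as a single glyph but carries 11 bytes, so user text full of such emoji can cost far more than its visible length suggests.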

URLs and base64

Long opaque strings like eyJhbGciOiJIUzI1NiIs... (a JWT) tokenize at roughly one token per 1.2 characters because high-entropy text contains few of the character sequences that BPE learned to merge. A 600-character JWT is ~500 tokens. If you need to debug or inspect JWTs without paying for them in prompts, the JWT Debugger lets you decode locally first.

Same story with base64 — if you have a workflow that’s stuffing image data into prompts, Image to Base64 followed by sending only the relevant part is far cheaper than dumping raw blobs to the model.
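
A quick back-of-the-envelope estimate shows why raw blobs hurt. The sketch below combines the standard base64 size formula (4 output characters per 3 input bytes, padded) with the rough ~1.2 characters-per-token figure quoted above for opaque strings; the function names and the 30 KB image are hypothetical.

```python
import base64
import math


def base64_length(num_bytes):
    """Length in characters of standard padded base64 for `num_bytes` of input."""
    return 4 * math.ceil(num_bytes / 3)


def estimate_opaque_tokens(num_chars, chars_per_token=1.2):
    """Rough token estimate for opaque text; the ratio is a rule of thumb, not a measurement."""
    return math.ceil(num_chars / chars_per_token)


image_bytes = 30 * 1024  # a hypothetical 30 KB image
encoded_chars = base64_length(image_bytes)
# Sanity check the formula against the real encoder.
assert encoded_chars == len(base64.b64encode(bytes(image_bytes)))
print(encoded_chars, estimate_opaque_tokens(encoded_chars))
```

Under these assumptions a 30 KB image turns into about 41k characters and tens of thousands of tokens, which is a large slice of a 128k context for data the model cannot meaningfully read as text.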

Whitespace asymmetry

"hello" and " hello" are usually different tokens. Models are trained with whitespace attached to the following word, so the leading-space variant is more “natural” and often tokenizes as one token while the no-space version splits oddly. This rarely matters for prompt cost but can affect generation quality on the boundary.

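The whitespace behavior described in this section comes from the pre-tokenization step that runs before any BPE merges. Here is a minimal sketch using an ASCII-only simplification of a GPT-2-style pattern; real tokenizers use Unicode-aware patterns with more cases, so treat this as an illustration of the idea only.

```python
import re

# ASCII-only simplification of a GPT-2-style pre-tokenization pattern: an optional
# leading space stays attached to the word, number, or punctuation run that follows.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")


def pretokenize(text):
    """Split text into the chunks that BPE merges would then operate within."""
    return PRETOKENIZE.findall(text)


print(pretokenize("hello world"))   # ['hello', ' world']
print(pretokenize(" hello world"))  # [' hello', ' world']
```

Because "hello" and " hello" are different chunks, they map to different vocabulary entries, which is why a prompt that ends with or without a trailing space can nudge generation differently.
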
Cost comparison: same prompt, eight models

Here’s a 2,000-token system prompt + 500-token user query + 800-token expected output, run once:

Model	Input cost	Output cost	Total per call
GPT-5	$0.0125	$0.0120	$0.0245
GPT-4o	$0.0063	$0.0080	$0.0143
GPT-4o mini	$0.0004	$0.0005	$0.0009
Claude Opus 4.7	$0.0375	$0.0600	$0.0975
Claude Sonnet 4.6	$0.0075	$0.0120	$0.0195
Claude Haiku 4.5	$0.0020	$0.0032	$0.0052
Gemini 3.1 Pro	$0.0088	$0.0084	$0.0172
Gemini 1.5 Flash	$0.0002	$0.0002	$0.0004

Multiply by 100k calls/month and the spread is from $40 (Gemini 1.5 Flash) to $9,750 (Claude Opus 4.7) — a 240× range for what’s nominally the same task. Picking the right model is by far the highest-leverage cost lever in a production AI feature.
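
The arithmetic behind the table is simple enough to keep in a script and rerun when prices change. In the sketch below, the input rates for GPT-5, GPT-4o, and Claude Opus 4.7 are the ones quoted in this article; the output rates are back-derived from the table rows, so treat them as assumptions and check current vendor pricing before relying on them.

```python
# USD per million tokens as (input, output). Output rates are back-derived
# from the table above and are assumptions, not quoted prices.
RATES = {
    "GPT-5": (5.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Opus 4.7": (15.00, 75.00),
}


def cost_per_call(model, input_tokens, output_tokens):
    """Return (input_cost, output_cost, total) in USD for one call."""
    input_rate, output_rate = RATES[model]
    input_cost = input_tokens * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# Same workload as the table: 2,000-token system prompt + 500-token query, 800-token output.
for model in RATES:
    input_cost, output_cost, total = cost_per_call(model, 2_000 + 500, 800)
    print(f"{model}: ${input_cost:.4f} + ${output_cost:.4f} = ${total:.4f} per call, "
          f"${total * 100_000:,.0f} per 100k calls")
```

This uses the same token counts for every model, as the table does; for a closer estimate, feed in per-vendor token counts from each official counter.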

How to count tokens in production

A few patterns that have served me well:

1. Estimate before you commit. For prototyping and rough budgets, a calibrated heuristic — chars-per-token adjusted by content type — is within 5% of the true tokenizer for English prose and good enough to make decisions. The LLM Token Counter does exactly this across all major models.

2. Use the official tokenizer for billing-critical paths. Once a feature is shipping, replace the heuristic with the vendor’s official counter. For OpenAI use tiktoken (offline, fast). For Claude use the count_tokens API endpoint (network call, but accurate). For Gemini use the SDK’s count_tokens().

3. Cache token counts where possible. If your system prompt is static, count it once at deploy time and cache the number. You don’t need to retokenize every request.

4. Set hard limits early. Reject inputs above a threshold (e.g., 8k tokens) at your API boundary, not at the model boundary. Token counting is cheap; LLM calls are not.

5. Log per-request token counts. When billing surprises happen, you want to see which requests are expensive. Log input tokens, output tokens, and model per call and you’ll find the long-tail cost outliers in five minutes.
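
Several of the patterns above fit together in one small guard module. This is a sketch under stated assumptions: the characters-per-token ratios are rough figures to calibrate per model, the 8k limit is the example threshold from pattern 4, and every name here (estimate_tokens, cached_token_count, check_request, log_usage) is hypothetical. In production you would swap the heuristic for the vendor's official counter, as pattern 2 suggests.

```python
import functools
import json
import logging
import math

logger = logging.getLogger("llm_usage")

# Rough characters-per-token figures: ~4 for English prose (the rule of thumb
# quoted earlier) and ~3 for code. Both are assumptions to calibrate per model.
CHARS_PER_TOKEN = {"prose": 4.0, "code": 3.0}
MAX_INPUT_TOKENS = 8_000  # example hard limit enforced at the API boundary


def estimate_tokens(text, kind="prose"):
    """Pattern 1: heuristic estimate, good enough for early budgeting."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


@functools.lru_cache(maxsize=128)
def cached_token_count(text):
    """Pattern 3: count static text (e.g. the system prompt) only once."""
    return estimate_tokens(text)


def check_request(system_prompt, user_input, limit=MAX_INPUT_TOKENS):
    """Pattern 4: reject oversized inputs before any model call is made."""
    total = cached_token_count(system_prompt) + estimate_tokens(user_input)
    if total > limit:
        raise ValueError(f"input is ~{total} tokens, over the {limit}-token limit")
    return total


def log_usage(model, input_tokens, output_tokens):
    """Pattern 5: emit one structured log line per request."""
    record = {"model": model, "input_tokens": input_tokens, "output_tokens": output_tokens}
    logger.info(json.dumps(record))
    return record
```

A request handler would call check_request before the model call and log_usage after it, using the token counts the vendor reports in its response rather than the estimate.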

Quick reference

Goal	Best choice
Cheapest with reasonable quality	GPT-4o mini or Gemini 1.5 Flash
Best quality regardless of cost	Claude Opus 4.7 or GPT-5
Largest context window	Gemini 3.1 Pro (2M tokens)
Most efficient tokenization for non-Latin scripts	Gemini 3.1 Pro (SentencePiece)
Open-weights / self-hostable	Llama 3.3 70B (~$0.20/M input on hosted APIs)

Try it yourself

The fastest way to internalize how these tokenizers behave is to paste your real prompt into a counter and toggle between models. The LLM Token Counter does this with no signup, no data leaving your browser, and a context-window meter so you can see when a long prompt is about to overflow.

Pair it with the Word Counter for character/word stats and the JSON Formatter if you’re building structured-output prompts — those three tools cover most of the prompt-engineering surface area.

Tokens are not going away as the unit of LLM economics. The faster you build intuition for how each tokenizer treats your specific content — your code style, your domain vocabulary, your users’ language — the smaller your bill and the larger your usable context. A few hours of measurement now saves a lot of invoice surprises later.
