CLIP splits prompts into chunks of 75 tokens (+2 for start/end = 77). Counting matches A1111 / ComfyUI. Runs fully in your browser.

This tool counts how many tokens your Stable Diffusion / A1111 / ComfyUI prompt becomes in the CLIP text encoder and shows, in real time, where it crosses each 75-token boundary. CLIP splits a prompt into chunks of 75 tokens (plus a start and an end special token, so 77 per chunk); anything past 75 spills into the next chunk. On long prompts, seeing which word starts a new chunk, and how many tokens remain before 75, makes it easy to move the words you want emphasized toward the front of a chunk or to trim what you don't need.

Tokenization reproduces CLIP's own byte-level BPE, so the count matches A1111's "x/75" readout and ComfyUI (for example, `a photo of a cat` is 5 tokens and `masterpiece, best quality, 1girl` is 7). Paste a prompt and each token is shown as a chip, with a divider drawn every 75 tokens. Tokens that end a word (where a space follows) get a faint marker, so you can see when a single word splits across several tokens: `lowres` becomes `low` + `res`.

The CLIP vocabulary used for the lookup (the BPE merge table) is a static file we host ourselves; the prompt you type is never uploaded or stored. Your prompt is the blueprint of your work, so everything is computed locally in your browser.

How to use

Paste your Stable Diffusion prompt into the input box (tags or natural language are both fine).
Tokens appear as chips on the right, with a "chunk boundary" line drawn every 75 tokens. The toolbar shows the total token count and number of chunks.
To stay under 75 or to control where the chunks break, trim or reorder words and watch the boundary line move.
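
The numbers in the toolbar and the position of each boundary line follow from simple arithmetic on the token list. Here is a minimal sketch in Python, assuming the tokens are already available as a list of strings; `chunk_report` and the placeholder tokens `t0`..`t79` are hypothetical names used only for illustration, not part of this tool.

```python
import math

CHUNK_SIZE = 75  # content tokens per CLIP chunk, as described above


def chunk_report(tokens: list[str]) -> dict:
    """Summarize how a token list falls across 75-token chunks."""
    total = len(tokens)
    chunks = math.ceil(total / CHUNK_SIZE)
    # Tokens already used in the last (current) chunk.
    used_in_last = total - (chunks - 1) * CHUNK_SIZE if chunks else 0
    return {
        "total": total,
        "chunks": chunks,
        "left_in_chunk": CHUNK_SIZE - used_in_last,
        # Index and text of the token that opens each chunk after the first.
        "boundaries": [
            (i, tokens[i]) for i in range(CHUNK_SIZE, total, CHUNK_SIZE)
        ],
    }


# Hypothetical input: 80 placeholder tokens named t0..t79.
report = chunk_report([f"t{i}" for i in range(80)])
print(report["total"], report["chunks"], report["left_in_chunk"])  # 80 2 70
print(report["boundaries"])  # [(75, 't75')]
```

With 80 tokens, the second chunk starts at token index 75 and has 70 slots left, so trimming six tokens from anywhere in the prompt would bring it back to a single chunk.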

FAQ

Is my prompt sent anywhere?

No. All tokenization happens inside your browser. The only thing loaded is the CLIP vocabulary file (the BPE merge table), which we host ourselves. The prompt you type is never uploaded or stored.

Does the count match A1111 and ComfyUI?

Yes, for the raw prompt text. The tool reproduces CLIP's own byte-level BPE, so it matches the "x/75" number A1111 shows and ComfyUI's token count. We verified cases like `a photo of a cat` = 5 and `masterpiece, best quality, 1girl` = 7. Prompts that use weighting syntax are the exception; see the last question below.

What is the "75-token chunk"?

CLIP splits a prompt into chunks of 75 tokens and adds two special tokens, one at the start and one at the end, to each (77 per chunk). Past 75, content moves into the next chunk, where, depending on the UI, it can be weighted differently or truncated. This tool draws a line at each of those boundaries.
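
To make the 75 + 2 = 77 framing concrete, here is a minimal sketch in Python. It assumes the standard CLIP vocabulary, where the start token has id 49406 and the end token has id 49407, and it assumes that a short chunk is padded with the end token; padding behavior varies between UIs, so treat that part as an illustration only. `wrap_chunks` is a hypothetical helper name.

```python
START_ID = 49406  # <|startoftext|> in the standard CLIP vocabulary
END_ID = 49407    # <|endoftext|> in the standard CLIP vocabulary
CHUNK_SIZE = 75   # content tokens per chunk


def wrap_chunks(token_ids: list[int]) -> list[list[int]]:
    """Split ids into 75-token chunks and frame each one to 77 entries."""
    framed = []
    # max(..., 1) ensures an empty prompt still yields one (all-padding) chunk.
    for start in range(0, max(len(token_ids), 1), CHUNK_SIZE):
        body = token_ids[start:start + CHUNK_SIZE]
        # Assumption: short chunks are padded with the end token.
        padding = [END_ID] * (CHUNK_SIZE - len(body))
        framed.append([START_ID] + body + padding + [END_ID])
    return framed


# Hypothetical input: 80 placeholder ids (0..79) standing in for real tokens.
chunks = wrap_chunks(list(range(80)))
print(len(chunks), [len(c) for c in chunks])  # 2 [77, 77]
```

The second chunk here carries only five content tokens; the rest is padding, which is why a prompt that barely crosses 75 uses a full extra chunk.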

Why does one word become several tokens?

CLIP doesn't map each word to a single token; it uses BPE (byte-pair encoding) to split text into frequently occurring substrings. For example, `lowres` splits into `low` + `res`. The chips and the end-of-word markers let you see how each word was broken up.
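
The splitting can be illustrated with a toy version of the BPE merge loop in Python. This is a sketch under loud assumptions: the four-entry `TOY_MERGES` table is invented for this example and is not CLIP's real merge table, and the code works on characters rather than on CLIP's byte-level symbols. The `</w>` suffix marks the token that ends a word.

```python
def bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Repeatedly apply the highest-priority merge until none applies."""
    if not word:
        return []
    # Earlier entries in the merge table have higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    # Start from single characters; the last one carries the end-of-word mark.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is in the merge table
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table, invented for illustration (NOT the real CLIP merges).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "s</w>"), ("r", "es</w>")]
print(bpe("lowres", TOY_MERGES))  # ['low', 'res</w>']
```

Because the toy table has no entry that joins `low` and `res</w>`, the word stays as two tokens, which is the same kind of outcome the chips show for `lowres`.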

Can it count Japanese prompts?

Yes, but CLIP wasn't trained mainly on Japanese, so Japanese text is broken into many byte-level pieces and uses far more tokens than English tags. That's why a Japanese-heavy prompt shows a high token count.
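
A rough way to see why is to compare UTF-8 byte counts, since byte-level BPE starts from bytes and then merges them. The Python sketch below only counts bytes; it does not compute token counts, and the sample strings and the `utf8_byte_count` helper are chosen for illustration.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Each ASCII character is 1 byte; each of these Japanese characters is 3 bytes.
for sample in ("cat", "猫", "1girl", "女の子"):
    print(sample, len(sample), utf8_byte_count(sample))
# cat 3 3
# 猫 1 3
# 1girl 5 5
# 女の子 3 9
```

English tags usually collapse into a few tokens because their byte sequences match many learned merges, while Japanese characters start from three bytes each and tend to match fewer merges.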

Are weighting syntaxes like `(word:1.2)` and `[ ]` included in the count?

This tool tokenizes the text exactly as CLIP would, so parentheses, colons, and numbers are counted as tokens too. In real generation, the UI often parses the weighting syntax and strips the brackets and weights before tokenizing, so a prompt that uses a lot of emphasis may have a lower effective token count than shown here. Treat the number here as the raw token count.
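
To estimate the effective count, you could strip the emphasis syntax before pasting the prompt. The Python sketch below assumes A1111-style syntax, `(word)`, `(word:1.2)`, and `[word]`, with backslash-escaped brackets left alone. It is a simplification and not how any particular UI parses prompts; it does not handle prompt-editing forms such as `[from:to:when]`. `strip_emphasis` is a hypothetical helper name.

```python
import re

# A numeric weight such as ":1.2" that sits directly before a closing ")".
WEIGHT_SUFFIX = re.compile(r":\s*-?\d+(?:\.\d+)?\s*(?=\))")
# Any ( ) [ ] that is not escaped with a preceding backslash.
BRACKETS = re.compile(r"(?<!\\)[()\[\]]")


def strip_emphasis(prompt: str) -> str:
    """Remove weights and unescaped emphasis brackets (simplified)."""
    without_weights = WEIGHT_SUFFIX.sub("", prompt)
    return BRACKETS.sub("", without_weights)


print(strip_emphasis("(masterpiece:1.2), [blurry], ((cat))"))
# masterpiece, blurry, cat
```

Pasting the stripped text into the tool gives a count closer to what the UI feeds to CLIP after it has parsed the weights.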