"Is the tag I'm actually training for spread across this dataset?" Drop in your LoRA or fine-tuning caption files (tag .txt) and this tool ranks every tag by how often it occurs and how many images carry it, shown as a count and a percentage. Drop or pick multiple .txt files, or paste captions with one image per line; both sources are merged.
Every row has a percentage bar, so you can quickly spot over-represented tags that sit on nearly every caption (which can make a concept harder to learn) and rare tags that only a handful of images carry. By default, `long_hair`, `long hair`, and letter-case variants are folded into one tag and summed (you can switch to strict counting at any time), and a tag repeated within a single caption is counted once (an option counts every occurrence).
Filter the ranking by tag name on the fly, and export with one click: copy the tag list, or copy a `tag,count,files` CSV to paste straight into a spreadsheet. If you need to reformat the tags themselves (normalize underscores vs spaces, escape parentheses), hand them to the sister tool tag-format; for batch-rewriting many captions, use caption-edit.
Your training data is your own work, so this tool consults no external dictionary or API and does all of its counting locally in your browser.
How to use
- Load your caption .txt files via "Drop caption .txt files here" or "Choose .txt files" (selecting every .txt file in a dataset folder is fine).
- No files? Paste captions on the right with one image per line (a comma-separated tag list per line). Files and pasted text are combined.
- Read the ranking below for each tag's count and image coverage (percentage bar). Filter by tag name, then copy the tag list or CSV to analyze in a spreadsheet.
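The steps above can be approximated offline. The following is a minimal sketch, not the tool's actual code: `collect_captions` and `rank_by_coverage` are hypothetical names, and it assumes each .txt file holds the caption of one image, tags are comma-separated, and blank pasted lines are skipped.

```python
import tempfile
from collections import Counter
from pathlib import Path

def collect_captions(folder=None, pasted=""):
    """Merge captions from .txt files and pasted text into one list."""
    captions = []
    if folder is not None:
        # Assumption: each .txt file holds the caption of exactly one image.
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if text:
                captions.append(text)
    # Pasted text: one image per line; blank lines are skipped (assumption).
    captions.extend(line.strip() for line in pasted.splitlines() if line.strip())
    return captions

def rank_by_coverage(captions):
    """Return (tag, files, percent) rows, most widespread tag first."""
    files = Counter()
    for caption in captions:
        # A set keeps each tag at most once per caption.
        files.update({t.strip() for t in caption.split(",") if t.strip()})
    total = len(captions)
    # Sort by descending coverage; ties are broken alphabetically.
    ranked = sorted(files.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(tag, n, round(100 * n / total, 1)) for tag, n in ranked]

# Demo with two temporary caption files plus two pasted lines.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "001.txt").write_text("1girl, long hair, smile\n", encoding="utf-8")
    Path(tmp, "002.txt").write_text("1girl, short hair\n", encoding="utf-8")
    merged = collect_captions(tmp, pasted="1girl, long hair\n\n1girl, smile")

ranking = rank_by_coverage(merged)
for tag, n, pct in ranking:
    print(f"{tag:<12}{n:>3}{pct:>7}%")
```

In this demo `1girl` lands on all four captions (100%), which is the kind of row the percentage bar makes easy to spot.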
FAQ
Are my caption files or images sent to a server?
No. Both the contents of the .txt files you load and any pasted text are counted entirely in your browser as plain string processing, with no external tag dictionary or API. Nothing is uploaded, stored, or sent — it is all processed on your device.
What is the difference between "Count" and "Files"?
"Files" is the number of captions (images) that contain the tag, which is its coverage across the dataset. "Count" is the total number of times the tag occurred. With the default setting a tag repeated within one caption is counted once, so the two columns match. They differ only when "Count duplicates per file" is on and a tag appears more than once in a single caption.
Are long_hair and long hair summed as the same tag?
Yes. By default "Match _ and space" is on, so underscore and spaced forms are folded together and summed. This keeps datasets that mix booru-style and prompt-style tags consistent. Turn the option off to count them separately.
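One way to picture this folding is a small normalization step applied to every tag before counting. This is an illustrative sketch only: the function name `normalize`, its parameters, and the choice of the spaced lower-case form as the merged key are assumptions, not the tool's documented behavior.

```python
from collections import Counter

def normalize(tag, match_underscore=True, ignore_case=True):
    """Fold spelling variants of a tag into one key (hypothetical helper)."""
    tag = tag.strip()
    if match_underscore:
        # Treat booru-style underscores and prompt-style spaces as the same.
        tag = tag.replace("_", " ")
    if ignore_case:
        tag = tag.lower()
    return tag

tags = ["long_hair", "long hair", "Long_Hair", "smile"]

folded = Counter(normalize(t) for t in tags)
strict = Counter(normalize(t, match_underscore=False, ignore_case=False) for t in tags)

print(folded)  # the three long-hair variants share one key
print(strict)  # every spelling is kept as its own tag
```

With folding on, the three variants are summed under a single key; with both switches off, each spelling is counted separately.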
Is a tag that appears on every image bad for training?
A deliberate trigger word that you want on every image is fine, but if a feature tag you are trying to teach sits on 100% of captions, the model can treat it as a constant and learn it less distinctly. Use the percentage bar to find tags pinned at 100% — the final call depends on your training setup and goal.
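If you export the ranking, the same check can be scripted. The sketch below is only an illustration: `flag_coverage` is a hypothetical helper, and the thresholds (`pinned_at=1.0` for 100% coverage, `rare_max=2` images) are arbitrary values chosen for the example, not recommendations.

```python
def flag_coverage(files_by_tag, total_images, pinned_at=1.0, rare_max=2):
    """Split tags into those pinned at (or above) a coverage ratio and rare ones."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    pinned = sorted(t for t, n in files_by_tag.items() if n / total_images >= pinned_at)
    rare = sorted(t for t, n in files_by_tag.items() if n <= rare_max)
    return pinned, rare

# "Files" column values for a made-up 20-image dataset.
files_by_tag = {"mychar": 20, "red dress": 20, "smile": 9, "umbrella": 1}
pinned, rare = flag_coverage(files_by_tag, total_images=20)

print("on every image:", pinned)
print("rare:", rare)
```

Here `mychar` may be an intended trigger word, while `red dress` at 100% is the kind of feature tag worth a second look.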
How is a tag that appears twice in one caption counted?
By default it is counted once per caption (duplicates folded). To count the repeats, turn on "Count duplicates per file" and the occurrence count adds each appearance. The Files (document frequency) column always reflects the number of images that contain the tag, regardless of this setting.
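The effect of this option can be shown with a short sketch. The function `count_tags` and its `count_duplicates` flag are hypothetical stand-ins for the "Count duplicates per file" setting, and comma-separated tags are assumed.

```python
from collections import Counter

def count_tags(captions, count_duplicates=False):
    """Return (count, files) counters for a list of caption strings."""
    count, files = Counter(), Counter()
    for caption in captions:
        tags = [t.strip() for t in caption.split(",") if t.strip()]
        unique = set(tags)
        # Files: one per caption that contains the tag, whatever the setting.
        files.update(unique)
        # Count: every occurrence, or once per caption when duplicates are folded.
        count.update(tags if count_duplicates else unique)
    return count, files

captions = ["smile, smile, 1girl", "smile"]

count_default, files_default = count_tags(captions)
count_all, files_all = count_tags(captions, count_duplicates=True)

print(count_default["smile"], files_default["smile"])
print(count_all["smile"], files_all["smile"])
```

Only the occurrence count changes between the two runs; the Files value stays at the number of captions that contain the tag.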
Can I analyze the result in Excel or a spreadsheet?
Yes. "Copy CSV" copies the full ranking as `tag,count,files`, ready to paste into Excel or Google Sheets. If you only want the tags, "Copy tags" copies a comma-separated list of all tags in ranking order.
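For reference, a comparable export can be produced with Python's standard `csv` module. This is a sketch under assumptions: the helper names are made up, and whether the tool's copied text includes a header row or uses `", "` as the tag separator is not stated here, so both are guesses.

```python
import csv
import io

def ranking_to_csv(rows):
    """Serialize (tag, count, files) rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["tag", "count", "files"])  # header row is an assumption
    writer.writerows(rows)  # csv.writer quotes fields only when needed
    return buf.getvalue()

def ranking_to_tag_list(rows):
    """Join the tags in ranking order (separator is an assumption)."""
    return ", ".join(tag for tag, _count, _files in rows)

rows = [("1girl", 4, 4), ("long hair", 3, 2), ("hatsune miku (vocaloid)", 1, 1)]
csv_text = ranking_to_csv(rows)
tag_list = ranking_to_tag_list(rows)

print(csv_text)
print(tag_list)
```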