This tool finds near-duplicate images in a dataset you've gathered to train a LoRA or fine-tune Stable Diffusion. Adjacent frames from a burst, a file that has been cropped or brightened a little, a copy that became a different file after a resize or re-save: none of these are caught by exact filename or file-size matching.
The tool shrinks each image to 32×32 grayscale and computes a DCT-based perceptual hash (pHash, 64 bits). Because pHash captures the overall composition and light/dark layout rather than fine detail, images that look alike get similar hashes even at different resolutions or after light edits. The smaller the Hamming distance (the number of differing bits) between two hashes, the more alike the images; the tool links images whose distance falls under a threshold into near-duplicate groups.
Each group shows its first image as the suggested 'keep' and the rest as 'duplicates', and the tool reports how many images you could cut in total (the sum of group sizes minus the number of groups). You can copy all the duplicate filenames in one click to clean up the dataset in your file manager or on the command line. Match strictness has three levels, Strict / Standard / Loose: Strict catches only near-identical shots, while Loose also pulls in 'similar-looking' images.
When a dataset has too many images of the same subject in the same composition, the trained model leans toward those shots and loses variety. Thin out the duplicates that cause that bias before you train.
Everything runs inside your browser; the images you drop are never uploaded, stored, or sent anywhere, so your dataset never leaves your device.
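The linking step and the 'images you could cut' count can be sketched as follows. This is a minimal Python illustration, not the tool's own browser code: the union-find linking, the inclusive `<=` comparison, the function names, and the sample hash values are all assumptions made for the example.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def group_near_duplicates(hashes: dict, threshold: int) -> list:
    """Link every pair within `threshold` bits; return groups of 2+ filenames.

    `hashes` maps filename -> 64-bit hash, in load order.
    Whether the boundary is inclusive (<=) is an assumption.
    """
    names = list(hashes)                      # dicts keep insertion (load) order
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]     # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):       # all-pairs comparison
        if hamming(hashes[a], hashes[b]) <= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra               # keep the earlier image as root

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]


def removable_count(groups: list) -> int:
    """Sum of group sizes minus the number of groups (one 'keep' per group)."""
    return sum(len(g) for g in groups) - len(groups)


# Made-up hashes, chosen only to give small and large distances.
sample = {
    "a.png": 0x0,
    "a_crop.png": 0x1,
    "b.png": 0xFFFF0000,
    "a_small.png": 0x3,
    "b_copy.png": 0xFFFF0001,
    "c.png": 0xFF00FF00FF,
}
groups = group_near_duplicates(sample, threshold=4)
print(groups)                   # [['a.png', 'a_crop.png', 'a_small.png'], ['b.png', 'b_copy.png']]
print(removable_count(groups))  # 3
```

Because linking is transitive, two images can land in the same group through a chain of intermediate matches even if their own distance exceeds the threshold, which is one reason to review each group by eye.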
How to use
- Drop your training images all at once (or click to select many). Detection starts at 2+ images.
- Near-duplicates are grouped together. The first image in each group is the suggested 'keep' and the rest are 'duplicates', shown with their Hamming distance.
- Switch 'match strictness' between Strict / Standard / Loose to tune how aggressively it groups.
- Use 'Copy duplicate filenames' to grab the list of removal candidates and clean up the dataset in your file manager or shell.
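For the shell clean-up in the last step, a cautious approach is to move the listed files into a subfolder rather than delete them outright. The sketch below assumes the copied list contains one filename per line and that the files sit directly in the dataset folder; the function name `quarantine_duplicates` and the `_duplicates` folder name are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_duplicates(dataset_dir, pasted_names, subdir="_duplicates"):
    """Move each listed file into dataset_dir/subdir; return the names moved.

    `pasted_names` is the copied list as text, assumed one filename per line.
    """
    root = Path(dataset_dir)
    target = root / subdir
    target.mkdir(exist_ok=True)
    moved = []
    for line in pasted_names.splitlines():
        name = Path(line.strip()).name      # drop any directory part
        if not name:
            continue                        # skip blank lines
        src = root / name
        if src.is_file():                   # ignore names not in this folder
            shutil.move(str(src), str(target / name))
            moved.append(name)
    return moved
```

Calling `quarantine_duplicates("path/to/dataset", pasted_text)` leaves the 'keep' images in place, and the moved files can be restored or deleted after a visual check.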
FAQ
Are my images uploaded anywhere?
No. Shrinking, hashing, and comparing all happen inside your browser. The images you drop are never uploaded, stored, or sent anywhere — your dataset never leaves your device.
What is a perceptual hash (pHash)?
It shrinks an image to a small grayscale thumbnail and uses a DCT (discrete cosine transform) to turn the broad composition and light/dark layout into a 64-bit fingerprint. Because it captures the overall look rather than fine detail, images that look alike get similar hashes even at different resolutions or after light edits. The fewer differing bits (the Hamming distance) between two hashes, the more alike they are.
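A minimal Python sketch of this kind of hash is shown below. It is not the tool's actual code: it assumes the image has already been decoded and shrunk to a 32×32 grid of grayscale values, and it follows one common pHash recipe (keep the 8×8 lowest-frequency DCT coefficients and threshold them at their median), which may differ from what the tool does. The DCT is left unnormalized for brevity.

```python
import math
from statistics import median

N = 32    # side of the shrunken grayscale thumbnail
LOW = 8   # side of the low-frequency block kept (8 x 8 = 64 bits); an assumption

# DCT-II basis for the LOW lowest frequencies: COS[k][i] = cos((2i + 1) k pi / 2N)
COS = [[math.cos((2 * i + 1) * k * math.pi / (2 * N)) for i in range(N)]
       for k in range(LOW)]


def phash(gray):
    """Return a 64-bit int for a 32x32 grid (list of rows) of grayscale values."""
    # 1-D DCT along each row, keeping only the LOW lowest frequencies.
    rows = [[sum(row[x] * COS[u][x] for x in range(N)) for u in range(LOW)]
            for row in gray]
    # 1-D DCT down each column of that result; coeffs[0] is the DC term.
    coeffs = [sum(rows[y][u] * COS[v][y] for y in range(N))
              for v in range(LOW) for u in range(LOW)]
    # Threshold at the median of the non-DC coefficients (assumed recipe).
    cut = median(coeffs[1:])
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > cut else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic test pattern and a uniformly brightened copy of it.
image = [[128 + 60 * math.sin(0.35 * x + 0.2) * math.cos(0.21 * y)
          + 40 * math.sin(0.13 * (x + 2 * y)) for x in range(N)]
         for y in range(N)]
brighter = [[p + 10 for p in row] for row in image]

print(f"{phash(image):016x}")
print(hamming(phash(image), phash(brighter)))
```

A uniform brightness shift mostly changes the DC term, so the two hashes should differ in few bits, if any. That illustrates why light edits tend to stay within a small Hamming distance.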
How do Strict / Standard / Loose differ?
They change the Hamming-distance threshold for calling two images near-duplicates. Strict groups only very close, near-identical shots; Standard catches the same composition (burst frames, slight re-crops); Loose pulls in merely 'similar-looking' images. Tighten toward Strict if it groups too much; loosen toward Loose if it misses some.
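To make the effect of the threshold concrete, here is a small Python sketch. The numeric thresholds are hypothetical, since the tool's real values are not stated here; only the ordering Strict < Standard < Loose is taken from the description above.

```python
# Hypothetical thresholds in bits (out of 64); the tool's real values may differ.
THRESHOLDS = {"strict": 4, "standard": 10, "loose": 16}


def is_near_duplicate(distance, strictness):
    """True if a pair at this Hamming distance would be grouped at this level."""
    return distance <= THRESHOLDS[strictness]


def levels_that_match(distance):
    """List the strictness levels at which this pair would be grouped."""
    return [level for level in THRESHOLDS if is_near_duplicate(distance, level)]


print(levels_that_match(2))   # ['strict', 'standard', 'loose']
print(levels_that_match(7))   # ['standard', 'loose']
print(levels_that_match(20))  # []
```

Under these assumed values, a pair 7 bits apart is ignored at Strict but grouped at Standard and Loose, so moving between levels only ever adds or removes pairs near the boundary.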
How are 'keep' and 'duplicate' decided?
The first image loaded within a group is marked 'keep' and the rest 'duplicate' — purely as a hint. The tool never deletes any files. Look at the actual images to decide which one to keep.
Will it catch flipped or heavily edited images?
A pHash changes when an image is mirrored, so a flipped copy is usually treated as a different image (it may be a duplicate for your purpose, but this tool won't flag it). Large color/brightness changes or heavy cropping also produce a different fingerprint once the look changes enough. This tool is best at finding near-duplicates that still look almost the same.
How many images can I check at once?
There's no hard limit, but the comparison is all-pairs (it scales with the square of the count), so detection slows down past a few hundred images. The UI avoids freezing while loading, but for very large datasets we recommend checking folder by folder.
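The quadratic growth is easy to check with a few lines of Python. An all-pairs comparison of n images needs n(n − 1)/2 hash comparisons; the folder sizes below are arbitrary examples.

```python
def pair_count(n):
    """Number of unordered pairs among n images: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 300, 1000):
    print(n, pair_count(n))
# 100 4950
# 300 44850
# 1000 499500

# Checking 1000 images as ten separate folders of 100 each:
print(10 * pair_count(100))   # 49500
```

Splitting a 1000-image dataset into ten folders cuts the work roughly tenfold in this example, with the trade-off that duplicates sitting in different folders are never compared with each other.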