What Counts as a Duplicate File? Byte-Identical vs Similar Files Explained

Most Mac users assume duplicate files are easy to spot: same name, same size, done. In practice, that definition causes real problems. Tools that rely on name or size matching routinely flag files that are not actually duplicates, while genuinely identical files sitting in different folders with different names go unnoticed. This article explains what a duplicate file actually is at a technical level, walks through the three detection methods you will encounter, and shows you which one is the only truly reliable standard.

The Core Question: What Is a Duplicate File?

A duplicate file, in the strictest sense, is a file whose contents are completely and exactly identical to another file on the same volume. Not just similar. Not just the same name. Byte for byte, every character and every bit of data must match. That distinction matters enormously when you are deciding what is safe to delete.

There are three ways software can identify potential duplicates, and they differ dramatically in accuracy:

  • Name matching: same filename or filename pattern
  • Size matching: same byte count reported by the filesystem
  • Content hashing: a cryptographic fingerprint of the file's actual data

Understanding why the first two fail is the key to not accidentally deleting files you need.

Name Matching: The Least Reliable Method

Name matching is the simplest and least accurate approach. A scanner looks for files with the same filename and flags them as duplicates. This sounds reasonable until you consider how macOS actually works.

Your system has hundreds of files named Info.plist. Every app bundle contains one. They are not duplicates. They describe completely different applications. Similarly, many apps ship with a file called default.png, icon.icns, or README.txt. Matching on name alone produces enormous numbers of false positives.

Common false positives from name matching:

  • /Applications/Safari.app/Contents/Info.plist and /Applications/Xcode.app/Contents/Info.plist: same name, completely different files
  • Any two apps with a shared asset name like background.png
  • Template files in productivity apps that use a standard naming convention

Name matching has its uses as a first pass to narrow a search space, but it should never be the final criterion for deletion.
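To see why name matching over-reports, the sketch below groups files by filename alone. It is a minimal illustration in Python 3 (assumed to be installed), not how any particular scanner works. The function name group_by_name and the two demo app bundles are made up for this example.

```python
import os
import tempfile
from collections import defaultdict


def group_by_name(root):
    """Map each filename to every path under root that uses that name."""
    groups = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            groups[name].append(os.path.join(folder, name))
    # Keep only names that occur more than once: the "duplicates" by name.
    return {name: sorted(paths) for name, paths in groups.items() if len(paths) > 1}


# Demo: two hypothetical app bundles, each with its own Info.plist.
demo_root = tempfile.mkdtemp()
for app, text in [("A.app", "settings for A"), ("B.app", "settings for B")]:
    os.makedirs(os.path.join(demo_root, app, "Contents"))
    with open(os.path.join(demo_root, app, "Contents", "Info.plist"), "w") as f:
        f.write(text)

name_matches = group_by_name(demo_root)
for name, paths in name_matches.items():
    print(name, "appears", len(paths), "times")
```

Both Info.plist files land in the same group even though their contents differ, which is exactly the false positive described above.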

Size Matching: Better, But Still Unreliable

Size matching adds a second filter: same filename and same byte count. This is more selective than name matching alone, but it still produces false positives and misses real duplicates.

False positives from size matching: Two files can have the same size by coincidence. Two unrelated 8 KB PNG icons that both happen to be named icon.png pass the name check and the size check, yet they contain different images.

Real duplicates missed by size matching: You may have a photo called IMG_4821.JPG in ~/Downloads and the same photo imported into Photos and renamed to 2024-07-15-beach.jpg. Different name, same data. A scanner that requires both the name and the size to match would not catch this.

Size matching also has a metadata problem. macOS reports file sizes in two ways: the "logical size" (actual data bytes) and the "disk usage" (which depends on filesystem allocation blocks). Two files with identical content can show different disk usage depending on how they were created or copied, so a scanner that compares disk usage rather than logical size can miss true duplicates. Relying on size alone adds another layer of uncertainty.
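The two size figures can be read side by side. The following is a small sketch, assuming Python 3 on macOS or another Unix-like system where os.stat exposes st_blocks. The helper name logical_and_allocated is invented for this example.

```python
import os
import tempfile


def logical_and_allocated(path):
    """Return (logical bytes, bytes allocated on disk) for one file."""
    info = os.stat(path)
    # st_size is the logical size: the number of data bytes in the file.
    # st_blocks counts allocated blocks, in 512-byte units on macOS and Linux.
    return info.st_size, info.st_blocks * 512


# Demo with a throwaway 5-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

logical, allocated = logical_and_allocated(demo_path)
print("logical size:", logical, "bytes")
print("allocated on disk:", allocated, "bytes")
os.remove(demo_path)
```

The logical size is always 5 here. The allocated figure depends on the filesystem, which is why it is a poor basis for comparing files.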

Content-Based Duplicate Detection: The Accurate Standard

Content-based duplicate detection, also called hash-based or byte-identical detection, works differently. Instead of comparing names or sizes, it computes a cryptographic hash of each file's actual data and compares those fingerprints.

A hash function such as SHA-256 reads every byte of a file and produces a short fixed-length string. If two files produce the same SHA-256 hash, their contents are identical for all practical purposes. If even a single bit differs, the hash changes. Older tools often use MD5, but different files can be deliberately crafted to share an MD5 hash, so SHA-256 is the safer choice. This is the only method that reliably answers "are these two files exactly the same?"

You can verify this yourself in Terminal. Open Terminal.app and run:

shasum -a 256 ~/Downloads/photo.jpg
shasum -a 256 ~/Pictures/photo-backup.jpg

Each command prints a 64-character hexadecimal hash followed by the file path. If both hashes are the same, the files are byte-identical duplicates. If the hashes differ by even one character, they are not the same file regardless of what their names or sizes say.
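The same comparison can be scripted. This is a minimal sketch using Python's standard hashlib module, assuming Python 3 is installed. The helper names sha256_of and same_content, and the demo files, are illustrative only.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same SHA-256 value."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo: a file, a renamed copy, and a version with one byte changed.
folder = tempfile.mkdtemp()
original = os.path.join(folder, "IMG_4821.JPG")
renamed = os.path.join(folder, "2024-07-15-beach.jpg")
altered = os.path.join(folder, "IMG_4821-edit.JPG")
data = b"pretend these bytes are a photo"
for path, payload in [(original, data), (renamed, data), (altered, data[:-1] + b"!")]:
    with open(path, "wb") as f:
        f.write(payload)

print(same_content(original, renamed))  # True: different name, same bytes
print(same_content(original, altered))  # False: one byte differs
```

Reading in chunks is a design choice for large media files. For small files, hashing the whole content at once would give the same digest.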

Why Hash Matching Is Definitive

A hash collision (two different files producing the same SHA-256 hash) is theoretically possible but so astronomically unlikely that it is not a practical concern for personal storage management. No SHA-256 collision has been publicly demonstrated. For everyday duplicate finding, a matching SHA-256 hash is a reliable confirmation of identical content.

This matters most in cases like:

  • Photos copied from one folder to another with renamed filenames
  • Documents emailed to yourself and also saved from a download
  • App installers downloaded twice under different names
  • Music files in both a "purchased" folder and a Music (formerly iTunes) library

In every case above, name and size matching may fail. Hash matching will not.
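Scanning a whole folder for cases like these might look like the sketch below. It assumes Python 3 and uses one common shortcut: files are grouped by logical size first, because files of different sizes cannot be byte-identical, and only same-size candidates are hashed. This is an illustration, not a description of how any specific tool works. The names find_duplicates and sha256_file and the demo folder layout are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def sha256_file(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root whose contents hash identically."""
    by_size = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_file(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


# Demo: one real duplicate pair, one same-size non-duplicate, one unique file.
demo = tempfile.mkdtemp()
samples = {
    "Downloads/installer.dmg": b"same bytes",
    "Documents/installer-copy.dmg": b"same bytes",
    "Documents/notes.txt": b"diff bytes",  # same size, different content
    "Documents/todo.txt": b"short",
}
for relative, payload in samples.items():
    path = os.path.join(demo, *relative.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)

duplicates = find_duplicates(demo)
for group in duplicates:
    print([os.path.relpath(p, demo) for p in group])
```

Only the two installer files are reported. notes.txt has the same size as the installers but a different hash, so it is correctly left out.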

Duplicate vs Similar: An Important Distinction

Once you understand byte-identical detection, the next question people often ask is: what about files that are almost the same? A photo taken in burst mode, a document with minor edits, two exports of the same design at different resolutions. These are similar files, not duplicate files, and the distinction is significant.

Byte-identical duplicates: Every bit matches. Completely safe to delete one copy without losing any data. The remaining copy is a perfect replacement.

Similar files: Visually close or semantically related, but the data differs. Deleting one may mean losing a slightly different crop, a revision, or a different export setting. These require human judgment.

Some tools blur this line by using perceptual hashing (for images) or fuzzy matching to surface similar-looking files. That can be useful for cleaning up burst photo sets or near-duplicate exports, but it is a fundamentally different operation. You are not removing redundant data. You are curating. Make sure any tool you use is clear about which mode it is operating in before you delete anything.
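To make the difference concrete, here is a toy version of one perceptual technique, the "average hash", written in Python 3. It is a simplified sketch: real tools first decode and downscale the image, while this example starts from tiny hand-written grayscale grids. All names and pixel values are invented for illustration.

```python
def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count positions where two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 4x4 grayscale thumbnails (0 = black, 255 = white).
photo = [
    [200, 200, 40, 40],
    [200, 200, 40, 40],
    [40, 40, 200, 200],
    [40, 40, 200, 200],
]
# Same scene, slightly brighter: every value differs, the picture looks alike.
brighter = [[value + 10 for value in row] for row in photo]
# A different scene entirely.
other = [
    [40, 40, 40, 40],
    [40, 40, 40, 40],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]

print(photo == brighter)  # False: the data is not identical
print(hamming_distance(average_hash(photo), average_hash(brighter)))  # 0
print(hamming_distance(average_hash(photo), average_hash(other)))  # 8
```

A distance of 0 between photo and brighter means "looks alike", not "is the same file". A content hash such as SHA-256 would treat them as different, which is why similar files need a human decision.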

Where True Duplicates Actually Accumulate on macOS

Knowing where byte-identical duplicates tend to pile up helps you search more efficiently:

  • ~/Downloads: The most common source. Installers, PDFs, and ZIP files downloaded multiple times.
  • ~/Desktop: Files dragged here for temporary access and forgotten, often already existing elsewhere.
  • ~/Documents and iCloud Drive: Documents that were emailed as attachments, saved locally, and also exist in cloud sync.
  • External drives and backups: Files copied manually to an external drive that were then also backed up by Time Machine or another tool.
  • Photo libraries: Photos imported multiple times, or photos that exist both inside ~/Pictures/Photos Library.photoslibrary and as loose files in ~/Downloads.

Note that inside the Photos library itself, the actual image files live at a path like ~/Pictures/Photos Library.photoslibrary/originals/. Photos does its own deduplication, so you usually will not find duplicates inside the library bundle. The duplicates are typically loose files outside it.

What to Check Before Deleting a Duplicate

Even with a confirmed hash match, a quick checklist before deletion reduces risk:

  1. Which copy is in the better location? Keep the one in the organized folder, not the one in ~/Downloads.
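  2. Are there symlinks or aliases pointing to either copy? Running ls -la in Terminal shows whether a copy is itself a symlink, but not what points to it. To find symlinks and hard links that resolve to a file, search the folders you care about with a command such as find -L ~/Documents -samefile /path/to/copy. Finder aliases will not appear in either listing. Deleting a file that other files reference can break those references.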
  3. Is one copy inside an app bundle? Files inside .app packages are part of that app and should not be removed individually.
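  4. Do the copies carry different dates? With a confirmed hash match the contents are identical whatever the timestamps say, so a newer modification date does not mean newer content. Dates still matter if you sort or search by them, so keep the copy whose timestamps you want to preserve. If either file may have been edited since the scan, hash both again before deleting.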

A good duplicate finder will surface this context alongside the hash match so you can make an informed decision rather than approving a bulk delete blindly.
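Part of the checklist above can be gathered automatically. The sketch below, assuming Python 3, collects a few facts about a single copy. It is illustrative only: the function name deletion_context and the demo paths are made up, and it does not detect Finder aliases or symlinks elsewhere on disk that point at the file.

```python
import os
import tempfile
import time


def deletion_context(path):
    """Collect facts worth reviewing before removing one copy of a duplicate."""
    info = os.lstat(path)  # lstat describes the entry itself, not a link target
    parent_parts = os.path.abspath(path).split(os.sep)[:-1]
    return {
        "is_symlink": os.path.islink(path),
        # More than 1 means another name on the same volume shares this data.
        "hard_link_count": info.st_nlink,
        "inside_app_bundle": any(part.endswith(".app") for part in parent_parts),
        "modified": time.strftime("%Y-%m-%d %H:%M", time.localtime(info.st_mtime)),
    }


# Demo: the same bytes inside a hypothetical app bundle and loose in Downloads.
demo = tempfile.mkdtemp()
bundled = os.path.join(demo, "Demo.app", "Contents", "Resources", "icon.png")
loose = os.path.join(demo, "Downloads", "icon.png")
for path in (bundled, loose):
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as f:
        f.write(b"same icon bytes")

for path in (bundled, loose):
    print(os.path.relpath(path, demo), deletion_context(path))
```

In this demo the copy under Demo.app is flagged as inside an app bundle, so the loose copy in Downloads is the one to consider removing.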

How Crumb Handles Duplicate Detection

Crumb uses byte-identical hashing to identify true duplicates, not name or size matching. When it scans for duplicates, it computes content fingerprints and groups files that are provably identical. Before anything is removed, Crumb shows you a reviewable plan: which files are in each group, where they live, and how much space you would recover. It also runs an "is this safe to delete?" check so you can confirm each removal makes sense. Everything runs on-device with no account required.

Understanding what counts as a duplicate file, and insisting on content-based detection rather than name or size shortcuts, is the difference between reclaiming real space and accidentally deleting files that only looked like copies.

Frequently asked questions

What is a duplicate file on a Mac?
A duplicate file is one whose contents are byte-for-byte identical to another file on the same storage volume. The name and location can be completely different. The only reliable way to confirm a duplicate is to compare cryptographic hashes of both files' actual data, not just their names or reported sizes.

Is a file with the same name as another file a duplicate?
Not necessarily. macOS has hundreds of files named things like Info.plist or icon.icns because every app uses common naming conventions for required components. Two files can share a name while containing completely different data. Name matching alone produces far too many false positives to be a reliable duplicate detection method.

What is the difference between a duplicate file and a similar file?
A duplicate file is byte-identical to another file: every bit of data matches, and either copy is a perfect replacement for the other. A similar file is one that resembles another visually or semantically (like a burst-mode photo or a lightly edited document) but whose actual data differs. Duplicates are safe to delete one copy of; similar files require human judgment about which version to keep.

How does hash-based duplicate detection work?
A hash function reads every byte of a file and produces a short fixed-length fingerprint string. If two files produce the same hash (using an algorithm like SHA-256), their contents are identical. You can test this yourself by running shasum -a 256 followed by the file path in Terminal and comparing the output for two files you suspect are duplicates.

Where do duplicate files most commonly build up on macOS?
The Downloads folder is the most common source, since installers, PDFs, and archives are often downloaded more than once. Duplicates also accumulate when files are copied manually to external drives and then also captured by a backup tool, or when documents exist both as email attachments saved locally and as cloud-synced copies in iCloud Drive.