Skip to main content

How to Find and Delete Duplicate Files in Batch Script

Duplicate files are a common source of wasted disk space and digital clutter. Finding and removing them can be a tedious manual task, making it a perfect candidate for automation. While Windows does not provide a single command to find duplicates, a batch script can be built to perform this task by comparing the unique digital "fingerprint" of every file, its cryptographic hash.

This guide will teach you how to create a powerful script that recursively scans a directory tree, calculates a hash for each file, and identifies duplicates based on identical hashes. We will provide a safe "report-only" version of the script and then, with critical safety warnings, show how it can be modified to automatically delete the redundant copies.

The Challenge of Finding True Duplicates

Simply comparing filenames is unreliable, as two different files can have the same name. Comparing file sizes is a good first step, but it's not foolproof. The only truly reliable way to determine if two files have the exact same content is to calculate and compare their cryptographic hash (e.g., MD5 or SHA1). A hash is a unique string generated from the file's content; if two files have the same hash, their content is identical.

The Core Method: Comparing File Hashes

Our script will use the following logic:

  1. Use a recursive FOR /R loop to scan every file in a directory tree.
  2. For each file, use the built-in CertUtil command to calculate its hash.
  3. Use an "associative array" trick with environment variables to keep a record of every unique hash we've seen.
  4. If we calculate a hash and find it's already in our record, we have found a duplicate. The first file we saw with that hash is considered the "original," and the current file is the "duplicate."

Script: Finding and Reporting Duplicates (Safe Version)

This is the recommended version of the script. It does not delete anything. It only creates a report of which files are duplicates and which file is being kept as the original.

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION

SET "START_FOLDER=C:\Users\Admin\Documents"

ECHO --- Duplicate File Finder (Report Only) ---
ECHO Scanning folder: %START_FOLDER%
ECHO This may take a very long time...
ECHO.

REM Clear any previous hash records
FOR /F "delims==" %%V IN ('SET hash_') DO SET "%%V="

FOR /R "%START_FOLDER%" %%F IN (*) DO (
IF %%~zF GTR 0 (
REM Get the file hash (default is SHA1)
FOR /F "skip=1 delims=" %%H IN ('CertUtil -hashfile "%%~fF"') DO (
SET "HASH=%%H"
SET "HASH=!HASH: =!"

REM Check if we have seen this hash before
IF DEFINED hash_[!HASH!] (
ECHO Found Duplicate: "%%~fF"
ECHO ...is a duplicate of: !hash_[!HASH!]!
ECHO.
) ELSE (
REM If not, store this file as the "original" for this hash
SET "hash_[!HASH!]=%%~fF"
)
GOTO :NextHash
)
:NextHash
)
)

ECHO Scan complete.
ENDLOCAL

How the script works:

  • SETLOCAL ENABLEDELAYEDEXPANSION: Essential for reading and writing to the hash_ variables inside the loop.
  • FOR /F ... IN ('SET hash_') DO ...: This is a cleanup routine that clears any hash_ variables from a previous run.
  • FOR /R "%START_FOLDER%" %%F IN (*): The main recursive loop that finds every file. We also check IF %%~zF GTR 0 to ignore empty files.
  • CertUtil -hashfile "%%~fF": Calculates the SHA1 hash for the current file (%%~fF).
  • SET "HASH=!HASH: =!": The hash from CertUtil contains spaces, so this line removes them to create a clean key.
  • IF DEFINED hash_[!HASH!]: This is the core logic. It checks if a variable named hash_ followed by the actual hash string (e.g., hash_a94a8fe5...) already exists.
  • SET "hash_[!HASH!]=%%~fF": If the variable does not exist, this is the first time we've seen this hash. We create the variable and set its value to the full path of the current file, marking it as the "original."

Critical Safety Warning: The Dangers of Automated Deletion

Automatically deleting files is inherently risky. Before you even consider modifying the script to delete files, understand the following:

  • No Recycle Bin: Files deleted with the DEL command are permanently gone. There is no undo.
  • The "Original" is Arbitrary: The script keeps the first file it happens to find and deletes all subsequent copies. This might not be the copy you wanted to keep (e.g., it might keep a file from a temporary folder and delete the one in your main project folder).
  • Hash Collisions: While astronomically rare, it is theoretically possible for two different files to produce the same hash.
  • Legitimate Duplicates: You may have valid reasons for having identical files in different locations.
note

Recommendation: Always run the "report-only" script first. Review the output carefully. Only use the deleting version if you have a full backup and accept the risks.

The Deletion Logic (Advanced and Potentially Dangerous)

To modify the script to delete duplicates, you change the IF DEFINED block. Instead of echoing a report, you will execute a DEL command on the current file (%%~fF).

REM ... inside the main FOR /R loop ...

IF DEFINED hash_[!HASH!] (
ECHO Deleting duplicate: "%%~fF"
ECHO (Original located at: !hash_[!HASH!]!)
ECHO.

REM --- THIS IS THE DELETION COMMAND ---
DEL "%%~fF"

) ELSE (
SET "hash_[!HASH!]=%%~fF"
)

This change transforms the script from a reporting tool into a destructive one. Use with extreme caution.

Common Pitfalls and How to Solve Them

Performance on Large Directories

This script is very slow. Calculating a cryptographic hash requires reading the entire contents of every single file. On a large drive with many gigabytes of data, this script can take hours to run.

Solution: There is no pure-batch solution to this. The best you can do is narrow the START_FOLDER to be as specific as possible. For better performance, a preliminary check to group files by size first would be necessary, but this adds significant complexity.

The "First File Wins" Assumption

The script's logic assumes the first file it encounters is the one to keep. The order in which FOR /R finds files is generally alphabetical but should not be relied upon for critical logic.

Solution: There is no simple scripting solution to this. The safest approach is to use the report-only script and make manual deletion decisions. A more advanced script could be written to apply a rule (e.g., "keep the file with the shortest path"), but this is highly specific to the user's needs.

Conclusion

Finding duplicate files with a batch script is a powerful demonstration of what is possible with CertUtil and advanced loop logic. However, it is also a task that carries significant risk if automated deletion is involved.

  • The report-only script is a safe and highly effective tool for identifying where duplicate files exist, allowing you to manually review and delete them.
  • The deleting version is a powerful but dangerous tool that should only be used after running a report, making a full backup of your data, and understanding that it will permanently delete files based on its "first-found" logic.

For any critical system, always prefer the safe, report-only approach.