How to Extract Unique Lines from a File in Batch Script

When dealing with large datasets, such as server logs, lists of user IDs, or inventories of hardware, you often encounter redundant entries. Extracting unique lines (deduplication) is the process of stripping away these duplicates so that each piece of information appears only once. This is essential for accurate calculations, such as counting how many unique users actually logged in, rather than how many login events occurred.

In this guide, we will demonstrate two ways to extract unique lines: a native Batch approach that combines the sort command with line-by-line comparison, and a PowerShell bridge built on Sort-Object -Unique.

Method 1: The "Unique-Only" Filter (Native Batch)

The most effective way to deduplicate in Batch is to first sort the file so duplicates are grouped together, and then use a for /f loop to compare each line to the one before it.

Implementation Script

@echo off
setlocal disabledelayedexpansion

set "InputFile=Data.txt"
set "OutputFile=Unique_Data.txt"
set "TempSorted=%TEMP%\sorted_uniq_%RANDOM%.txt"

:: Verify source file exists
if not exist "%InputFile%" (
echo [ERROR] Source file "%InputFile%" not found.
pause
exit /b 1
)

echo Extracting unique lines from "%InputFile%"...

:: 1. Sort the file so duplicates are adjacent
sort "%InputFile%" /o "%TempSorted%"

:: 2. Filter out the repeats by comparing each line to the previous
set "prev="
(
    for /f "usebackq delims=" %%A in ("%TempSorted%") do (
        set "current=%%A"
        setlocal enabledelayedexpansion
        if "!current!" neq "!prev!" (
            echo(!current!
            endlocal
            set "prev=%%A"
        ) else (
            endlocal
        )
    )
) > "%OutputFile%"

:: 3. Clean up temp file
del "%TempSorted%" 2>nul

echo [SUCCESS] Unique lines saved to "%OutputFile%".
pause
exit /b 0
warning

The for /f loop skips blank lines by design and also skips lines beginning with ; (the default eol character). If your input file contains blank lines or lines starting with ;, those lines will be silently dropped from the output regardless of whether they are unique. The PowerShell method (Method 2) does not have these limitations.
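If you need the native method to keep such lines, a common workaround is to read the file through findstr /n, which prefixes every line with its line number so that nothing looks blank or starts with ; by the time for /f sees it; the prefix is then removed with a substring substitution. The fragment below is only a sketch of that reading technique (it reuses the Data.txt name from the script above); the sort-and-compare logic would still be layered on top of it.

@echo off
setlocal disabledelayedexpansion
:: findstr /n "^" prefixes every line with "N:", so no line reaches for /f
:: blank or starting with ";".
for /f "delims=" %%A in ('findstr /n "^" "Data.txt"') do (
    set "line=%%A"
    setlocal enabledelayedexpansion
    rem Strip everything up to and including the first ":" to recover the line.
    echo(!line:*:=!
    endlocal
)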

tip

The script uses the delayed expansion toggle pattern: each line is set with delayed expansion disabled (set "current=%%A" and set "prev=%%A") to preserve literal ! characters in the file content. It is then compared and output with delayed expansion enabled (if "!current!" neq "!prev!" and echo(!current!) to safely handle &, |, >, <, and other special characters.
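You can see why the toggle matters with a small standalone experiment (the variable names and sample text below are purely illustrative):

@echo off

:: Assigning while delayed expansion is already ON strips the lone "!".
setlocal enabledelayedexpansion
set "bad=Warning! Disk full"
echo(!bad!
endlocal

:: Assigning while it is OFF, then enabling it only to read the value back,
:: preserves the "!" exactly as the toggle pattern in the script does.
setlocal disabledelayedexpansion
set "good=Warning! Disk full"
setlocal enabledelayedexpansion
echo(!good!
endlocal
endlocal

The first echo prints Warning Disk full with the exclamation mark eaten, while the second prints Warning! Disk full intact.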

Method 2: The PowerShell Bridge (Sort-Object -Unique)

If you are on Windows 10 or later, the PowerShell bridge is significantly faster and handles special characters, blank lines, and semicolons safely.

Implementation Script

@echo off
setlocal

set "Source=Source.txt"
set "Dest=Unique_Output.txt"

:: Verify source file exists
if not exist "%Source%" (
echo [ERROR] Source file "%Source%" not found.
pause
exit /b 1
)

echo Extracting unique lines from "%Source%"...

:: Sort-Object -Unique sorts and deduplicates in a single pass
powershell -NoProfile -Command ^
    "Get-Content -Path '%Source%' | " ^
    "Sort-Object -Unique | " ^
    "Set-Content -Path '%Dest%' -Encoding UTF8"

if %errorlevel% equ 0 (
    echo [SUCCESS] Unique lines saved to "%Dest%".
) else (
    echo [ERROR] PowerShell deduplication failed.
    pause
    exit /b 1
)
pause
exit /b 0
info

The Sort-Object -Unique cmdlet performs both sorting and deduplication in a single pass. The approach of piping through Sort-Object and then Get-Unique as separate steps works but is less efficient. Get-Unique also requires its input to be pre-sorted, so combining them into one cmdlet eliminates the risk of accidentally piping unsorted data to the uniqueness filter.
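For reference, the two-step form would look like the fragment below (reusing the Source and Dest variables from the script above); it is shown only for comparison, not as a recommendation:

:: Two-step alternative: Get-Unique only drops adjacent duplicates, so the
:: data must already be sorted by the time it reaches that cmdlet.
powershell -NoProfile -Command ^
    "Get-Content -Path '%Source%' | " ^
    "Sort-Object | Get-Unique | " ^
    "Set-Content -Path '%Dest%' -Encoding UTF8"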

Comparison: Batch vs. PowerShell

Feature              Batch Method                  PowerShell Method
Speed                Slow for large files          Fast for large files
Blank lines          Dropped silently              Preserved
Semicolon lines      Dropped silently              Preserved
Special characters   Safe with toggle pattern      Inherently safe
Case sensitivity     Case-sensitive by default     Case-insensitive by default
Memory usage         Low (disk-based temp file)    Higher (in-memory array)

Why Extract Unique Lines?

  1. Auditing: Convert a list of 10,000 firewall hits into a list of the 50 unique IP addresses involved.
  2. Clean Lists: Ensure an email distribution list does not message the same recipient twice, or that a deployment manifest does not ship the same file to the same target twice.
  3. Efficiency: Reducing the size of a text file before performing a secondary, expensive operation (like a web query) saves significant time and bandwidth.

Best Practices

  1. Verify Source File: Always check that the input file exists before processing. A missing input file will cause sort or Get-Content to report an error and leave you with an empty or missing output file.
  2. Case Sensitivity: The Batch if neq comparison is case-sensitive by default, so Apple and apple are treated as different entries. Use if /i for case-insensitive deduplication. The PowerShell Sort-Object -Unique is case-insensitive by default. Use Sort-Object -Unique -CaseSensitive to treat Apple and apple as distinct entries.
  3. Handling Blanks: If you want to remove blank lines while extracting unique lines, combine the Batch method with a findstr /v /r "^$" filter on the output. The PowerShell method preserves blank lines but deduplicates them to a single occurrence.
  4. Column-Based Deduplication: If your file is a CSV and you want to extract unique entries based on only one column (e.g., unique usernames from a "Username,Date,Action" log), set tokens and delims in your for /f loop accordingly, or use Sort-Object -Property in PowerShell to target a specific field, as shown in the sketch after this list.
  5. Temp File Hygiene: Use the %TEMP% directory for intermediate files and include %RANDOM% in the filename to avoid collisions when multiple instances run simultaneously. Always delete the temp file at the end of the script.
  6. Encoding: Include -Encoding UTF8 in the PowerShell Set-Content command to prevent non-ASCII characters from being corrupted by the default ANSI encoding in PowerShell 5.1.
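
As an illustration of point 4, the sketch below deduplicates a CSV on a single column using PowerShell's CSV cmdlets. The file names Access_Log.csv and Unique_Users.csv and the column name Username are placeholders, and the sketch assumes the log has a header row:

:: Sort on the Username column; -Unique keeps one row per distinct username.
:: Import-Csv reads the header, Export-Csv writes the cleaned file back out.
powershell -NoProfile -Command ^
    "Import-Csv -Path 'Access_Log.csv' | " ^
    "Sort-Object -Property Username -Unique | " ^
    "Export-Csv -Path 'Unique_Users.csv' -NoTypeInformation -Encoding UTF8"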

Conclusion

Extracting unique lines is a fundamental data-cleaning task that ensures your automation is working with the most efficient dataset possible. By using the "sort and compare" logic for native Batch or the Sort-Object -Unique bridge for PowerShell, you can quickly turn bloated, redundant logs into precise, actionable inventories. This attention to data quality is what separates a basic script from a professional, production-ready tool.