How to Extract Unique Lines from a File in Batch Script
When dealing with large datasets, such as server logs, lists of user IDs, or hardware inventories, you often encounter redundant entries. Extracting unique lines (deduplication) strips away these duplicates so that each piece of information appears only once. This is essential for accurate counts, such as how many distinct users logged in rather than how many login events occurred.
In this guide, we will demonstrate how to extract unique lines using the sort command and comparison logic.
Method 1: The "Unique-Only" Filter (Native Batch)
The most effective way to deduplicate in Batch is to first sort the file so duplicates are grouped together, and then use a for /f loop to compare each line to the one before it.
Implementation Script
@echo off
setlocal disabledelayedexpansion
set "InputFile=Data.txt"
set "OutputFile=Unique_Data.txt"
set "TempSorted=%TEMP%\sorted_uniq_%RANDOM%.txt"
:: Verify source file exists
if not exist "%InputFile%" (
    echo [ERROR] Source file "%InputFile%" not found.
    pause
    exit /b 1
)
echo Extracting unique lines from "%InputFile%"...
:: 1. Sort the file so duplicates are adjacent
sort "%InputFile%" /o "%TempSorted%"
:: 2. Filter out the repeats by comparing each line to the previous
set "prev="
(
    for /f "usebackq delims=" %%A in ("%TempSorted%") do (
        set "current=%%A"
        setlocal enabledelayedexpansion
        if "!current!" neq "!prev!" (
            echo(!current!
            endlocal
            set "prev=%%A"
        ) else (
            endlocal
        )
    )
) > "%OutputFile%"
:: 3. Clean up temp file
del "%TempSorted%" 2>nul
echo [SUCCESS] Unique lines saved to "%OutputFile%".
pause
exit /b 0
The for /f loop skips blank lines by design and also skips lines beginning with ; (the default eol character). If your input file contains blank lines or lines starting with ;, those lines will be silently dropped from the output regardless of whether they are unique. The PowerShell method (Method 2) does not have these limitations.
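If you need the native method to keep blank lines and lines starting with ;, a common workaround is to run the input through findstr /n, which prefixes every line with its line number and a colon, so no line arrives at for /f empty or starting with the eol character. A minimal sketch of the prefix-and-strip idea (graft it into the comparison loop as needed; the file name is illustrative):

```batch
@echo off
setlocal disabledelayedexpansion
:: findstr /n "^" prefixes every line with "N:", so no line is blank
:: and none begins with ";" - for /f therefore drops nothing.
for /f "usebackq delims=" %%A in (`findstr /n "^" "Data.txt"`) do (
    set "line=%%A"
    setlocal enabledelayedexpansion
    rem Strip everything up to and including the first ":" (the number prefix)
    echo(!line:*:=!
    endlocal
)
```

With usebackq, the back-quoted string is executed as a command, and the !line:*:=! substring replacement removes the injected prefix before the line is emitted.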
The script uses the delayed expansion toggle pattern: each line is set with delayed expansion disabled (set "current=%%A" and set "prev=%%A") to preserve literal ! characters in the file content. It is then compared and output with delayed expansion enabled (if "!current!" neq "!prev!" and echo(!current!) to safely handle &, |, >, <, and other special characters.
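The reason for the toggle can be shown in isolation. In this minimal, hypothetical demonstration, capturing a line while delayed expansion is enabled destroys a ! character, while capturing it with delayed expansion disabled preserves it:

```batch
@echo off
:: Capturing while delayed expansion is ON: the lone "!" is consumed
:: when the line is parsed, so it never reaches the variable.
setlocal enabledelayedexpansion
for /f "delims=" %%A in ("bang!bang") do set "wrong=%%A"
echo Captured while enabled:  !wrong!
endlocal

:: Capturing while delayed expansion is OFF preserves the "!" ...
setlocal disabledelayedexpansion
for /f "delims=" %%A in ("bang!bang") do set "right=%%A"
:: ... and the value can be displayed safely after re-enabling.
setlocal enabledelayedexpansion
echo Captured while disabled: !right!
```

The first echo shows the ! stripped from the captured text; the second shows it intact, which is exactly the behavior the toggle pattern in the script above relies on.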
Method 2: The PowerShell Bridge (Recommended for Large Files)
PowerShell ships with every supported version of Windows, and for large files the bridge is significantly faster than the native loop while handling special characters, blank lines, and semicolons safely.
Implementation Script
@echo off
setlocal
set "Source=Source.txt"
set "Dest=Unique_Output.txt"
:: Verify source file exists
if not exist "%Source%" (
    echo [ERROR] Source file "%Source%" not found.
    pause
    exit /b 1
)
echo Extracting unique lines from "%Source%"...
:: Sort-Object -Unique sorts and deduplicates in a single pass
powershell -NoProfile -Command ^
    "Get-Content -Path '%Source%' | " ^
    "Sort-Object -Unique | " ^
    "Set-Content -Path '%Dest%' -Encoding UTF8"
if %errorlevel% equ 0 (
    echo [SUCCESS] Unique lines saved to "%Dest%".
) else (
    echo [ERROR] PowerShell deduplication failed.
    pause
    exit /b 1
)
pause
exit /b 0
The Sort-Object -Unique cmdlet performs both sorting and deduplication in a single pass. The approach of piping through Sort-Object and then Get-Unique as separate steps works but is less efficient. Get-Unique also requires its input to be pre-sorted, so combining them into one cmdlet eliminates the risk of accidentally piping unsorted data to the uniqueness filter.
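For comparison, the two-step pipeline mentioned above looks like this (file names are illustrative). Note that the default case behavior of the two filters differs: Get-Unique compares case-sensitively, while Sort-Object -Unique does not, so mixed-case data may come out differently:

```batch
@echo off
:: Less efficient alternative: sort first, then filter adjacent duplicates.
:: Get-Unique only removes *adjacent* repeats, so it must follow Sort-Object.
powershell -NoProfile -Command ^
    "Get-Content -Path 'Source.txt' | " ^
    "Sort-Object | " ^
    "Get-Unique | " ^
    "Set-Content -Path 'Unique_Output.txt' -Encoding UTF8"
```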
Comparisons: Batch vs. PowerShell
| Feature | Batch Method | PowerShell Method |
|---|---|---|
| Speed | Slow for large files | Fast for large files |
| Blank lines | Dropped silently | Preserved |
| Semicolon lines | Dropped silently | Preserved |
| Special characters | Safe with toggle pattern | Inherently safe |
| Case sensitivity | Case-sensitive by default | Case-insensitive by default |
| Memory usage | Low (disk-based temp file) | Higher (in-memory array) |
Why Extract Unique Lines?
- Auditing: Convert a list of 10,000 firewall hits into a list of the 50 unique IP addresses involved.
- Clean Lists: Ensure an email distribution list or a deployment manifest does not send the same file or message to the same target twice.
- Efficiency: Reducing the size of a text file before performing a secondary, expensive operation (like a web query) saves significant time and bandwidth.
Best Practices
- Verify Source File: Always check that the input file exists before processing. A missing file will cause sort to fail silently or PowerShell to throw an exception.
- Case Sensitivity: The Batch if ... neq comparison is case-sensitive by default, so Apple and apple are treated as different entries. Use if /i for case-insensitive deduplication. The PowerShell Sort-Object -Unique is case-insensitive by default; use Sort-Object -Unique -CaseSensitive to treat Apple and apple as distinct entries.
- Handling Blanks: The Batch method already drops blank lines, as noted above. To remove blank lines from the PowerShell output as well, run a findstr /v /r "^$" filter on the output file. Otherwise the PowerShell method preserves blank lines, deduplicated to a single occurrence.
- Column-Based Deduplication: If your file is a CSV and you want to extract unique entries based on only one column (e.g., unique usernames from a "Username,Date,Action" log), set tokens and delims in your for /f loop accordingly, or use Sort-Object -Property in PowerShell to target a specific field.
- Temp File Hygiene: Use the %TEMP% directory for intermediate files and include %RANDOM% in the filename to avoid collisions when multiple instances run simultaneously. Always delete the temp file at the end of the script.
- Encoding: Include -Encoding UTF8 in the PowerShell Set-Content command to prevent non-ASCII characters from being corrupted by the default ANSI encoding in Windows PowerShell 5.1.
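As a concrete illustration of the column-based bullet above, the following hedged sketch pulls unique usernames out of a hypothetical "Username,Date,Action" log (file names are illustrative). Because the username is the first field, sorting whole lines also groups the usernames together:

```batch
@echo off
setlocal disabledelayedexpansion
set "Log=Activity.csv"
set "TempSorted=%TEMP%\users_sorted_%RANDOM%.txt"
sort "%Log%" /o "%TempSorted%"
set "prev="
(
    rem tokens=1 delims=, keeps only the first CSV field (the username)
    for /f "usebackq tokens=1 delims=," %%U in ("%TempSorted%") do (
        set "current=%%U"
        setlocal enabledelayedexpansion
        if /i "!current!" neq "!prev!" (
            echo(!current!
            endlocal
            set "prev=%%U"
        ) else (
            endlocal
        )
    )
) > "Unique_Users.txt"
del "%TempSorted%" 2>nul
```

The PowerShell equivalent would import the CSV and target the column directly, e.g. Import-Csv piped through Sort-Object -Property Username -Unique, assuming the file has a header row.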
Conclusion
Extracting unique lines is a fundamental data-cleaning task that ensures your automation is working with the most efficient dataset possible. By using the "sort and compare" logic for native Batch or the Sort-Object -Unique bridge for PowerShell, you can quickly turn bloated, redundant logs into precise, actionable inventories. This attention to data quality is what separates a basic script from a professional, production-ready tool.