How to Extract URLs from a Text File in Batch Script
Whether you are scraping links from a saved webpage, auditing a browser's history file, or extracting download sources from a script, URL extraction is a vital productivity task. Manually searching a large text file for "https://" is slow and error-prone. Automating the process lets you generate a clean manifest of web resources instantly.
In this guide, we demonstrate two ways to extract URLs with regular expressions: a fast findstr line filter that captures whole lines, and a PowerShell scraper that returns only the URLs themselves.
Method 1: The Protocol Filter (FINDSTR)
The findstr command can quickly isolate every line that contains a web protocol (http or https).
This method extracts entire lines that contain a URL, not just the URL itself. It is the fastest approach for a quick scan of small text files.
Implementation Script
@echo off
setlocal
set "Source=Documentation.txt"
set "Output=LinkManifest.txt"
if not exist "%Source%" (
echo [ERROR] Source file "%Source%" not found.
pause
exit /b 1
)
echo Extracting URLs from %Source%...
:: /I makes the search case-insensitive
:: /R enables regex mode
:: https*:// matches "http://" or "https://" (the s is optional)
findstr /I /R "https*://" "%Source%" > "%Output%"
echo [DONE] Lines containing links saved to %Output%.
endlocal
pause
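To make the line-level behavior concrete, here is a minimal, self-contained demo (the sample file name and URLs are hypothetical): it writes a three-line sample file and runs the same filter. Notice that the surrounding prose on each matching line is kept, which is exactly the limitation Method 2 addresses.
@echo off
:: Hypothetical demo: create a small sample file, then apply the Method 1 filter
> "Documentation.txt"  echo See the API docs at https://api.example.com/v2 for details.
>> "Documentation.txt" echo Internal build notes with no links at all.
>> "Documentation.txt" echo Mirror: http://mirror.example.org/files
findstr /I /R "https*://" "Documentation.txt"
:: Expected output keeps the whole matching lines, not just the URLs:
::   See the API docs at https://api.example.com/v2 for details.
::   Mirror: http://mirror.example.org/files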
Method 2: The Precision Link Scraper (PowerShell)
Standard findstr will give you the entire line where a link was found. If you want a list of just the URLs (without the surrounding text), the PowerShell regex bridge is required.
This method outputs only the matched URLs, one per line, sorted and deduplicated. It is ideal for feeding results into bulk downloaders like curl or wget.
Implementation Script
@echo off
setlocal
set "Source=scraped_data.txt"
set "Output=urls_only.txt"
if not exist "%Source%" (
echo [ERROR] Source file "%Source%" not found.
pause
exit /b 1
)
echo Scraping clean URLs...
:: This regex identifies the protocol followed by non-space characters
:: Stops at quotes, angle brackets, and whitespace to avoid capturing delimiters
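:: Quote escaping note: \"" passes a literal double quote through CMD's parser
:: into PowerShell, and '' is PowerShell's escape for a single quote inside a
:: single-quoted string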
powershell -NoProfile -Command ^
"$content = Get-Content -Raw '%Source%';" ^
"$regex = 'https?://[^\s\""''<>]+';" ^
"[regex]::Matches($content, $regex) | ForEach-Object { $_.Value } | Sort-Object -Unique | Set-Content '%Output%'"
echo [DONE] Clean URL list created in %Output%.
endlocal
pause
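Because the output is one URL per line, it can be fed straight into a downloader. Here is a minimal sketch, assuming urls_only.txt was produced by the script above and curl.exe is available on the PATH:
@echo off
:: Hypothetical follow-up (sketch): download every extracted URL with curl.
:: -L follows redirects; -O saves each file under its remote name.
for /F "usebackq delims=" %%U in ("urls_only.txt") do (
    echo Downloading %%U
    curl -L -O "%%U"
)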
Why Extract URLs?
- Automation: Extracting download links from a readme file to feed into a bulk downloader like curl or wget.
- Compliance: Auditing a company's internal wiki or chat logs to ensure no links to unapproved external file-sharing sites are present.
- Security: Scanning logs for phishing-like URLs or suspicious top-level domains (see the filter sketch after this list).
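As a sketch of that security use case, the list produced by Method 2 can be post-filtered with findstr in literal mode. The TLD substrings below are illustrative only, not a vetted blocklist:
@echo off
:: Hypothetical post-filter (sketch): flag extracted URLs containing TLD
:: substrings you consider suspicious. /L treats the search strings as
:: literals, /I ignores case, and space-separated strings act as an OR.
:: Substring matching is approximate (".tk/" misses URLs with no trailing slash).
findstr /I /L ".tk/ .top/ .zip/" "urls_only.txt" > "suspicious_urls.txt"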
Best Practices
- Handle Delimiters: Many text files wrap URLs in quotes ("https://...") or angle brackets (<https://...>). Your regex (see Method 2) should stop at these delimiter characters rather than including them in the extracted URL.
- Deduplication: Many documents link to the same main page repeatedly. Always sort and deduplicate your list to keep the output concise.
- Incomplete Links: Be aware that some files use protocol-relative links (starting with //www...) or relative paths (/index.html). If you need these, your search pattern must be much broader; see the sketch below.
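If protocol-relative links matter in your data, one possible widening of the Method 2 pattern makes the protocol prefix optional. This is a sketch, not a drop-in replacement: a bare // anchor can also match non-URL text such as code comments, and relative paths still require document-specific rules. The file names here are hypothetical.
@echo off
setlocal
:: Hypothetical broader pattern (sketch): "(?:https?:)?" makes the protocol
:: optional, so protocol-relative links such as //cdn.example.com/lib.js are
:: also captured alongside ordinary http/https URLs.
set "Source=scraped_data.txt"
set "Output=all_links.txt"
powershell -NoProfile -Command ^
"$content = Get-Content -Raw '%Source%';" ^
"$regex = '(?:https?:)?//[^\s\""''<>]+';" ^
"[regex]::Matches($content, $regex) | ForEach-Object { $_.Value } | Sort-Object -Unique | Set-Content '%Output%'"
endlocal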
Conclusion
Extracting URLs turns a cluttered text document into a clickable inventory of web resources. Native Batch commands offer a lightning-fast way to find lines containing links, while the PowerShell bridge provides the precision needed for automated toolchains and high-volume data scraping. Master both patterns and you can harvest web resources directly from your terminal.