How to Convert a PDF to Text in Batch Script
Batch scripts can process text and strings, but they have no built-in way to parse binary formats such as PDF. To extract text from a PDF for automated analysis or indexing, you need a command-line utility. The open-source pdftotext tool from the Poppler utilities is a widely used choice for this task.
In this guide, we will demonstrate how to automate the conversion of a single PDF and a directory of PDFs to plain text using pdftotext.exe.
Prerequisites
pdftotext.exe is part of the open source Poppler utilities. It is not included with Windows by default. You can download prebuilt Windows binaries from:
- Poppler for Windows (recommended)
- XpdfReader
Extract pdftotext.exe and its required DLLs from the /bin folder into a folder in your system PATH, or place it in the same folder as your Batch script.
Verify it is installed correctly by running:
pdftotext -v
The Strategy: Integrating the Utility
- Download and install the Poppler utilities.
- Place pdftotext.exe in your PATH or alongside your script.
- Use Batch's FOR loop to execute the utility.
- Optionally process the resulting .txt files.
Single File Conversion
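The syntax for pdftotext is straightforward: pdftotext [options] <PDF-file> [<text-file>]. If you omit the text file, pdftotext writes the output next to the PDF, using the same name with a .txt extension.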
@echo off
setlocal enabledelayedexpansion
:: Define the target PDF
set "inputFile=document.pdf"
:: Verify pdftotext is available
where pdftotext >nul 2>nul
if !errorlevel! NEQ 0 (
echo [ERROR] pdftotext.exe not found. Ensure it is in your PATH or alongside this script.
pause
exit /b 1
)
:: Ensure the file exists
if not exist "!inputFile!" (
echo [ERROR] !inputFile! not found.
pause
exit /b 1
)
:: Execute conversion
echo Converting !inputFile! to text...
pdftotext -layout "!inputFile!"
if !errorlevel! EQU 0 (
echo Conversion complete. Check "!inputFile:~0,-4!.txt"
) else (
echo [ERROR] Conversion failed with exit code !errorlevel!
)
pause
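- Use -layout (used throughout this guide) to preserve columns, tables, indentation and page structure as closely as possible. This is best for invoices, reports and formatted documents.
- Use -raw to output text in content stream order, that is, the order in which it is stored inside the PDF. This often undoes column formatting. If you omit both flags, pdftotext discards the physical layout and outputs the text in reading order, which is usually the better input for subsequent searching and parsing with findstr.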
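To see how the modes differ on your own documents, it can help to convert one PDF three ways and compare the results. The following is a minimal sketch; the file names document.pdf, layout.txt, raw.txt and default.txt are placeholders chosen for illustration.

```batch
@echo off
setlocal

:: Placeholder input name; replace it with a real PDF.
set "inputFile=document.pdf"

:: Same input, three output modes, three separate text files.
pdftotext -layout "%inputFile%" "layout.txt"
pdftotext -raw "%inputFile%" "raw.txt"
pdftotext "%inputFile%" "default.txt"

:: fc prints the lines that differ between two of the results.
fc "layout.txt" "default.txt"
```

For a simple single-column document the three files may be nearly identical; the differences tend to show up in multi-column pages and tables.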
Batch Processing an Entire Directory
If you need to extract text from hundreds of invoices or reports, wrap the conversion inside a FOR loop.
@echo off
setlocal enabledelayedexpansion
:: Define the source and destination directories
set "sourceDir=C:\Invoices_PDF"
set "destDir=C:\Invoices_Text"
:: Verify pdftotext is available
where pdftotext >nul 2>nul
if !errorlevel! NEQ 0 (
echo [ERROR] pdftotext.exe not found.
pause
exit /b 1
)
:: Validate source directory
if not exist "!sourceDir!\" (
echo [ERROR] Source directory does not exist: !sourceDir!
pause
exit /b 1
)
:: Create the destination folder if missing
if not exist "!destDir!" mkdir "!destDir!"
echo Starting batch conversion...
echo.
set "processed=0"
set "failed=0"
:: Loop through all PDFs
for %%f in ("!sourceDir!\*.pdf") do (
echo Processing: %%~nxf
pdftotext -layout -q -enc UTF-8 "%%f" "!destDir!\%%~nf.txt"
if !errorlevel! EQU 0 (
set /a "processed+=1"
) else (
echo [FAILED] %%~nxf
set /a "failed+=1"
)
)
echo.
echo ==========================================
echo Conversion complete.
echo Processed: !processed! files
echo Failed: !failed! files
echo Results saved in !destDir!
echo ==========================================
if !processed! EQU 0 if !failed! EQU 0 (
echo [WARNING] No PDF files were found in !sourceDir!
)
pause
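To process the PDFs in all subdirectories recursively, replace the for loop with the one below. The root path of FOR /R is parsed before delayed expansion takes place, so it must be written as %sourceDir% rather than !sourceDir!.
for /r "%sourceDir%" %%f in (*.pdf) do (
pdftotext -layout -q -enc UTF-8 "%%f" "!destDir!\%%~nf.txt"
)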
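Because the recursive loop writes every text file into a single destination folder, two PDFs with the same name in different subfolders would overwrite each other's output. One possible workaround is to mirror the source subfolders under the destination. The sketch below assumes the same sourceDir and destDir values as the script above, and that the paths contain no exclamation marks or equals signs, which would interfere with the string substitution.

```batch
@echo off
setlocal enabledelayedexpansion

set "sourceDir=C:\Invoices_PDF"
set "destDir=C:\Invoices_Text"

:: %%~dpf is the drive and folder of the current PDF, with a trailing backslash.
:: Replacing the source root with the destination root yields the mirrored folder.
for /r "%sourceDir%" %%f in (*.pdf) do (
    set "outDir=%%~dpf"
    set "outDir=!outDir:%sourceDir%=%destDir%!"
    if not exist "!outDir!" mkdir "!outDir!"
    pdftotext -layout -q -enc UTF-8 "%%f" "!outDir!%%~nf.txt"
)
```

With this layout, C:\Invoices_PDF\2023\a.pdf would be written to C:\Invoices_Text\2023\a.txt rather than sharing a folder with every other a.txt.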
Why Convert a PDF to Text?
- Automated Searching: You can't use FINDSTR (the native grep-like search in Batch) on a PDF. By converting a folder of 500 reports to .txt, you can search all of them with FINDSTR /C:"Account 54321" *.txt.
- Data Extraction: Extracting line items, totals, or metadata from recurring system-generated PDFs (like billing statements) to feed an SQL database.
- Archiving: Reducing the storage footprint when only the textual content of a document matters (e.g., long e-discovery or compliance log reports).
Important Considerations
pdftotext extracts the embedded text layer from the PDF. If the PDF consists of scanned images (pictures of text), the utility will export an empty or near-empty file. In this case, you need an OCR (Optical Character Recognition) tool like Tesseract instead.
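After a batch run, a quick size check can flag outputs that are suspiciously small and may come from scanned PDFs. This is a rough heuristic rather than a reliable detector; the 20-byte threshold and the destDir value are assumptions you would tune for your own documents.

```batch
@echo off
setlocal

set "destDir=C:\Invoices_Text"
:: Hypothetical threshold in bytes; adjust it for your documents.
set "minBytes=20"

:: %%~zt expands to the size of the current file in bytes.
for %%t in ("%destDir%\*.txt") do (
    if %%~zt LSS %minBytes% echo [CHECK] %%~nxt is only %%~zt bytes; the source PDF may be scanned
)
```

Files reported by this check are candidates for an OCR pass.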
For encrypted or password protected PDFs, use the password flags:
- -opw password for the owner password
- -upw password for the user password
pdftotext -upw "MyPassword123" "protected.pdf" "output.txt"
Pass the -enc UTF-8 flag explicitly so that accented letters, copyright symbols, smart quotes, and other non-ASCII characters are written consistently. Poppler's pdftotext already defaults to UTF-8, but the Xpdf build defaults to Latin1, which cannot represent characters such as smart quotes.
Post-Conversion Searching
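A common reason to convert PDFs to text is to enable searching. After conversion you can use the native findstr command to search across all documents. The /c: switch makes findstr treat the phrase as one literal string; without it, a quoted string containing spaces is treated as a list of separate search terms, so any line containing either "Account" or "54321" would match.
:: Search all converted text files for "Account 54321"
findstr /i /s /c:"Account 54321" "!destDir!\*.txt"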
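If you only need to know which documents contain the phrase, findstr's /m switch prints just the names of the matching files, which a FOR /F loop can then act on. A minimal sketch, assuming the same destDir as the batch script and the same example phrase:

```batch
@echo off
setlocal

set "destDir=C:\Invoices_Text"

:: /m lists only the files that contain a match; /c: searches for the literal phrase.
for /f "delims=" %%m in ('findstr /i /m /c:"Account 54321" "%destDir%\*.txt"') do (
    echo Match: %%~nxm
)
```

The echo line is a stand-in; the loop body could just as well copy each matching file elsewhere or append its name to a report.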
Conclusion
Converting PDFs to text is a necessary first step before you can apply Batch's text parsing and string manipulation to their contents. Although you must rely on a third-party binary such as Poppler's pdftotext, wrapping it in a FOR loop lets you process hundreds of documents unattended. This enables automated auditing, index searching, and pipeline ingestion for an otherwise opaque file format.