
How to Convert a PDF to Text in Batch Script

Batch scripts can process text and strings natively but have no built-in way to parse binary formats like PDF. To extract text from a PDF for automated analysis or indexing, you need a command-line utility. The open-source pdftotext tool from the Poppler utilities is a widely used choice for this task.

In this guide, we will demonstrate how to automate the conversion of a single PDF and a directory of PDFs to plain text using pdftotext.exe.

Prerequisites

Installing pdftotext

pdftotext.exe is part of the open-source Poppler utilities and is not included with Windows by default. You can download prebuilt Windows binaries from:

Extract pdftotext.exe and its required DLLs from the /bin folder into a folder in your system PATH, or place it in the same folder as your Batch script.
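If you would rather not change the system-wide PATH, a script can extend PATH for its own session only. The sketch below assumes the Poppler binaries were extracted to a poppler\bin folder next to the script; that folder name is hypothetical, so adjust it to your layout.

```batch
@echo off
setlocal

:: Assumption: the Poppler binaries live in "poppler\bin" beside this script.
:: %~dp0 expands to the script's own drive and folder, with a trailing backslash.
set "popplerBin=%~dp0poppler\bin"

:: Prepend for this session only; the system-wide PATH is not modified.
set "PATH=%popplerBin%;%PATH%"

where pdftotext >nul 2>nul
if errorlevel 1 (
    echo [ERROR] pdftotext.exe was not found in "%popplerBin%" or on PATH.
) else (
    echo pdftotext found.
)
```

Because of setlocal, the modified PATH is discarded when the script ends.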

Verify it is installed correctly by running:

pdftotext -v

The Strategy: Integrating the Utility

  1. Download and install the Poppler utilities.
  2. Place pdftotext.exe in your PATH or alongside your script.
  3. Use Batch's FOR loop to execute the utility.
  4. Optionally process the resulting .txt files.

Single File Conversion

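The syntax for pdftotext is straightforward: pdftotext [options] <PDF-file> [<text-file>]. If you omit the text file, the output is written next to the PDF with the same base name and a .txt extension.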
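To make the optional parts of that syntax concrete, here is a minimal sketch that names the output file explicitly, limits the conversion to a page range with -f and -l, and sends text to standard output by passing - as the text file. The file names and the search word are hypothetical.

```batch
@echo off
setlocal

:: Hypothetical file names, for illustration only.
set "inputFile=report.pdf"
set "outputFile=report_pages_1-3.txt"

:: -f and -l select the first and last page to convert.
pdftotext -f 1 -l 3 "%inputFile%" "%outputFile%"

:: Passing "-" as the text file sends the text to standard output,
:: which lets you pipe it straight into another command.
pdftotext "%inputFile%" - | findstr /i /c:"invoice"
```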

@echo off
setlocal enabledelayedexpansion

:: Define the target PDF
set "inputFile=document.pdf"

:: Verify pdftotext is available
where pdftotext >nul 2>nul
if !errorlevel! NEQ 0 (
    echo [ERROR] pdftotext.exe not found. Ensure it is in your PATH or alongside this script.
    pause
    exit /b 1
)

:: Ensure the file exists
if not exist "!inputFile!" (
    echo [ERROR] !inputFile! not found.
    pause
    exit /b 1
)

:: Execute conversion; with no output name given, the text file is written next to the PDF
echo Converting !inputFile! to text...
pdftotext -layout "!inputFile!"

if !errorlevel! EQU 0 (
    echo Conversion complete. Check "!inputFile:~0,-4!.txt"
) else (
    echo [ERROR] Conversion failed with exit code !errorlevel!
)

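pause

Layout vs. Raw Mode

  • Use -layout (default for this guide) to preserve columns, tables, indentation and page structure. This is best for invoices, reports and formatted documents.
  • Use -raw to keep the text in the order it appears in the PDF's content stream, with no layout reconstruction. This can simplify later searching and parsing with findstr, but the order is not guaranteed to match the visual reading order.
  • With neither flag, pdftotext undoes the physical layout and outputs the text in reading order.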
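A quick way to pick a mode for your own documents is to convert the same PDF both ways and compare the results. This is only a sketch; the file names are hypothetical.

```batch
@echo off
setlocal

:: Hypothetical input file, for illustration only.
set "inputFile=report.pdf"

:: Produce one output per mode so they can be compared side by side.
pdftotext -layout "%inputFile%" "report_layout.txt"
pdftotext -raw "%inputFile%" "report_raw.txt"

:: fc prints the lines that differ between the two files.
fc "report_layout.txt" "report_raw.txt"
```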

Batch Processing an Entire Directory

If you need to extract text from hundreds of invoices or reports, wrap the conversion inside a FOR loop.

@echo off
setlocal enabledelayedexpansion

:: Define the source and destination directories
set "sourceDir=C:\Invoices_PDF"
set "destDir=C:\Invoices_Text"

:: Verify pdftotext is available
where pdftotext >nul 2>nul
if !errorlevel! NEQ 0 (
    echo [ERROR] pdftotext.exe not found.
    pause
    exit /b 1
)

:: Validate source directory
if not exist "!sourceDir!\" (
    echo [ERROR] Source directory does not exist: !sourceDir!
    pause
    exit /b 1
)

:: Create the destination folder if missing
if not exist "!destDir!" mkdir "!destDir!"

echo Starting batch conversion...
echo.

set "processed=0"
set "failed=0"

:: Loop through all PDFs
for %%f in ("!sourceDir!\*.pdf") do (
    echo Processing: %%~nxf

    pdftotext -layout -q -enc UTF-8 "%%f" "!destDir!\%%~nf.txt"

    if !errorlevel! EQU 0 (
        set /a "processed+=1"
    ) else (
        echo [FAILED] %%~nxf
        set /a "failed+=1"
    )
)

echo.
echo ==========================================
echo Conversion complete.
echo Processed: !processed! files
echo Failed: !failed! files
echo Results saved in !destDir!
echo ==========================================

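:: Warn only when the loop found no PDFs, not when every conversion failed
set /a "total=processed+failed"
if !total! EQU 0 (
    echo [WARNING] No PDF files were found in !sourceDir!
)

pause

Recursive Subdirectory Processing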

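To process the PDFs in all subdirectories recursively, replace the for loop with the one below. The root path after /r must be written as %sourceDir% rather than !sourceDir!, because FOR parses that path before delayed expansion takes place. Note that every text file is written to the same destination folder, so PDFs that share a name in different subfolders will overwrite each other's output.

for /r "%sourceDir%" %%f in (*.pdf) do (
    pdftotext -layout -q -enc UTF-8 "%%f" "!destDir!\%%~nf.txt"
)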
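When a recursive run writes every text file into a single folder, PDFs with the same name in different subfolders collide. One possible workaround, sketched here, is to mirror the source folder structure under the destination. The relDir variable is introduced for this illustration only, and the sketch assumes that no path contains an exclamation mark, which delayed expansion would strip.

```batch
@echo off
setlocal enabledelayedexpansion

:: Same example folders as in the script above.
set "sourceDir=C:\Invoices_PDF"
set "destDir=C:\Invoices_Text"

for /r "%sourceDir%" %%f in (*.pdf) do (
    rem %%~dpf is the PDF's folder, with a trailing backslash.
    set "relDir=%%~dpf"
    rem Strip the source root to get the folder relative to it.
    set "relDir=!relDir:%sourceDir%=!"
    if not exist "%destDir%!relDir!" mkdir "%destDir%!relDir!"
    pdftotext -layout -q -enc UTF-8 "%%f" "%destDir%!relDir!%%~nf.txt"
)
```

For a PDF at C:\Invoices_PDF\2024\jan.pdf, relDir becomes \2024\ and the text lands in C:\Invoices_Text\2024\jan.txt.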

Why Convert a PDF to Text?

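  1. Automated Searching: FINDSTR (the native grep-like search in Batch) cannot usefully search inside a PDF. After converting a folder of 500 reports to .txt, you can run FINDSTR /C:"Account 54321" *.txt across all of them. The /C: switch is needed so the phrase is matched as a whole; without it, FINDSTR treats the space as "or" and matches either word.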
  2. Data Extraction: Extracting line items, totals, or metadata from recurring system-generated PDFs (like billing statements) to feed an SQL database.
  3. Archiving: Reducing the storage footprint when only the textual content of a document matters (e.g., long e-discovery or compliance log reports).
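As a rough illustration of the data extraction case, the sketch below pulls the value that follows a "Total:" label out of one converted file. Both the file name and the label are hypothetical; real statements will need a pattern matched to their own wording.

```batch
@echo off
setlocal

:: Hypothetical file name, for illustration only.
set "textFile=invoice_001.txt"

:: findstr keeps only the lines containing "Total:"; for /f then splits each
:: matching line at the first colon and captures the rest of the line in %%b.
for /f "tokens=1,* delims=:" %%a in ('findstr /i /c:"Total:" "%textFile%"') do (
    echo Label: %%a
    echo Value: %%b
)
```

The captured value keeps any leading spaces from the original line, so trim it before loading it into a database.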

Important Considerations

Scanned PDFs

pdftotext extracts the embedded text layer from the PDF. If the PDF consists of scanned images (pictures of text), the utility will export an empty or near-empty file. In this case, you need an OCR (Optical Character Recognition) tool like Tesseract instead.
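One way to spot likely scanned PDFs after a batch run is to flag output files that are suspiciously small. The sketch below uses an arbitrary 16-byte threshold, which is an assumption to tune for your documents. Keep in mind that pdftotext normally writes a form feed character between pages, so a long scanned document can still exceed a small threshold unless the conversion was run with -nopgbrk.

```batch
@echo off
setlocal

:: Same example output folder as in the script above.
set "destDir=C:\Invoices_Text"

:: Assumption: fewer than 16 bytes of text means there was no usable text layer.
set "minBytes=16"

for %%f in ("%destDir%\*.txt") do (
    rem %%~zf expands to the file size in bytes.
    if %%~zf LSS %minBytes% echo [OCR NEEDED] %%~nf.pdf produced only %%~zf bytes of text
)
```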

Password Protected PDFs

For encrypted or password-protected PDFs, use the password flags:

  • -opw password for the owner password
  • -upw password for the user password

pdftotext -upw "MyPassword123" "protected.pdf" "output.txt"
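Hard-coding a password in a script is rarely desirable. A minimal alternative, sketched here, reads it from an environment variable and prompts only when that variable is missing. The variable name PDF_PASSWORD is hypothetical, and set /p shows the password on screen as it is typed.

```batch
@echo off
setlocal

:: Hypothetical variable name: PDF_PASSWORD may be supplied by the caller's environment.
if not defined PDF_PASSWORD set /p "PDF_PASSWORD=Enter the PDF user password: "

pdftotext -upw "%PDF_PASSWORD%" "protected.pdf" "output.txt"
if errorlevel 1 echo [ERROR] Conversion failed. Check the password and the file path.
```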

Unicode and Encoding

Pass the -enc UTF-8 flag explicitly so that accented letters, copyright symbols, smart quotes, and other non-ASCII characters are written correctly. Poppler's pdftotext already defaults to UTF-8, but other pdftotext builds (such as the Xpdf one) may default to a different encoding, so stating it explicitly keeps the output predictable.
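UTF-8 output can still look garbled when printed in a console that uses a legacy code page. As a small sketch, switching the console to code page 65001 before displaying the file usually fixes this; the file names are hypothetical.

```batch
@echo off
setlocal

:: Switch the console to the UTF-8 code page so "type" renders non-ASCII text.
chcp 65001 >nul

pdftotext -enc UTF-8 "document.pdf" "document.txt"
type "document.txt"
```

The code page change stays in effect for the rest of the console session, since setlocal does not restore it.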

Post-Conversion Searching

A common reason to convert PDFs to text is to enable searching. After conversion you can use Batch's native findstr to search across all documents. Use /c: to match the phrase as a whole; without it, findstr matches lines that contain either word.

:: Search all converted text files for "Account 54321"
findstr /i /s /c:"Account 54321" "!destDir!\*.txt"
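Two findstr switches are often useful on top of a basic search: /m lists only the files that contain a match, and /n prefixes each matching line with its line number. The sketch below reuses the example folder and phrase from this guide.

```batch
@echo off
setlocal

:: Same example output folder as in the scripts above.
set "destDir=C:\Invoices_Text"

:: /m prints only the names of files that contain a match.
findstr /i /s /m /c:"Account 54321" "%destDir%\*.txt"

:: /n prefixes each matching line with its line number.
findstr /i /s /n /c:"Account 54321" "%destDir%\*.txt"

:: findstr sets errorlevel 1 when nothing matches.
if errorlevel 1 echo No documents mention that account.
```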

Conclusion

Converting PDFs to text is a necessary first step before you can apply Batch's text parsing and string manipulation to their contents. Although you must rely on a third-party binary such as Poppler's pdftotext, calling it from a FOR loop lets you process hundreds of documents unattended. This enables automated auditing, indexed searching, and pipeline ingestion for an otherwise opaque file format.