How to Handle Unicode Filenames in Batch Script

By default, the Windows Command Prompt uses a legacy OEM code page (typically 437 or 850, depending on the system locale) that dates back to the DOS era. If your script tries to process a file with a name containing Cyrillic characters, accented vowels, or CJK ideographs (e.g., rapporto_finale_é.txt), it will likely fail with a "File Not Found" error. To handle modern, international filenames, your script must explicitly switch the console's code page to UTF-8.

This guide explains how to configure your Batch script to support Unicode filenames.

Method 1: The Key: CHCP 65001

The chcp command (Change Code Page) allows you to swap the console's character set. Code page 65001 is the Windows identifier for UTF-8.

Implementation Template

@echo off
setlocal

:: 1. Save the current code page (handles localized chcp output)
set "old_cp="
for /f "tokens=2 delims=:." %%a in ('chcp 2^>nul') do set /a "old_cp=%%a" 2>nul

:: 2. Switch to UTF-8
chcp 65001 >nul
if errorlevel 1 (
    echo [ERROR] Failed to set code page 65001. >&2
    endlocal
    exit /b 1
)

:: 3. Perform tasks involving Unicode filenames
echo Listing files with special characters...
dir /b *.txt

:: 4. Restore the original code page before exiting
if defined old_cp chcp %old_cp% >nul 2>&1

endlocal
exit /b 0

Why the save/restore logic works this way:

The output of chcp varies by Windows locale. English systems print Active code page: 437, while German systems print Aktive Codepage: 850. (with a trailing period). Using both : and . as delimiters in the for /f command handles both formats. The set /a assignment strips any remaining whitespace from the extracted number. The restore step checks if defined old_cp to avoid running chcp with an empty value if the save step failed.
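
To see the delimiter logic in isolation, the sketch below feeds two sample strings through the same for /f options. The :ParseCp label and the sample strings are illustrative assumptions standing in for real chcp output; they are not part of the template above.

```batch
@echo off
setlocal

:: Sample lines standing in for the English and German chcp output.
call :ParseCp "Active code page: 437"
call :ParseCp "Aktive Codepage: 850."

endlocal
exit /b 0

:ParseCp
:: %~1 is the sample line. Token 2 is the text after the colon and
:: before the trailing period, if there is one.
set "cp="
for /f "tokens=2 delims=:." %%a in ("%~1") do set /a "cp=%%a" 2>nul
:: The brackets make any leftover whitespace visible.
echo Input: "%~1"  Parsed: [%cp%]
exit /b 0
```

If the parsing behaves as described, both bracketed values should contain only digits, with no leading space and no trailing period.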

Method 2: Iterating Unicode Filenames in Loops

When iterating over files with Unicode names, prefer the plain for loop over for /f with dir /b. The plain for loop retrieves filenames directly from the filesystem API, which handles Unicode natively. The for /f approach pipes text through dir /b, which can corrupt multi-byte characters during the text-to-console-to-parser roundtrip, even with chcp 65001 active.

@echo off
setlocal

set "old_cp="
for /f "tokens=2 delims=:." %%a in ('chcp 2^>nul') do set /a "old_cp=%%a" 2>nul
chcp 65001 >nul

:: RECOMMENDED: plain for loop - uses filesystem API directly
for %%a in ("C:\Data\*.txt") do (
    echo Processing: "%%~nxa"
    if exist "%%a" echo [OK] System sees the file.
)

:: For recursive searches, use for /r
for /r "C:\Data" %%a in (*.txt) do (
    echo Found: "%%a"
)

if defined old_cp chcp %old_cp% >nul 2>&1

endlocal
exit /b 0

Why plain for is safer than for /f with dir /b:

The for /f command works by capturing the text output of a command and parsing it line by line. That text output must pass through the console's code page encoding, which can corrupt characters that don't survive the roundtrip, especially CJK characters and emoji. The plain for and for /r loops bypass this entirely by reading filenames directly from the filesystem, making them significantly more reliable for Unicode.

When you must use for /f:

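If you need features only for /f provides (such as parsing structured output from another command), use delims= so that each line is captured whole, including any spaces in the filename. With the default delimiters (space and tab), for /f would cut the name off at the first space. Note also that for /f skips lines beginning with a semicolon unless you change the eol option.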

for /f "delims=" %%a in ('dir /b "C:\Data\*.txt"') do (
    echo Processing: "%%a"
)

Be aware that this may still corrupt certain Unicode characters. Test with your specific filenames.
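
One way to run that test is to pass every name returned by for /f back to the filesystem and report the names that no longer match a real file. The following is a minimal sketch, not a definitive check; the script name checknames.bat and the variable target are assumptions made for illustration.

```batch
@echo off
setlocal

:: Hypothetical usage: checknames.bat "C:\Data"
:: With no argument, the current directory is checked.
set "target=%~1"
if not defined target set "target=."

set "old_cp="
for /f "tokens=2 delims=:." %%a in ('chcp 2^>nul') do set /a "old_cp=%%a" 2>nul
chcp 65001 >nul

:: A name that was altered on its way through for /f will fail the
:: existence test and is reported as a mismatch.
for /f "delims=" %%a in ('dir /b /a-d "%target%" 2^>nul') do (
    if not exist "%target%\%%a" echo [MISMATCH] "%%a"
)

if defined old_cp chcp %old_cp% >nul 2>&1
endlocal
exit /b 0
```

If the script prints nothing, every name survived the roundtrip for that directory. Any [MISMATCH] line points to a file that should be handled with the plain for loop instead.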

Method 3: Fonts and Display

Even if your script handles a file correctly at the filesystem level, you might still see boxes (□□□) or question marks on the screen. This is a limitation of the console font, not your script logic.

To verify: if the script correctly copies, moves, or deletes the file despite the garbled display, the logic is working and only the visual rendering is affected.
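
A quick way to run this check is to copy the file into a scratch folder and test whether the copy exists. The sketch below assumes a hypothetical script name (verifycopy.bat) and a throwaway folder under %TEMP%; adjust both to your environment.

```batch
@echo off
setlocal

:: Hypothetical usage: verifycopy.bat "C:\Data\some file.txt"
if "%~1"=="" (
    echo Usage: %~nx0 "file" >&2
    exit /b 2
)

:: Scratch folder with a random suffix, removed again at the end.
set "dst=%TEMP%\unicode_check_%RANDOM%"
mkdir "%dst%" || exit /b 1

copy /y "%~1" "%dst%\" >nul
if exist "%dst%\%~nx1" (
    echo [OK] Copy succeeded, so any garbling is display-only.
) else (
    echo [FAIL] Copy failed, so the problem is not limited to display.
)

rmdir /s /q "%dst%"
endlocal
exit /b 0
```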

For the best Unicode display, change your Command Prompt font to a modern TrueType font like Consolas, Lucida Console, or Cascadia Code. The legacy "Raster Fonts" option does not contain Unicode glyphs and cannot display them. You can change the font by right-clicking the console title bar, selecting Properties, and choosing a TrueType font on the Font tab.

How to Avoid Common Errors

Problem: Saving the Script File with the Wrong Encoding

If you embed Unicode characters directly inside your Batch script (e.g., set "greeting=¡Hola!"), the .bat file itself must be saved in an encoding that preserves those characters. Saving as ANSI (the default in many editors) will corrupt them before the script ever runs.

Solution: Save the .bat file as UTF-8 without a BOM (Byte Order Mark), and make sure chcp 65001 runs before any line that contains non-ASCII text. cmd.exe decodes each line of a script using the code page that is active when that line is read, so the file encoding and the active code page must match. Be aware of two caveats:

  • cmd.exe does not treat a BOM as an encoding signature. The BOM bytes are read as part of the first line, so the first command (usually @echo off) fails with a "not recognized" error and command echoing stays on.
  • The safest approach is to avoid embedding Unicode literals in the script text altogether. Instead, let Unicode data enter through external sources: filenames from the filesystem, content read from files, or values passed as parameters.
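
As a sketch of the parameter-based approach, the script below contains only ASCII text and receives the Unicode filename as its first argument. The script name fileinfo.bat is a hypothetical choice for illustration.

```batch
@echo off
setlocal

:: Hypothetical usage: fileinfo.bat "C:\Data\some file.txt"
:: The script text stays pure ASCII; the Unicode name arrives in %1.
if "%~1"=="" (
    echo Usage: %~nx0 "file" >&2
    exit /b 2
)

if not exist "%~1" (
    echo [ERROR] Not found: "%~1" >&2
    exit /b 1
)

:: %~nx1 is the name and extension, %~z1 the size in bytes.
echo Name: "%~nx1"
echo Size: %~z1 bytes

endlocal
exit /b 0
```

Because no non-ASCII character appears in the file itself, the encoding the script is saved in no longer matters for correctness.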

Problem: Piping and Redirection with Unicode

Commands like type, findstr, and more can corrupt UTF-8 data when piped. These tools process text in the active code page rather than as Unicode, so multi-byte sequences can be split or misread on the way through the pipe.

Solution: For heavy Unicode text processing, delegate to PowerShell, which handles UTF-8 and UTF-16 natively:

powershell -NoProfile -Command "Get-Content -Encoding UTF8 'report_é.txt'"
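
The same delegation can stand in for findstr. The following is a minimal sketch that wraps PowerShell's Select-String in a Batch script; the script name findtext.bat is hypothetical, and the simple quoting used here assumes that neither argument contains a single quote.

```batch
@echo off
setlocal

:: Hypothetical usage: findtext.bat "file" "text to find"
if "%~2"=="" (
    echo Usage: %~nx0 "file" "text" >&2
    exit /b 2
)

:: -SimpleMatch treats the text literally instead of as a regular expression.
:: -Encoding UTF8 tells Select-String how to decode the file.
powershell -NoProfile -Command "Select-String -LiteralPath '%~1' -Pattern '%~2' -SimpleMatch -Encoding UTF8"

endlocal
exit /b 0
```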

Problem: for /f Mangling Unicode Output

As described in Method 2, for /f parses text that has passed through the console code page. Multi-byte UTF-8 sequences can be split or misinterpreted during this conversion.

Solution: Use the plain for or for /r loop when iterating over filenames. Reserve for /f for non-Unicode tasks, or accept that it may produce incorrect results for certain characters.

Best Practices and Rules

1. Quote Every Path Variable

International filenames often contain spaces, ampersands, parentheses, and other characters that break unquoted variable expansion. Always wrap file paths in quotes: "%var%", never bare %var%.
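
The effect of quoting can be sketched with a filename that contains several characters special to cmd.exe. The filename below is a made-up example and does not need to exist for the script to run.

```batch
@echo off
setlocal

:: Hypothetical filename with a space, an ampersand, and parentheses.
:: The quoted form of set keeps the special characters literal.
set "file=R&D notes (final).txt"

:: Quoted expansion passes the whole name as a single argument.
if exist "%file%" echo Found: "%file%"
if not exist "%file%" echo Missing: "%file%"

endlocal
exit /b 0
```

Without the quotes, the name would be split at the space and the ampersand would be read as a command separator, so the test would run against the wrong name and the remainder would be executed as a separate command.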

2. Always Restore the Code Page

Some legacy command-line tools (like older database drivers and certain installers) will malfunction if the code page is 65001 when they run. Always restore the original code page as the last action in your script, including on error paths.

:Cleanup
if defined old_cp chcp %old_cp% >nul 2>&1
endlocal
exit /b 1
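
One way to route every exit through that label is to keep the exit code in a variable and jump with goto. This is a minimal sketch under stated assumptions: the variable name rc and the paths C:\Data and C:\Backup are hypothetical.

```batch
@echo off
setlocal
set "rc=0"

set "old_cp="
for /f "tokens=2 delims=:." %%a in ('chcp 2^>nul') do set /a "old_cp=%%a" 2>nul
chcp 65001 >nul || (set "rc=1" & goto :Cleanup)

:: Hypothetical work step: a failure records the code and jumps to cleanup.
copy /y "C:\Data\*.txt" "C:\Backup\" >nul || (set "rc=1" & goto :Cleanup)

echo Done.

:Cleanup
if defined old_cp chcp %old_cp% >nul 2>&1
:: rc is expanded when the line is parsed, before endlocal discards it.
endlocal & exit /b %rc%
```

With this layout the success path falls through into :Cleanup as well, so the code page is restored in exactly one place.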

3. Know the Limitations

The chcp 65001 approach works well for common Western European accented characters and many scripts, but cmd.exe is not a fully Unicode-aware shell. For automation that must reliably handle the full Unicode range (emoji, CJK, right-to-left scripts), consider using PowerShell or Python as the primary scripting language instead.

Conclusions

Handling Unicode filenames in Batch requires a small but vital adjustment to the environment. By switching to Code Page 65001 and using filesystem-native loops instead of text-parsed ones, you bridge the gap between legacy DOS logic and the modern international file system. Keep the limitations in mind: cmd.exe was not designed for full Unicode support, so delegate to more capable tools like PowerShell when your requirements exceed what the Batch environment can reliably deliver.