
How to Implement a Checkpoint/Resume Mechanism in Batch Script

If you have a script that takes 3 hours to run (e.g., a massive data migration or a deep system audit), a single crash at hour 2 can be devastating if you have to start over from the beginning. A "Checkpoint/Resume" mechanism allows your script to remember which tasks it has already completed. If the script is interrupted, it can check its internal "Save Game" and skip directly to the next unfinished task.

This guide will explain how to build a resilient multi-stage script with checkpointing logic.

The Logic: The Stage Pointer

To handle checkpoints, we define the script as a series of numbered stages. Every time a stage finishes successfully, we update a permanent "Checkpoint File."

Logic Flow:

  1. Load: Read the number of the last completed stage from the checkpoint file (0 if the file does not exist).
  2. Jump: Use GOTO to bypass stages that have already finished.
  3. Perform: Run the current task.
  4. Mark: Write the number of the stage that just finished to the checkpoint file.

Implementation: The Resilient Pipeline

@echo off
setlocal enabledelayedexpansion

set "CheckFile=%~dp0progress.ckp"
set "TotalStages=3"

:: Support a --fresh flag to restart from the beginning
if /i "%~1"=="--fresh" (
if exist "%CheckFile%" del "%CheckFile%" >nul 2>&1
echo [INFO] Checkpoint cleared. Starting fresh.
)

:: 1. Load progress (default to stage 0 = not started)
set "last_stage=0"
if exist "%CheckFile%" (
set /p "last_stage=" < "%CheckFile%"
)

:: Validate the checkpoint value
if "!last_stage!"=="" set "last_stage=0"

if !last_stage! gtr 0 (
echo [RESUME] Resuming from after Stage !last_stage! of %TotalStages%...
) else (
echo [START] Beginning pipeline (%TotalStages% stages^)...
)
echo.

:: 2. The Jump Logic - skip completed stages
:: Check the highest stage first; otherwise "geq 1" would match every resume.
if !last_stage! geq 3 goto :COMPLETE
if !last_stage! geq 2 goto :STAGE_3
if !last_stage! geq 1 goto :STAGE_2

:STAGE_1
echo [1/%TotalStages%] Downloading large assets...
:: === Stage 1 work here ===
timeout /t 5 /nobreak >nul
:: === End of Stage 1 ===

:: Verify success before checkpointing
if !errorlevel! neq 0 (
echo [ERROR] Stage 1 failed. Checkpoint NOT updated.
pause
exit /b 1
)
:: Redirect first so no trailing space is written to the checkpoint file
>"%CheckFile%" echo 1
echo [OK] Stage 1 complete. Checkpoint saved.

:STAGE_2
echo [2/%TotalStages%] Processing data blocks...
:: === Stage 2 work here ===
timeout /t 5 /nobreak >nul
:: === End of Stage 2 ===

if !errorlevel! neq 0 (
echo [ERROR] Stage 2 failed. Checkpoint NOT updated.
pause
exit /b 1
)
:: Redirect first so no trailing space is written to the checkpoint file
>"%CheckFile%" echo 2
echo [OK] Stage 2 complete. Checkpoint saved.

:STAGE_3
echo [3/%TotalStages%] Uploading results...
:: === Stage 3 work here ===
timeout /t 5 /nobreak >nul
:: === End of Stage 3 ===

if !errorlevel! neq 0 (
echo [ERROR] Stage 3 failed. Checkpoint NOT updated.
pause
exit /b 1
)

:COMPLETE
:: Cleanup checkpoint file on successful completion
if exist "%CheckFile%" del "%CheckFile%" >nul 2>&1
echo.
echo [SUCCESS] All %TotalStages% tasks completed!

pause
endlocal

Advanced: Step-Level Checkpoints (The Registry Method)

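If your script manages system settings, storing checkpoints in the Registry keeps them out of the working directory, where a user might delete a stray file by accident. Each step gets its own marker value, so steps are skipped individually rather than through a single stage pointer.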

@echo off
setlocal

set "RegKey=HKCU\Software\MyProject\Checkpoints"

:: Support a --fresh flag
if /i "%~1"=="--fresh" (
reg delete "%RegKey%" /f >nul 2>&1
echo [INFO] All checkpoints cleared.
)

:: Step 1
reg query "%RegKey%" /v "Step1_Done" >nul 2>&1
if %errorlevel% equ 0 (
echo [SKIP] Step 1 already completed.
) else (
echo [RUN] Step 1: Configuring system settings...
rem === Step 1 work here ===
timeout /t 3 /nobreak >nul
rem === End of Step 1 ===

if not errorlevel 1 (
reg add "%RegKey%" /v "Step1_Done" /t REG_DWORD /d 1 /f >nul 2>&1
echo [OK] Step 1 complete. Checkpoint saved.
) else (
echo [ERROR] Step 1 failed. Will retry on next run.
pause
exit /b 1
)
)

:: Step 2
reg query "%RegKey%" /v "Step2_Done" >nul 2>&1
if %errorlevel% equ 0 (
echo [SKIP] Step 2 already completed.
) else (
echo [RUN] Step 2: Installing components...
rem === Step 2 work here ===
timeout /t 3 /nobreak >nul
rem === End of Step 2 ===

if not errorlevel 1 (
reg add "%RegKey%" /v "Step2_Done" /t REG_DWORD /d 1 /f >nul 2>&1
echo [OK] Step 2 complete. Checkpoint saved.
) else (
echo [ERROR] Step 2 failed. Will retry on next run.
pause
exit /b 1
)
)

:: All steps done - clean up checkpoints
echo.
echo [SUCCESS] All steps complete.
reg delete "%RegKey%" /f >nul 2>&1

pause
endlocal
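
The query/run/mark pattern above repeats for every step, so it can be folded into one subroutine. The following is a minimal sketch, not a definitive implementation: the `:run_step` helper and the `:do_step1` / `:do_step2` labels are hypothetical names introduced for illustration, and the `timeout` calls stand in for real work.

```batch
@echo off
setlocal
set "RegKey=HKCU\Software\MyProject\Checkpoints"

rem Stop the pipeline as soon as any step reports failure.
call :run_step Step1 do_step1 || exit /b 1
call :run_step Step2 do_step2 || exit /b 1

rem All steps done - clean up checkpoints.
reg delete "%RegKey%" /f >nul 2>&1
echo [SUCCESS] All steps complete.
exit /b 0

rem run_step NAME LABEL (hypothetical helper):
rem skips the step if its marker value exists, otherwise runs LABEL
rem and writes the marker only when LABEL returns 0.
:run_step
reg query "%RegKey%" /v "%~1_Done" >nul 2>&1 && (
    echo [SKIP] %~1 already completed.
    exit /b 0
)
echo [RUN] %~1...
call :%~2 || (
    echo [ERROR] %~1 failed. Will retry on next run.
    exit /b 1
)
reg add "%RegKey%" /v "%~1_Done" /t REG_DWORD /d 1 /f >nul 2>&1
echo [OK] %~1 complete. Checkpoint saved.
exit /b 0

:do_step1
rem Placeholder work for Step 1.
timeout /t 1 /nobreak >nul
exit /b %errorlevel%

:do_step2
rem Placeholder work for Step 2.
timeout /t 1 /nobreak >nul
exit /b %errorlevel%
```

Adding a step then takes one `call :run_step` line plus the subroutine that does the work.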

Warning: Handle Dependent Steps Carefully. If Step 2 relies on a variable created in Step 1, and you jump directly to Step 2, that variable will be missing. Solution: Re-calculate or re-load the necessary environment variables at the start of every stage, or store them in the checkpoint file/registry.
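
One way to carry variables across a resume is to save them in a small state file and reload it at startup. This is a sketch under stated assumptions: the `pipeline.state` file name and the `WorkDir` variable are hypothetical, and the values are assumed to contain no special characters such as `&` or `%`.

```batch
@echo off
setlocal

rem Hypothetical state file stored next to the script; each line is NAME=value.
set "StateFile=%~dp0pipeline.state"

rem Reload anything saved by an earlier run before any stage executes.
if exist "%StateFile%" for /f "usebackq tokens=1* delims==" %%A in ("%StateFile%") do set "%%A=%%B"

rem Stage 1 would normally compute WorkDir; do so only if it was not reloaded.
if not defined WorkDir set "WorkDir=%TEMP%\migration_work"

rem Persist the value so a later resume can skip Stage 1 and still see it.
>"%StateFile%" echo WorkDir=%WorkDir%

rem Stage 2 can rely on WorkDir whether or not Stage 1 ran in this session.
echo [INFO] Working directory: %WorkDir%
endlocal
```

The state file should be deleted together with the checkpoint when the pipeline completes or when a fresh run is requested.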

How to Avoid Common Errors

Wrong Way: Marking the checkpoint at the START of a block

If your script marks "Stage 1 Done" and then immediately crashes while performing Stage 1, the resume logic will think Stage 1 finished successfully, leaving your data incomplete.

Correct Way: Only update the checkpoint file/registry after the command has returned an %errorlevel% of 0 (as shown in both methods).
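
The mark-after-success rule can also be expressed with the `||` operator, which runs its right-hand side only when the command on the left returns a non-zero exit code. A minimal sketch, assuming the stage body lives in a hypothetical `:stage_1` subroutine:

```batch
@echo off
setlocal
set "CheckFile=%~dp0progress.ckp"

rem The block after "||" runs only when the stage returns a non-zero exit code.
call :stage_1 || (
    echo [ERROR] Stage 1 failed. Checkpoint NOT updated.
    exit /b 1
)

rem Reached only on success, so the marker is written after the work is done.
>"%CheckFile%" echo 1
echo [OK] Stage 1 complete. Checkpoint saved.
exit /b 0

:stage_1
rem Placeholder work; replace with the real command for this stage.
timeout /t 1 /nobreak >nul
exit /b %errorlevel%
```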

Problem: Stale Checkpoints

If you fix a bug in Stage 1 and want to rerun the whole script, a checkpoint left over from an interrupted run will make the script skip Stage 1 and resume at a later stage.

Best Practice: Add a --fresh or /reset flag to your script that deletes the checkpoint before starting (as demonstrated in both methods).

Problem: Jump logic skipping to the wrong stage

If the jump comparisons are in the wrong order or use incorrect operators, the script can skip stages or run completed stages again.

Best Practice: Order the jump checks from highest to lowest completed stage number. The first matching condition sends execution to the correct resume point.
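
An alternative that removes the ordering problem is to compute the target label from the checkpoint value instead of chaining comparisons. The sketch below assumes the labels follow the `STAGE_n` naming used in the main implementation; the `next_stage` variable is introduced here for illustration, and the stage bodies are placeholders without error handling.

```batch
@echo off
setlocal
set "CheckFile=%~dp0progress.ckp"
set "TotalStages=3"

set "last_stage=0"
if exist "%CheckFile%" (
    set /p "last_stage=" < "%CheckFile%"
)

rem Compute the resume point. If the arithmetic fails on a corrupt value,
rem next_stage keeps its default of 1 and the pipeline starts from the top.
set "next_stage=1"
set /a "next_stage=last_stage + 1" 2>nul
if %next_stage% lss 1 set "next_stage=1"
if %next_stage% gtr %TotalStages% goto :COMPLETE
goto :STAGE_%next_stage%

:STAGE_1
echo [1/%TotalStages%] Stage 1 placeholder
>"%CheckFile%" echo 1

:STAGE_2
echo [2/%TotalStages%] Stage 2 placeholder
>"%CheckFile%" echo 2

:STAGE_3
echo [3/%TotalStages%] Stage 3 placeholder

:COMPLETE
if exist "%CheckFile%" del "%CheckFile%" >nul 2>&1
echo [SUCCESS] All %TotalStages% tasks completed!
endlocal
```

With this layout, adding a stage means adding one label and raising `TotalStages`; no comparison chain needs to be kept in order.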

Best Practices and Rules

1. Granularity

Don't make your stages too long. Instead of one stage called "Install Software," use three: "Extract," "Verify," and "Execute." Smaller checkpoints mean less wasted time during a resume.

2. Validation

When resuming at Stage 5, do a quick sanity check first (e.g., does the output file from Stage 4 actually exist?). If the previous stage's data is missing, roll the checkpoint back to the last stage whose output is intact (here, Stage 3) so that Stage 4 runs again.
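
Such a sanity check might look like the sketch below. The `stage4_output.dat` file name is a hypothetical stand-in for whatever artifact Stage 4 leaves behind, and the checkpoint is assumed to hold the number of the last completed stage.

```batch
@echo off
setlocal
set "CheckFile=%~dp0progress.ckp"

rem Hypothetical artifact that Stage 4 is expected to produce.
set "Stage4Output=%~dp0stage4_output.dat"

set "last_stage=0"
if exist "%CheckFile%" (
    set /p "last_stage=" < "%CheckFile%"
)

rem If the checkpoint claims Stage 4 finished but its output is gone,
rem roll back to Stage 3 so that Stage 4 runs again.
if %last_stage% equ 4 if not exist "%Stage4Output%" (
    echo [WARN] Stage 4 output missing. Rolling checkpoint back to Stage 3.
    set "last_stage=3"
    >"%CheckFile%" echo 3
)

echo [INFO] Last verified stage: %last_stage%
endlocal
```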

3. User Feedback

If the script resumes, tell the user which stage it is resuming from and the total number of stages (as shown in the main implementation). The user can then see that completed work was skipped deliberately rather than lost.

Conclusion

A checkpoint/resume mechanism gives a long-running script a memory that survives crashes and reboots. Instead of starting over after every interruption, the script picks up at the first unfinished stage, so complex, long-running processes eventually reach the finish line. Mark progress only after a stage succeeds, provide a reset flag for stale checkpoints, and validate state on resume, and the mechanism stays dependable.