How to Monitor Service Failures and Auto-Remediate in Batch Script
In a high-availability server environment, a crashed service (like a web server or a database engine) means downtime. While Windows Services have built-in Recovery options (configurable in the service properties), sometimes you need a more custom solution, like cleaning temporary files before restarting, checking dependencies, or sending an alert to a specific channel. A remediation script monitors the status of a service and, if it finds it stopped, performs a set of self-healing steps before bringing it back online.
This guide will explain how to build an auto-recovery watchdog for Windows Services.
Method 1: The Watchdog Monitor with Retry Limiting
This script checks a service's status at regular intervals. If the service is not running, it attempts recovery with custom pre-restart logic. A retry counter prevents infinite restart loops when the service has a persistent failure.
Implementation
@echo off
setlocal EnableDelayedExpansion
set "ServiceName=Spooler"
set "LogFile=%~dp0service_guardian.log"
set "Interval=60"
set "MaxRetries=3"
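:: RetryWindow and FirstFailTime are reserved for a time-windowed retry limit.
:: This script does not use them yet; it counts consecutive failures only.
set "RetryWindow=3600"
set "RetryCount=0"
set "FirstFailTime=0"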
:: Verify admin privileges (required for service control)
net session >nul 2>&1
if errorlevel 1 (
echo [ERROR] This script requires administrator privileges. >&2
echo Right-click and select "Run as administrator." >&2
endlocal
exit /b 1
)
:: Verify the service exists
sc query "%ServiceName%" >nul 2>&1
if errorlevel 1060 (
echo [ERROR] Service "%ServiceName%" does not exist. >&2
endlocal
exit /b 1
)
title Service Guardian: %ServiceName%
echo [%date% %time%] Guardian started for: %ServiceName% >> "%LogFile%"
echo [MONITOR] Watching service: %ServiceName%
echo [MONITOR] Check interval: %Interval% seconds
echo [MONITOR] Max retries: %MaxRetries% consecutive attempts
echo [MONITOR] Press Ctrl+C to stop.
echo ------------------------------------------
:MonitorLoop
:: Check service status using sc query
:: Parse the STATE line for the current status
set "ServiceState=UNKNOWN"
for /f "tokens=4" %%s in ('sc query "%ServiceName%" ^| findstr "STATE"') do (
set "ServiceState=%%s"
)
if /i "!ServiceState!"=="RUNNING" (
rem Service is healthy - reset retry counter
if !RetryCount! gtr 0 (
echo [%date% %time%] [OK] %ServiceName% is running. Retry counter reset.
echo [%date% %time%] [OK] Service recovered. Retry counter reset. >> "%LogFile%"
set "RetryCount=0"
)
goto :Wait
)
:: Service is not running - determine the state
echo [%date% %time%] [ALERT] %ServiceName% is NOT running (state: !ServiceState!^)
echo [%date% %time%] [ALERT] %ServiceName% state: !ServiceState! >> "%LogFile%"
:: Check retry limit
if !RetryCount! geq %MaxRetries% (
echo [%date% %time%] [CRITICAL] Max retries ^(%MaxRetries%^) exceeded. Human intervention required. >&2
echo [%date% %time%] [CRITICAL] Max retries exceeded for %ServiceName%. >> "%LogFile%"
rem Write to Event Log for monitoring tool visibility
eventcreate /T ERROR /ID 901 /L APPLICATION /SO "ServiceGuardian" ^
/D "CRITICAL: %ServiceName% failed %MaxRetries% recovery attempts. Manual intervention required." >nul 2>&1
rem Exit rather than continuing to restart a broken service
endlocal
exit /b 1
)
:: Attempt recovery
set /a "RetryCount+=1"
echo [%date% %time%] [ACTION] Recovery attempt !RetryCount! of %MaxRetries%... >> "%LogFile%"
call :Remediate
:Wait
timeout /t %Interval% /nobreak >nul
goto :MonitorLoop
:Remediate
:: =============================================
:: Step 1: Pre-restart cleanup (customize this)
:: =============================================
echo [ACTION] Performing pre-restart cleanup...
echo [%date% %time%] [ACTION] Pre-restart cleanup started >> "%LogFile%"
:: Example: delete lock files that prevent startup
:: if exist "C:\App\temp\lock.file" del "C:\App\temp\lock.file" /q
:: Example: clear a temp directory
:: if exist "C:\App\cache\*" del "C:\App\cache\*" /q 2>nul
:: =============================================
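:: Step 2: Handle stuck STOP_PENDING / START_PENDING states
:: =============================================
:: If the service is stuck in a pending state, force-kill its process.
:: Caution: if the service shares a host process (for example svchost.exe),
:: killing that process also terminates every other service it hosts.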
echo !ServiceState! | findstr /i "PENDING" >nul
if not errorlevel 1 (
echo [ACTION] Service is in a PENDING state. Attempting force termination...
echo [%date% %time%] [ACTION] Force-killing stuck service >> "%LogFile%"
rem Get the PID of the service process. The line reads "PID : 1234",
rem so the number is the third whitespace-separated token.
for /f "tokens=3" %%p in ('sc queryex "%ServiceName%" ^| findstr "PID"') do (
if %%p neq 0 (
taskkill /f /pid %%p >nul 2>&1
timeout /t 3 >nul
)
)
)
:: =============================================
:: Step 3: Start the service
:: =============================================
echo [ACTION] Starting %ServiceName%...
net start "%ServiceName%" >nul 2>&1
:: Wait for the service to fully start (some services take time)
echo [ACTION] Waiting for service to reach RUNNING state...
set "StartWait=0"
:StartCheck
timeout /t 3 >nul
set /a "StartWait+=3"
sc query "%ServiceName%" | findstr /i "RUNNING" >nul
if not errorlevel 1 (
echo [%date% %time%] [OK] %ServiceName% recovered successfully (attempt !RetryCount!^).
echo [%date% %time%] [OK] Recovery successful (attempt !RetryCount!^) >> "%LogFile%"
eventcreate /T INFORMATION /ID 200 /L APPLICATION /SO "ServiceGuardian" ^
/D "Service %ServiceName% recovered after attempt !RetryCount!." >nul 2>&1
exit /b 0
)
if %StartWait% geq 30 (
echo [%date% %time%] [ERROR] %ServiceName% did not reach RUNNING state within 30 seconds.
echo [%date% %time%] [ERROR] Recovery attempt !RetryCount! failed >> "%LogFile%"
eventcreate /T WARNING /ID 500 /L APPLICATION /SO "ServiceGuardian" ^
/D "Service %ServiceName% recovery attempt !RetryCount! failed." >nul 2>&1
exit /b 1
)
goto :StartCheck
How the retry limiter works:
Without a retry limit, a service with a corrupt configuration or missing dependency would be restarted every 60 seconds indefinitely, generating thousands of log entries, consuming resources, and masking the real problem. The MaxRetries counter tracks consecutive failures. When the limit is reached, the script stops attempting recovery and exits with an error, escalating to human intervention.
The counter resets to zero when the service is found running, so transient failures (a service that crashes once but stays up after restart) don't accumulate toward the limit.
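The counter behavior can be seen in isolation with a small simulation. This is only a sketch: the States list below is made-up input standing in for successive sc query results, and no real service is touched.

```batch
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical sequence of observed states (not real sc output).
set "States=STOPPED RUNNING STOPPED STOPPED STOPPED STOPPED"
set "MaxRetries=3"
set "RetryCount=0"

for %%s in (%States%) do (
    if /i "%%s"=="RUNNING" (
        rem A healthy check clears the consecutive-failure count.
        set "RetryCount=0"
        echo %%s: counter reset to 0
    ) else if !RetryCount! geq %MaxRetries% (
        rem Limit reached: stop retrying and hand over to a human.
        echo %%s: limit reached, escalate and stop
        goto :Done
    ) else (
        set /a "RetryCount+=1"
        echo %%s: recovery attempt !RetryCount! of %MaxRetries%
    )
)
:Done
endlocal
```

In this sequence the single RUNNING observation clears the first failure, so only the run of consecutive STOPPED states that follows counts toward the limit.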
Method 2: Dependency-Aware Recovery
Services often depend on other services. If Service B needs Service A, and A is not registered as a dependency of B or cannot be started, restarting B while A is down will fail. This method checks dependencies before attempting recovery.
@echo off
setlocal EnableDelayedExpansion
set "ServiceName=MyWebApp"
set "DependsOn=MSSQLSERVER W3SVC"
set "LogFile=%~dp0dependency_recovery.log"
:: Verify admin privileges
net session >nul 2>&1
if errorlevel 1 (
echo [ERROR] Administrator privileges required. >&2
endlocal
exit /b 1
)
:: Check if the main service needs recovery
sc query "%ServiceName%" | findstr /i "RUNNING" >nul
if not errorlevel 1 (
echo [OK] %ServiceName% is running.
endlocal
exit /b 0
)
echo [ALERT] %ServiceName% is not running. Checking dependencies...
echo [%date% %time%] %ServiceName% is down. Checking dependencies. >> "%LogFile%"
:: Check each dependency
set "DepsOK=TRUE"
for %%d in (%DependsOn%) do (
sc query "%%d" | findstr /i "RUNNING" >nul
if errorlevel 1 (
echo [WARN] Dependency "%%d" is NOT running.
echo [!date! !time!] [WARN] Dependency %%d is not running >> "%LogFile%"
echo [ACTION] Starting dependency: %%d
net start "%%d" >nul 2>&1
if errorlevel 1 (
echo [ERROR] Failed to start dependency: %%d >&2
echo [!date! !time!] [ERROR] Could not start dependency %%d >> "%LogFile%"
set "DepsOK=FALSE"
) else (
echo [OK] Dependency %%d started.
echo [!date! !time!] [OK] Dependency %%d started >> "%LogFile%"
rem Wait for the dependency to stabilize
timeout /t 5 >nul
)
) else (
echo [OK] Dependency "%%d" is running.
)
)
:: Only attempt main service recovery if all dependencies are running
if "!DepsOK!"=="FALSE" (
echo [ERROR] Cannot recover %ServiceName% - dependencies are not available. >&2
echo [%date% %time%] [ERROR] Recovery blocked by dependency failure >> "%LogFile%"
endlocal
exit /b 1
)
echo [ACTION] All dependencies OK. Starting %ServiceName%...
net start "%ServiceName%" >nul 2>&1
:: Verify startup
timeout /t 10 >nul
sc query "%ServiceName%" | findstr /i "RUNNING" >nul
if not errorlevel 1 (
echo [OK] %ServiceName% recovered successfully.
echo [%date% %time%] [OK] %ServiceName% recovered >> "%LogFile%"
) else (
echo [ERROR] %ServiceName% failed to start even with dependencies running. >&2
echo [%date% %time%] [ERROR] %ServiceName% start failed >> "%LogFile%"
endlocal
exit /b 1
)
endlocal
exit /b 0
When to use dependency-aware recovery:
- Web applications that depend on SQL Server, IIS (W3SVC), or message queues
- Custom services that depend on network services (DNS Client, DHCP Client)
- Monitoring agents that depend on the Windows Event Log service
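Windows Services have a built-in dependency mechanism: when a service is started, the Service Control Manager first starts the dependencies registered for it. It knows nothing about dependencies that were never declared (for example, a web application that needs SQL Server but does not list it), and a dependency that fails to start surfaces only as a generic start error. Your recovery script must handle these cases explicitly.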
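To see which dependencies are actually registered with the Service Control Manager, you can inspect the service configuration. The following is a minimal sketch; MyWebApp is the placeholder name from the Method 2 example, and the sc config line is left commented out because it overwrites the existing dependency list.

```batch
@echo off
rem Placeholder service name from the Method 2 example; replace with your own.
set "ServiceName=MyWebApp"

rem Print the service configuration. The DEPENDENCIES field lists the
rem registered dependencies, one per line.
sc qc "%ServiceName%"

rem List the services that depend on this one.
sc enumdepend "%ServiceName%"

rem Optional: register dependencies (requires elevation and replaces the
rem current list). The space after "depend=" is intentional.
rem sc config "%ServiceName%" depend= MSSQLSERVER/W3SVC
```

Anything your application needs that does not appear in the DEPENDENCIES field is a candidate for the DependsOn list in the script.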
Method 3: Multi-Service Watchdog
For environments with multiple critical services, a single watchdog script can monitor all of them.
@echo off
setlocal EnableDelayedExpansion
set "Services=Spooler W3SVC MSSQLSERVER"
set "LogFile=%~dp0multi_service_watch.log"
set "Interval=60"
net session >nul 2>&1
if errorlevel 1 (
echo [ERROR] Administrator privileges required. >&2
endlocal
exit /b 1
)
title Multi-Service Watchdog
echo [MONITOR] Watching services: %Services%
echo [MONITOR] Interval: %Interval% seconds
echo ------------------------------------------
:WatchLoop
for %%s in (%Services%) do (
sc query "%%s" >nul 2>&1
if errorlevel 1060 (
echo [!date! !time!] [SKIP] Service "%%s" does not exist on this system.
) else (
sc query "%%s" | findstr /i "RUNNING" >nul
if errorlevel 1 (
echo [!date! !time!] [ALERT] %%s is NOT running.
echo [!date! !time!] [ALERT] %%s is down >> "%LogFile%"
net start "%%s" >nul 2>&1
if not errorlevel 1 (
echo [!date! !time!] [OK] %%s recovered.
echo [!date! !time!] [OK] %%s recovered >> "%LogFile%"
) else (
echo [!date! !time!] [ERROR] %%s recovery failed. >&2
echo [!date! !time!] [ERROR] %%s recovery failed >> "%LogFile%"
)
)
)
)
timeout /t %Interval% >nul
goto :WatchLoop
How to Avoid Common Errors
Wrong Way: Using net start to Check Status
net start ServiceName attempts to start the service. If the service is already running, the command prints an error and returns a nonzero exit code, so using it as a status check is noisy, logs false "failures," and changes service state when the service happens to be stopped.
Correct Way: Use sc query ServiceName to inspect the current state without modifying anything.
Problem: Infinite Restart Loops
If a service has a corrupt configuration file, a missing dependency, or a code bug, it will crash immediately after every restart. Without a retry limit, the watchdog restarts it every 60 seconds forever, generating thousands of log entries and masking the real problem.
Solution: Implement a retry counter (as shown in Method 1). After a configurable number of consecutive failures, stop recovery attempts, log a critical alert, write to the Event Log, and exit.
Problem: Services Stuck in PENDING States
A service in STOP_PENDING or START_PENDING cannot be started with net start. The command will fail or hang.
Solution: Method 1 detects pending states and uses sc queryex to find the service's PID, then taskkill /f /pid to force-terminate the stuck process before attempting a restart.
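Before force-killing a process, it can help to check what else runs inside it. The sketch below assumes the Spooler service as an example; it reads the PID from sc queryex (the number is the third whitespace-separated token on the PID line) and lists every service hosted in that process.

```batch
@echo off
setlocal
rem Example service name; replace with the service you monitor.
set "ServiceName=Spooler"
set "SvcPid=0"

rem The PID line looks like "PID : 1234", so the number is token 3.
for /f "tokens=3" %%p in ('sc queryex "%ServiceName%" ^| findstr /c:"PID"') do set "SvcPid=%%p"

if "%SvcPid%"=="0" (
    echo %ServiceName% has no running process.
) else (
    rem Show all services hosted in that process before deciding to kill it.
    tasklist /svc /fi "PID eq %SvcPid%"
)
endlocal
```

If tasklist shows several services under the same PID, taskkill /f would stop all of them, so a reboot or a manual investigation may be the safer choice.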
Problem: Starting Before Dependencies Are Ready
If Service B depends on Service A, net start B will fail while A is still starting. Simply running net start B immediately after net start A is unreliable.
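Solution: After starting each dependency, wait and verify that it reached the RUNNING state before proceeding. Method 2 uses a fixed five-second pause after net start; for slow-starting dependencies, polling sc query until the service reports RUNNING or a timeout expires is more reliable.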
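One way to verify the RUNNING state is a small polling subroutine. This is a sketch under stated assumptions: the :WaitForRunning label, the Dependency and TimeoutSeconds variables, and the two-second polling step are illustrative choices, not part of the scripts above.

```batch
@echo off
setlocal
rem Hypothetical dependency name and timeout; adjust for your environment.
set "Dependency=MSSQLSERVER"
set "TimeoutSeconds=60"

net start "%Dependency%" >nul 2>&1
call :WaitForRunning "%Dependency%" %TimeoutSeconds%
if errorlevel 1 (
    echo [ERROR] %Dependency% did not reach RUNNING in time. >&2
    endlocal
    exit /b 1
)
echo [OK] %Dependency% is running.
endlocal
exit /b 0

:WaitForRunning
rem %1 = service name, %2 = maximum number of seconds to wait
set "Waited=0"
:WaitLoop
sc query "%~1" | findstr /i "RUNNING" >nul
if not errorlevel 1 exit /b 0
if %Waited% geq %~2 exit /b 1
timeout /t 2 /nobreak >nul
set /a "Waited+=2"
goto :WaitLoop
```

The subroutine returns 0 as soon as the service reports RUNNING and 1 once the timeout is exceeded, so the caller can branch on errorlevel in the same way as the other checks in this guide.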
Problem: Recovery Without Admin Rights
net start and sc commands that modify service state require administrator privileges. A watchdog running as a standard user will detect failures but fail silently on every recovery attempt.
Solution: Check for admin rights at startup (as shown in all methods). Deploy the watchdog as a scheduled task configured to run with highest privileges.
Best Practices and Rules
1. Always Check Admin Rights at Startup
Service control requires elevation. All monitoring scripts should verify admin status before entering the monitoring loop, failing with a clear error message rather than silently failing on every recovery attempt.
2. Log Every Action and Outcome
Record the timestamp, service name, detected state, action taken, and result. Without this detail, you cannot distinguish between a service that recovered on the first attempt and one that is cycling through failures.
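A small logging helper keeps these fields consistent across the script. The sketch below is illustrative: the :Log label and its three-argument layout are assumptions, and the two calls use made-up example values.

```batch
@echo off
setlocal EnableDelayedExpansion
set "LogFile=%~dp0service_guardian.log"

rem Example calls with made-up values: level, service name, quoted detail.
call :Log ALERT Spooler "state=STOPPED action=restart"
call :Log OK Spooler "state=RUNNING result=recovered attempt=1"
endlocal
exit /b 0

:Log
rem %1 = level, %2 = service name, %3 = quoted detail text
rem !date! and !time! are read when the line runs, not when it is parsed.
echo [!date! !time!] [%~1] %~2: %~3
rem The redirection comes first so a trailing digit in the text is not
rem mistaken for a file handle number.
>>"%LogFile%" echo [!date! !time!] [%~1] %~2: %~3
exit /b 0
```

Each entry then carries the timestamp, level, service name, detected state, action, and result on a single line, which makes the log easy to filter with findstr.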
3. Write to the Event Log for Visibility
Text log files are useful for the script administrator, but enterprise monitoring tools (SCOM, Zabbix, Splunk) watch the Windows Event Log. Write significant events (recovery success, recovery failure, max retries exceeded) using eventcreate so monitoring tools can trigger their own alerts.
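To confirm that the events are reaching the Event Log, you can query them from the command line. This sketch assumes the "ServiceGuardian" source name used with eventcreate in Method 1.

```batch
@echo off
rem Show the five most recent Application-log events written by the
rem "ServiceGuardian" source, newest first, as readable text.
wevtutil qe Application /q:"*[System[Provider[@Name='ServiceGuardian']]]" /c:5 /rd:true /f:text
```

If the query returns nothing after a recovery attempt, check that the script ran elevated, since eventcreate needs administrator rights to register a new source.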
4. Wait After Starting a Service
Some services take 10–30 seconds to move from START_PENDING to RUNNING. Do not check the status immediately after net start; wait at least 5–10 seconds, then poll until the service reaches RUNNING or a timeout expires.
5. Consider Windows Built-In Recovery First
Before building a custom watchdog, check the service's Recovery tab (in services.msc > Properties > Recovery). Windows can automatically restart a service, run a program, or reboot the computer on failure. Use a custom script only when you need logic that the built-in recovery cannot provide (pre-restart cleanup, dependency checking, custom alerting).
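The same Recovery settings can be configured from the command line with sc failure. The following is a minimal sketch using the Spooler service as an example; the delay and reset values are arbitrary choices for illustration.

```batch
@echo off
rem Example service; replace with your own. Run from an elevated prompt.
set "ServiceName=Spooler"

rem Restart the service 60 seconds (60000 ms) after the first, second, and
rem subsequent failures. Reset the failure count after 86400 seconds (one
rem day) without a failure. The space after each "=" is intentional.
sc failure "%ServiceName%" reset= 86400 actions= restart/60000/restart/60000/restart/60000

rem Display the configured recovery actions.
sc qfailure "%ServiceName%"
```

If this covers your needs, a custom watchdog may be unnecessary for that service.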
6. Deploy as a Scheduled Task
A monitoring script must survive reboots. Configure it as a scheduled task that starts at system boot, runs with highest privileges, and restarts if the task stops unexpectedly.
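A scheduled task of this kind can be registered with schtasks. This is a sketch under assumptions: the task name and script path below are hypothetical and should be adjusted to your environment.

```batch
@echo off
rem Hypothetical task name and script path; adjust both.
rem Run from an elevated prompt.
set "TaskName=ServiceGuardian"
set "ScriptPath=C:\Scripts\service_guardian.bat"

rem Create the task (or overwrite it, /f) so it starts at boot,
rem runs as SYSTEM, and uses the highest privilege level.
schtasks /create /tn "%TaskName%" /tr "%ScriptPath%" /sc onstart /ru SYSTEM /rl highest /f

rem Confirm the task was registered and show its settings.
schtasks /query /tn "%TaskName%" /v /fo list
```

These switches do not cover the restart-on-failure behavior; that option is set on the task's Settings tab in Task Scheduler ("If the task fails, restart every").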
Conclusions
Auto-remediation reduces downtime by replacing manual restarts with a watchdog script. With retry limiting, dependency checking, and Event Log integration, the watchdog resolves most transient failures without human intervention and escalates persistent problems to administrators before they cause extended outages.