Bot Monitoring and Alerts: Knowing When Your Trading Bot Stops Working

June 23, 2026 · 7 min read · LMEX.AI

A bot that crashes silently at 3am loses money for hours before anyone notices. The cheapest insurance you can buy as an algo trader is the alert that wakes you up when something is wrong.

This article walks through what to monitor on a live trading bot, how to wire up Telegram and healthchecks in Python, and the alert thresholds that actually matter.

Why silent failures cost real money

Three failure modes account for the vast majority of bot downtime:

1. **Process crashes** — unhandled exception, OOM kill, disk full. Bot stops. Stop-loss orders may not fire. Open positions sit unmanaged.

2. **API errors** — exchange rate limits, transient network issues, key expiry. Bot is running but every order fails.

3. **Behavioural drift** — bot is "running" but trading wildly differently than expected. Position size 10× normal, or signals firing every minute instead of every hour.

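Process crashes are the easiest to detect, since the process is either alive or it isn't. API errors are nearly as easy, because they show up as a falling order success rate while the process stays up. Behavioural drift is harder, and tends to be the most expensive when it happens.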

What to monitor

Five signals cover most of what matters:

**Process liveness** — is the bot's systemd service active? Easy to check, easy to alert on.

**Last action timestamp** — when did the bot last place, modify, or cancel an order? If a market-making bot hasn't requoted in 5 minutes, something is wrong.

**Open position vs expected** — does the actual position on the exchange match what the bot thinks it has? Drift here means the local state is corrupted.

**Drawdown** — has the equity curve breached the daily loss limit?

**Order success rate** — what fraction of orders submitted in the last hour were accepted by the exchange rather than rejected or errored? A falling success rate signals API problems.

A complete monitoring setup checks all five. A minimum setup checks the first three.
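Two of these signals, last action timestamp and order success rate, need a little bookkeeping inside the bot. The following is a minimal sketch of one way to track them, not a definitive implementation. `OrderStats` and `is_stale` are hypothetical helpers, and the one-hour window and idle limit are assumptions to tune per strategy.

```python
import time
from collections import deque


class OrderStats:
    """Rolling window of order outcomes (hypothetical helper)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, accepted) pairs, oldest first

    def record(self, accepted, now=None):
        """Call once per submitted order with True if the exchange accepted it."""
        now = time.time() if now is None else now
        self.events.append((now, bool(accepted)))

    def success_rate(self, now=None):
        """Fraction of accepted orders inside the window, or None if there were none."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return None
        accepted = sum(1 for _, ok in self.events if ok)
        return accepted / len(self.events)


def is_stale(last_action_ts, max_idle_seconds, now=None):
    """True when the bot has not placed, modified, or cancelled an order recently."""
    now = time.time() if now is None else now
    return (now - last_action_ts) > max_idle_seconds
```

A market-making bot might call `is_stale(last_action_ts, 300)` on every loop and send a warning when it returns True. A `None` success rate means no orders were submitted in the window, which for some strategies is itself worth a look.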

Telegram alerts in Python

Telegram is the path of least resistance. Create a bot via @BotFather, get your chat ID, drop the credentials into environment variables:

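import os
import requests

# Credentials come from the environment so they never land in source control.
TELEGRAM_TOKEN = os.getenv('TELEGRAM_TOKEN')
TELEGRAM_CHAT_ID = os.getenv('TELEGRAM_CHAT_ID')

def alert(message, level='info'):
    """Send a message to the Telegram chat, prefixed with a severity icon."""
    icons = {'info': 'ℹ️', 'warning': '⚠️', 'critical': '🚨'}
    text = f"{icons.get(level, '')} {message}"
    url = f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage"
    try:
        resp = requests.post(url, json={'chat_id': TELEGRAM_CHAT_ID, 'text': text}, timeout=5)
        resp.raise_for_status()  # a bad token or chat ID comes back as an HTTP error, not an exception
    except requests.RequestException as e:
        # A failed alert must not take the bot down. Log only the error type,
        # because the full message can include the URL and therefore the token.
        print(f"Alert failed: {type(e).__name__}")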

Wire `alert()` into your bot's exception handler, order placement logic, and end-of-day P&L summary. Three example uses:

# On startup
alert(f"Bot started: {symbol} strategy v{VERSION}", 'info')

# On unexpected exception
try:
    bot.step()
except Exception as e:
    alert(f"Bot crashed: {type(e).__name__}: {e}", 'critical')
    raise

# On drawdown breach
if daily_loss > MAX_DAILY_LOSS:
    alert(f"Daily loss limit hit: ${daily_loss:.0f} / ${MAX_DAILY_LOSS}", 'critical')
    bot.pause()

That's enough to catch the catastrophic cases.
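The drawdown check above assumes a `daily_loss` value already exists. One way to compute it, sketched under the assumption that the bot can read its account equity on each loop, is to remember the first equity reading of each day and measure against that. `DailyLossTracker` is a hypothetical helper, and the choice of day boundary (here, whatever `today` the caller passes in) is left to the bot.

```python
from datetime import date


class DailyLossTracker:
    """Tracks loss relative to the first equity reading of each day (hypothetical helper)."""

    def __init__(self, max_daily_loss):
        self.max_daily_loss = max_daily_loss
        self._day = None
        self._start_equity = None

    def update(self, equity, today):
        """Record the latest equity; return the loss since the day's first reading."""
        if today != self._day:
            # New day: reset the baseline to the current equity.
            self._day = today
            self._start_equity = equity
        # Report gains as zero loss so the value is never negative.
        return max(0.0, self._start_equity - equity)

    def breached(self, equity, today):
        """True when the loss since the day's first reading exceeds the limit."""
        return self.update(equity, today) > self.max_daily_loss
```

In the main loop this could replace the bare comparison: call `tracker.breached(current_equity, date.today())` and, when it returns True, send the critical alert and pause the bot.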

Healthchecks for liveness

Process-up monitoring needs an external watcher — the bot can't tell you it's down. Healthchecks.io handles this with a simple ping pattern: the bot pings a URL every N minutes, and if it stops pinging, you get an email.

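import os
import time
import requests

HEALTHCHECK_URL = os.getenv('HEALTHCHECK_URL')

def ping():
    """Tell the external watcher the bot is still alive."""
    try:
        requests.get(HEALTHCHECK_URL, timeout=5)
    except requests.RequestException:
        pass  # a failed ping must never interrupt trading

# In your main loop
while True:
    bot.step()
    ping()  # only reached if step() returned, so a crash or hang stops the pings
    time.sleep(60)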

Set the healthcheck grace period to roughly 2-3× your loop interval. If your bot pings every minute, a grace of 2 to 3 minutes works. Missed pings → email + push notification.

The free tier is enough for a single bot.
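Waiting out the grace period is not the only option. Healthchecks.io also accepts explicit signals: appending `/fail` to the ping URL reports a failure immediately, and `/start` marks the beginning of a run. The sketch below builds those URLs and sends a best-effort ping. It uses the standard library's `urllib` instead of `requests` so it runs without extra dependencies, and `build_ping_url` and this `ping` signature are illustrative names, not part of any library.

```python
import urllib.request


def build_ping_url(base_url, signal="success"):
    """Return the ping URL for a success, start, or fail signal."""
    base = base_url.rstrip("/")
    if signal == "success":
        return base
    if signal in ("start", "fail"):
        return f"{base}/{signal}"
    raise ValueError(f"unknown signal: {signal}")


def ping(base_url, signal="success", timeout=5):
    """Best-effort ping; returns True if the request went through."""
    try:
        with urllib.request.urlopen(build_ping_url(base_url, signal), timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

One way to use it is to wrap `bot.step()` in a `try` block, call `ping(HEALTHCHECK_URL, "fail")` in the `except` branch, and re-raise. The watcher then flags the check as down right away instead of after the grace period.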

Position drift detection

Local state and exchange state should match. They sometimes don't — restarts, missed websocket updates, partial fills logged incorrectly. Drift detection runs every 5-10 minutes:

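def check_position_drift():
    local = bot.position_size  # whatever the bot thinks
    positions = exchange.fetch_positions([symbol])
    # The list may be empty when there is no open position; treat that as zero.
    actual = (positions[0]['contracts'] or 0) if positions else 0

    # Compare with a tolerance, since float sizes rarely match exactly.
    if abs(local - actual) > 0.001:
        alert(f"Position drift detected: local={local}, exchange={actual}", 'critical')
        bot.pause()  # safer than continuing with wrong state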

Pause-on-drift is harsher than necessary in most cases — usually the issue is a precision rounding mismatch. But pausing is the safe default. The alert tells you to investigate before resuming.
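To cut down on false positives from rounding, the comparison can be tied to the instrument's lot step rather than a fixed constant. The sketch below is one way to do that. It assumes the exchange position arrives as a dict with an unsigned `contracts` size and a `side` of `'long'` or `'short'` (the shape ccxt's unified position structure uses, but verify it for your exchange), and both helper names are hypothetical.

```python
def signed_contracts(position):
    """Convert an unsigned size plus side into a signed position size."""
    size = position.get("contracts") or 0.0
    return -size if position.get("side") == "short" else size


def has_drift(local, actual, step=0.001):
    """True when the sizes differ by more than half a lot step.

    Half a step ignores floating-point noise but still catches
    a difference of one full lot.
    """
    return abs(local - actual) > step / 2
```

For example, `has_drift(0.1 + 0.2, 0.3)` is False even though the two floats are not exactly equal, while a one-lot mismatch such as `has_drift(1.000, 1.001)` is True.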

Where this falls apart

Alert systems have their own failure modes:

**Alert fatigue.** If you're getting 50 alerts a day, you'll start ignoring them. The Telegram chat needs to surface only what actually requires action. Reserve "critical" for things that need a human in the next 10 minutes; "warning" for things worth looking at within an hour; everything else stays in the log file.

**Single channel.** Telegram down means no alerts. For a bot trading meaningful size, pair Telegram with email or SMS as a backup.

**Self-referential failures.** If the alert code itself crashes (bad token, network gone), you'll never know. Test the alert path during deploy by triggering a fake info-level event and confirming it arrives.

**No way to silence.** Sometimes you're doing maintenance and don't want pages. Build a "muted until" flag into the alert function. Don't rely on phone DND — you'll forget you set it.
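Three of these failure modes (fatigue, single channel, no way to silence) can be handled in one small wrapper around the send function. The following is a sketch under stated assumptions, not a definitive design: `AlertGate` is a hypothetical class, each channel is a callable that returns True on success, and the 10-minute cooldown is an arbitrary default.

```python
import time


class AlertGate:
    """Decides whether an alert goes out and through which channel (hypothetical helper)."""

    def __init__(self, channels, cooldown_seconds=600):
        # channels: ordered list of (name, send) pairs; send(text) returns True on success.
        self.channels = channels
        self.cooldown_seconds = cooldown_seconds
        self.muted_until = 0.0
        self._last_sent = {}  # alert key -> timestamp of last delivery

    def mute_for(self, seconds, now=None):
        """Silence all alerts for a maintenance window that expires on its own."""
        now = time.time() if now is None else now
        self.muted_until = now + seconds

    def send(self, key, text, now=None):
        """Return the name of the channel that delivered, or None if nothing was sent."""
        now = time.time() if now is None else now
        if now < self.muted_until:
            return None  # inside the maintenance window
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return None  # the same alert was delivered recently
        for name, send in self.channels:
            try:
                if send(text):
                    self._last_sent[key] = now
                    return name
            except Exception:
                continue  # this channel is down; try the next one
        return None
```

Note that the `alert()` function shown earlier returns nothing and swallows its own errors, so it would need to return True or False before it could serve as a channel here. In this sketch a mute also suppresses critical alerts; whether criticals should bypass the mute is a judgment call for each bot.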

Frequently Asked Questions

Q: How often should the heartbeat ping run?

Once per main loop iteration. For a 1-minute loop, that's every minute. For a 1-hour loop, that's every hour. The heartbeat is meaningful only if it catches a real silent failure quickly.

Q: Should alerts trigger automated actions?

For some, yes — drawdown breach → pause bot. Position drift → pause bot. For most, no — they're informational. Automating responses to ambiguous signals creates a second layer of bugs that you also need to monitor.

Q: Is paid monitoring worth it?

For a single bot, no. Free Healthchecks.io + Telegram covers 95% of cases. For a portfolio of 5+ bots or institutional setups, structured monitoring (Datadog, Grafana, PagerDuty) pays off because the dashboards become useful.

Q: What do I do when an alert wakes me up?

Have a runbook. For each alert type, the runbook should answer: what does this mean, what's the immediate mitigation, what's the diagnostic step. Without a runbook, 3am decisions are guaranteed to be bad ones.
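A runbook can live next to the alert code, so the message that wakes you already carries the first steps. The sketch below shows one possible shape. The entry text is illustrative only, and `RUNBOOK` and `with_runbook` are hypothetical names.

```python
# Illustrative entries only; write your own for each alert type your bot sends.
RUNBOOK = {
    "position_drift": {
        "meaning": "Local position differs from the exchange position.",
        "mitigation": "Keep the bot paused; do not resume on stale state.",
        "diagnostic": "Compare recent fills on the exchange with the bot's log.",
    },
    "daily_loss": {
        "meaning": "Loss since the start of the day exceeded the limit.",
        "mitigation": "Keep the bot paused for the rest of the day.",
        "diagnostic": "Review the day's trades for unusual size or frequency.",
    },
}


def with_runbook(alert_type, message):
    """Append the runbook entry for alert_type to the message, if one exists."""
    entry = RUNBOOK.get(alert_type)
    if entry is None:
        return message
    return (
        f"{message}\n"
        f"Meaning: {entry['meaning']}\n"
        f"Mitigation: {entry['mitigation']}\n"
        f"Diagnostic: {entry['diagnostic']}"
    )
```

The result can be passed straight to the alert function, for example `alert(with_runbook("position_drift", "Position drift detected"), 'critical')`.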
