A bot that crashes silently at 3am loses money for hours before anyone notices. The cheapest insurance you can buy as an algo trader is the alert that wakes you up when something is wrong.
This article walks through what to monitor on a live trading bot, how to wire up Telegram and healthchecks in Python, and the alert thresholds that actually matter.
Three failure modes account for the vast majority of bot downtime:
1. **Process crashes** — unhandled exception, OOM kill, disk full. Bot stops. Stop-loss orders may not fire. Open positions sit unmanaged.
2. **API errors** — exchange rate limits, transient network issues, key expiry. Bot is running but every order fails.
3. **Behavioural drift** — bot is "running" but trading wildly differently than expected. Position size 10× normal, or signals firing every minute instead of every hour.
The first two are easy to detect: the process is either alive or it isn't, and orders either go through or they fail. The third is harder, and tends to be the most expensive when it happens.
Five signals cover most of what matters:
A complete monitoring setup checks all five. A minimum setup checks the first three.
Telegram is the path of least resistance. Create a bot via @BotFather, get your chat ID, drop the credentials into environment variables:
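A minimal sketch of an `alert()` helper, assuming the credentials live in environment variables named `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` (both names are illustrative, as is the `build_request` helper). It uses only the standard library and the Bot API `sendMessage` method, and it returns `False` instead of raising so a broken alert path cannot take down the trading loop:

```python
import os
import urllib.parse
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build the POST request for the Bot API sendMessage method."""
    url = f"{TELEGRAM_API}/bot{token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=body, method="POST")


def alert(message: str, level: str = "info") -> bool:
    """Send a message to Telegram. Returns True on success and never raises."""
    # Assumed environment variable names; rename to match your deployment.
    token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    if not token or not chat_id:
        return False
    text = f"[{level.upper()}] {message}"
    try:
        request = build_request(token, chat_id, text)
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except Exception:
        # Alerting is best-effort: swallow network and HTTP errors.
        return False
```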
Wire `alert()` into your bot's exception handler, order placement logic, and end-of-day P&L summary. Three example uses:
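One way those three hooks might look. The function names (`run_guarded`, `place_order_with_alert`, `send_daily_summary`) and the `place_order` callable are hypothetical; the alert function is passed in as a parameter so the sketch stays independent of any particular exchange client or alert backend:

```python
from typing import Callable, Optional

# Any callable taking (message, level), for example a Telegram alert helper.
Alert = Callable[[str, str], object]


def run_guarded(step: Callable[[], None], alert: Alert) -> None:
    """1. Exception handler: report the crash, then re-raise so a supervisor can restart."""
    try:
        step()
    except Exception as exc:
        alert(f"Bot crashed: {type(exc).__name__}: {exc}", "critical")
        raise


def place_order_with_alert(
    place_order: Callable[[str, float], object],
    alert: Alert,
    symbol: str,
    qty: float,
) -> Optional[object]:
    """2. Order placement: report a failed order instead of failing silently."""
    try:
        return place_order(symbol, qty)
    except Exception as exc:
        alert(f"Order failed for {qty} {symbol}: {exc}", "warning")
        return None


def send_daily_summary(alert: Alert, realized_pnl: float, trades: int) -> None:
    """3. End-of-day summary: one informational message per day."""
    alert(f"Daily P&L: {realized_pnl:+.2f} over {trades} trades", "info")
```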
That's enough to catch the catastrophic cases.
Process-up monitoring needs an external watcher — the bot can't tell you it's down. Healthchecks.io handles this with a simple ping pattern: the bot pings a URL every N minutes, and if it stops pinging, you get an email.
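A sketch of the ping side, assuming the check's ping URL (of the form `https://hc-ping.com/<uuid>`) is stored in an environment variable called `HEALTHCHECK_URL`; that variable name and the `opener` parameter are illustrative. Appending `/fail` to the ping URL signals an explicit failure rather than waiting for the check to time out:

```python
import os
import urllib.request


def ping_url(base_url: str, failed: bool = False) -> str:
    """Return the success ping URL, or the /fail variant for explicit failures."""
    return base_url.rstrip("/") + ("/fail" if failed else "")


def heartbeat(failed: bool = False, opener=urllib.request.urlopen) -> bool:
    """Ping the healthcheck once. Returns True if the request went through."""
    base_url = os.environ.get("HEALTHCHECK_URL")  # assumed variable name
    if not base_url:
        return False
    try:
        with opener(ping_url(base_url, failed), timeout=10):
            return True
    except Exception:
        # A failed ping must never crash the bot; the missed ping is the signal.
        return False
```

Calling `heartbeat()` at the end of each successful loop iteration means a hung or crashed loop stops pinging, which is exactly what the external watcher needs to see.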
Set the healthcheck grace period to roughly 2-3× your loop interval, so a single slow iteration doesn't page you. If your bot pings every minute, a grace of 2 to 3 minutes works. Missed pings → email + push notification.
The free tier includes multiple checks, more than enough for a single bot.
Local state and exchange state should match. They sometimes don't — restarts, missed websocket updates, partial fills logged incorrectly. Drift detection runs every 5-10 minutes:
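A minimal sketch of such a check, assuming positions are available as `{symbol: Decimal quantity}` dictionaries from both the local book and the exchange. The function names, the tolerance value, and the `alert`/`pause` callables are all illustrative:

```python
from decimal import Decimal
from typing import Callable, Dict, Tuple

Positions = Dict[str, Decimal]


def find_drift(
    local: Positions,
    exchange: Positions,
    tolerance: Decimal = Decimal("0.00000001"),  # assumed; tune per instrument
) -> Dict[str, Tuple[Decimal, Decimal]]:
    """Return {symbol: (local_qty, exchange_qty)} for every mismatch above tolerance."""
    drifts = {}
    for symbol in sorted(set(local) | set(exchange)):
        local_qty = local.get(symbol, Decimal("0"))
        exchange_qty = exchange.get(symbol, Decimal("0"))
        if abs(local_qty - exchange_qty) > tolerance:
            drifts[symbol] = (local_qty, exchange_qty)
    return drifts


def check_positions(
    local: Positions,
    exchange: Positions,
    alert: Callable[[str, str], object],
    pause: Callable[[], None],
) -> bool:
    """Pause first, then alert, whenever the two views disagree. True means in sync."""
    drifts = find_drift(local, exchange)
    if drifts:
        pause()
        details = ", ".join(
            f"{symbol} local={lq} exchange={eq}" for symbol, (lq, eq) in drifts.items()
        )
        alert(f"Position drift, bot paused: {details}", "critical")
    return not drifts
```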
Pause-on-drift is harsher than necessary in most cases — usually the issue is a precision rounding mismatch. But pausing is the safe default. The alert tells you to investigate before resuming.
Alert systems have their own failure modes:
**Alert fatigue.** If you're getting 50 alerts a day, you'll start ignoring them. The Telegram chat needs to surface only what actually requires action. Reserve "critical" for things that need a human in the next 10 minutes; "warning" for things worth looking at within an hour; everything else stays in the log file.
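One possible way to enforce that split in code: every event goes to the log file, and only the two actionable levels are forwarded to the chat. The `route` function, the logger name, and the `send` callable are illustrative:

```python
import logging
from typing import Callable

log = logging.getLogger("bot.alerts")

# Assumption: only these levels are worth a chat notification.
PAGE_LEVELS = {"critical", "warning"}
LOG_LEVELS = {"critical": logging.ERROR, "warning": logging.WARNING}


def route(level: str, message: str, send: Callable[[str], object]) -> bool:
    """Log every event; forward only actionable ones. Returns True if forwarded."""
    log.log(LOG_LEVELS.get(level, logging.INFO), "%s", message)
    if level in PAGE_LEVELS:
        send(f"[{level.upper()}] {message}")
        return True
    return False
```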
**Single channel.** Telegram down means no alerts. For a bot trading meaningful size, pair Telegram with email or SMS as a backup.
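A sketch of a simple fallback chain, assuming each channel is wrapped as a function that returns `True` on successful delivery. The `send_with_fallback` name and the channel labels are hypothetical:

```python
from typing import Callable, Optional, Sequence, Tuple

# (label, sender) pairs; each sender returns True if the message was delivered.
Channel = Tuple[str, Callable[[str], bool]]


def send_with_fallback(message: str, channels: Sequence[Channel]) -> Optional[str]:
    """Try channels in order. Return the label of the first that delivers, else None."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # A broken channel should not stop us from trying the next one.
            continue
    return None
```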
**Self-referential failures.** If the alert code itself crashes (bad token, network gone), you'll never know. Test the alert path during deploy by triggering a fake info-level event and confirming it arrives.
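A sketch of a deploy-time self-test, assuming the alert function returns a truthy value on success. The `startup_self_test` name and the message text are illustrative:

```python
from typing import Callable


def startup_self_test(send: Callable[[str], object]) -> None:
    """Send a fake info-level event; refuse to start if the alert path is broken."""
    try:
        delivered = bool(send("Deploy self-test: alert path check"))
    except Exception:
        delivered = False
    if not delivered:
        raise RuntimeError("Alert path is broken; refusing to start trading")
```

A truthy return only shows that the request was accepted, so it is still worth checking that the message actually shows up on your phone.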
**No way to silence.** Sometimes you're doing maintenance and don't want pages. Build a "muted until" flag into the alert function. Don't rely on phone DND — you'll forget you set it.
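A minimal sketch of a "muted until" flag with automatic expiry. The `Muter` class and `guarded_alert` function are hypothetical names, and the clock is injectable only to make the behaviour easy to test:

```python
import time
from typing import Callable


class Muter:
    """Holds a 'muted until' timestamp that expires on its own."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self.muted_until = 0.0

    def mute_for(self, seconds: float) -> None:
        self.muted_until = self._clock() + seconds

    def is_muted(self) -> bool:
        return self._clock() < self.muted_until


def guarded_alert(muter: Muter, send: Callable[[str], object], message: str) -> bool:
    """Send unless muted. Returns True if the message was sent."""
    if muter.is_muted():
        return False
    send(message)
    return True
```

Because the mute carries its own expiry, forgetting to unmute after maintenance costs at most the remaining window.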
Q: How often should the heartbeat ping run?
Once per main loop iteration. For a 1-minute loop, that's every minute. For a 1-hour loop, that's every hour. The heartbeat is meaningful only if it catches a real silent failure quickly.
Q: Should alerts trigger automated actions?
For some, yes — drawdown breach → pause bot. Position drift → pause bot. For most, no — they're informational. Automating responses to ambiguous signals creates a second layer of bugs that you also need to monitor.
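The drawdown case is one of the few unambiguous ones, which is what makes it safe to automate. A sketch, assuming drawdown is measured as the fractional drop from peak equity; the function names and the `pause`/`alert` callables are illustrative:

```python
from typing import Callable


def drawdown(peak_equity: float, equity: float) -> float:
    """Fractional drop from the peak, e.g. 0.1 for a 10% drawdown."""
    if peak_equity <= 0:
        return 0.0
    return max(0.0, (peak_equity - equity) / peak_equity)


def check_drawdown(
    peak_equity: float,
    equity: float,
    limit: float,
    pause: Callable[[], None],
    alert: Callable[[str, str], object],
) -> bool:
    """Pause and alert when drawdown reaches the limit. Returns True on breach."""
    current = drawdown(peak_equity, equity)
    if current >= limit:
        pause()
        alert(f"Drawdown {current:.1%} breached limit {limit:.1%}; bot paused", "critical")
        return True
    return False
```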
Q: Is paid monitoring worth it?
For a single bot, no. Free Healthchecks.io + Telegram covers the vast majority of cases. For a portfolio of 5+ bots or institutional setups, structured monitoring (Datadog, Grafana, PagerDuty) pays off because shared dashboards and on-call routing start to earn their keep.
Q: What do I do when an alert wakes me up?
Have a runbook. For each alert type, the runbook should answer: what does this mean, what's the immediate mitigation, what's the diagnostic step. Without a runbook, 3am decisions are guaranteed to be bad ones.