How to use this page
When a channel incident happens, do not start by changing random configs.
Use this order:
- locate the failing layer
- apply one fix only
- re-run a quick regression immediately
On-call targets
Lock these three targets into your dashboard first:
- delivery success rate at or above 99 percent
- p95 response latency within 8 seconds
- reconnect count no more than 2 per channel per day
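The three targets above can be checked mechanically. A minimal sketch, assuming a metrics snapshot dict with these (hypothetical) field names:

```python
# Thresholds mirror the three on-call targets above.
TARGETS = {
    "delivery_success_rate": 0.99,    # at or above
    "p95_latency_seconds": 8.0,       # at or below
    "reconnects_per_channel_day": 2,  # at or below
}

def breached_targets(snapshot: dict) -> list[str]:
    """Return the names of any targets this snapshot violates."""
    breaches = []
    if snapshot["delivery_success_rate"] < TARGETS["delivery_success_rate"]:
        breaches.append("delivery_success_rate")
    if snapshot["p95_latency_seconds"] > TARGETS["p95_latency_seconds"]:
        breaches.append("p95_latency_seconds")
    if snapshot["reconnects_per_channel_day"] > TARGETS["reconnects_per_channel_day"]:
        breaches.append("reconnects_per_channel_day")
    return breaches
```

Wire this to whatever exposes your dashboard numbers; the snapshot shape here is an assumption, not a real API.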
Required metrics
- ingress volume per channel (5-minute windows)
- error rate per channel by code type
- session routing conflict count
- reconnect count
- last successful egress timestamp
- channel timeout ratio
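For the first metric (ingress volume per channel in 5-minute windows), one way to bucket raw events, sketched with a hypothetical `(timestamp_seconds, channel)` event shape:

```python
from collections import Counter

def bucket_5min(events):
    """Group (timestamp_seconds, channel) events into 5-minute windows.

    Returns a Counter keyed by (window_start_seconds, channel) --
    a minimal in-memory sketch of the ingress-volume metric.
    """
    counts = Counter()
    for ts, channel in events:
        window_start = int(ts) // 300 * 300  # floor to 5-minute boundary
        counts[(window_start, channel)] += 1
    return counts
```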
Alert thresholds (starter set)
- P1: no successful egress for 5 minutes
- P1: error rate above 10 percent for 10 minutes
- P2: p95 latency above 12 seconds for 15 minutes
- P2: reconnect count above 5 per hour
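The first P1 rule is the simplest to automate: compare the "last successful egress timestamp" metric against the clock. A minimal sketch with hypothetical names:

```python
def egress_stalled(last_egress_ts: float, now: float, limit_s: int = 300) -> bool:
    """P1 check: True when no successful egress for longer than limit_s.

    last_egress_ts is the 'last successful egress timestamp' metric
    (Unix seconds); limit_s defaults to the 5-minute threshold above.
    """
    return (now - last_egress_ts) > limit_s
```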
First-10-minute SOP
- ingress check: are messages arriving?
- routing check: is session key correct?
- execution check: did model/tool call fail?
- egress check: did platform API time out or reject?
Do not change multiple layers at once.
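The four checks above can be sketched as an ordered triage loop that stops at the first failing layer, which also enforces the one-fix-at-a-time rule. The check functions here are placeholders you would wire to real probes:

```python
def triage(checks):
    """checks: ordered list of (layer_name, check_fn) pairs.

    Runs each probe in order and returns the first failing layer,
    or None when every layer passes.
    """
    for layer, check in checks:
        if not check():
            return layer
    return None

# Example: the egress probe fails, so triage points at "egress"
# and never touches the layers after it.
checks = [
    ("ingress",   lambda: True),   # are messages arriving?
    ("routing",   lambda: True),   # is the session key correct?
    ("execution", lambda: True),   # did the model/tool call fail?
    ("egress",    lambda: False),  # did the platform API time out?
]
```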
High-frequency incident mappings
- Telegram out-of-order messages
  Action: verify timestamp ordering and session-key grouping
  Source: #45596
- Discord event timeout
  Action: inspect webhook delay, then downstream call timeout
  Source: #45589
- WhatsApp listener stopped after ~24h
  Action: verify keepalive/reconnect policy, then auth status
  Source: #45581
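For the WhatsApp mapping, the reconnect policy to verify usually means capped exponential backoff with jitter, so a dropped listener retries without hammering the platform. A minimal sketch (not any specific library's API):

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Yield reconnect delays: capped exponential backoff with full jitter.

    Delay for attempt n is uniform in [0, min(cap, base * 2**n)],
    so retries spread out but never exceed the cap.
    """
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A reconnect loop would sleep for each yielded delay before retrying, then re-check auth status once the connection is back.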
Bottom line
If you keep only one channel operations document, keep this page.
It gives a repeatable incident flow that a new on-call engineer can execute without guessing.
Next step: Ops monitoring baseline.