How to use this page
When a channel incident happens, do not start by changing random configs.
Use this order:
- locate the failing layer
- apply one fix only
- re-run a quick regression immediately
On-call targets
Lock these three targets into your dashboard first:
- delivery success rate at or above 99 percent
- p95 response latency within 8 seconds
- reconnect count no more than 2 per channel per day
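The three targets above can be checked mechanically. A minimal sketch, assuming a metrics snapshot dict with these (hypothetical) field names:

```python
# Thresholds mirror the three on-call targets above.
TARGETS = {
    "delivery_success_rate": 0.99,    # at or above
    "p95_latency_seconds": 8.0,       # at or below
    "reconnects_per_channel_day": 2,  # at or below
}

def breached_targets(snapshot: dict) -> list[str]:
    """Return the names of any targets this snapshot violates."""
    breaches = []
    if snapshot["delivery_success_rate"] < TARGETS["delivery_success_rate"]:
        breaches.append("delivery_success_rate")
    if snapshot["p95_latency_seconds"] > TARGETS["p95_latency_seconds"]:
        breaches.append("p95_latency_seconds")
    if snapshot["reconnects_per_channel_day"] > TARGETS["reconnects_per_channel_day"]:
        breaches.append("reconnects_per_channel_day")
    return breaches
```

Wire this to whatever exposes your dashboard numbers; the snapshot shape here is an assumption, not a real API.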
Required metrics
- ingress volume per channel (5-minute windows)
- error rate per channel by code type
- session routing conflict count
- reconnect count
- last successful egress timestamp
- channel timeout ratio
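For the first metric (ingress volume per channel in 5-minute windows), one way to bucket raw events, sketched with a hypothetical `(timestamp_seconds, channel)` event shape:

```python
from collections import Counter

def bucket_5min(events):
    """Group (timestamp_seconds, channel) events into 5-minute windows.

    Returns a Counter keyed by (window_start_seconds, channel) --
    a minimal in-memory sketch of the ingress-volume metric.
    """
    counts = Counter()
    for ts, channel in events:
        window_start = int(ts) // 300 * 300  # floor to 5-minute boundary
        counts[(window_start, channel)] += 1
    return counts
```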
Alert thresholds (starter set)
- P1: no successful egress for 5 minutes
- P1: error rate above 10 percent for 10 minutes
- P2: p95 latency above 12 seconds for 15 minutes
- P2: reconnect count above 5 per hour
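The first P1 rule is the simplest to automate: compare the "last successful egress timestamp" metric against the clock. A minimal sketch with hypothetical names:

```python
def egress_stalled(last_egress_ts: float, now: float, limit_s: int = 300) -> bool:
    """P1 check: True when no successful egress for longer than limit_s.

    last_egress_ts is the 'last successful egress timestamp' metric
    (Unix seconds); limit_s defaults to the 5-minute threshold above.
    """
    return (now - last_egress_ts) > limit_s
```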
First-10-minute SOP
- ingress check: are messages arriving?
- routing check: is session key correct?
- execution check: did model/tool call fail?
- egress check: did platform API time out or reject?
Do not change multiple layers at once.
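The four checks above can be sketched as an ordered triage loop that stops at the first failing layer, which also enforces the one-fix-at-a-time rule. The check functions here are placeholders you would wire to real probes:

```python
def triage(checks):
    """checks: ordered list of (layer_name, check_fn) pairs.

    Runs each probe in order and returns the first failing layer,
    or None when every layer passes.
    """
    for layer, check in checks:
        if not check():
            return layer
    return None

# Example: the egress probe fails, so triage points at "egress"
# and never touches the layers after it.
checks = [
    ("ingress",   lambda: True),   # are messages arriving?
    ("routing",   lambda: True),   # is the session key correct?
    ("execution", lambda: True),   # did the model/tool call fail?
    ("egress",    lambda: False),  # did the platform API time out?
]
```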
High-frequency incident mappings
- Telegram out-of-order messages
  Action: verify timestamp ordering and session-key grouping
  Source: #45596
- Discord event timeout
  Action: inspect webhook delay, then downstream call timeout
  Source: #45589
- WhatsApp listener stopped after ~24h
  Action: verify keepalive/reconnect policy, then auth status
  Source: #45581
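For the WhatsApp mapping, the reconnect policy to verify usually means capped exponential backoff with jitter, so a dropped listener retries without hammering the platform. A minimal sketch (not any specific library's API):

```python
import random

def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0):
    """Yield reconnect delays: capped exponential backoff with full jitter.

    Delay for attempt n is uniform in [0, min(cap, base * 2**n)],
    so retries spread out but never exceed the cap.
    """
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A reconnect loop would sleep for each yielded delay before retrying, then re-check auth status once the connection is back.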
Bottom line
If you keep only one channel operations document, keep this page.
It gives a repeatable incident flow that a new on-call engineer can execute without guessing.
Next step: Ops monitoring baseline.