Channel on-call runbook: Telegram / Discord / WhatsApp

One-page operational SOP with concrete metrics, thresholds, and first-10-minute actions.

How to use this page

When a channel incident happens, do not start by changing random configs.

Use this order:

  1. locate the failing layer
  2. apply one fix only
  3. re-run a quick regression immediately

On-call target

Lock these three targets into your dashboard first:

  1. delivery success rate at or above 99 percent
  2. p95 response latency within 8 seconds
  3. reconnect count no more than 2 per channel per day
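
The three targets above can be encoded as a small daily check. This is a minimal sketch, not a prescribed implementation: the `ChannelDay` record and the `missed_targets` helper are hypothetical names, and it assumes you can already export delivered/attempted counts, p95 latency, and reconnect counts per channel per day.

```python
from dataclasses import dataclass

# Targets copied from the list above.
MIN_DELIVERY_SUCCESS = 0.99   # delivery success rate at or above 99 percent
MAX_P95_LATENCY_S = 8.0       # p95 response latency within 8 seconds
MAX_RECONNECTS_PER_DAY = 2    # no more than 2 reconnects per channel per day


@dataclass
class ChannelDay:
    """One day of stats for one channel (hypothetical record shape)."""
    channel: str
    delivered: int
    attempted: int
    p95_latency_s: float
    reconnects: int


def missed_targets(day: ChannelDay) -> list[str]:
    """Return the names of the targets this channel missed for the day."""
    missed = []
    # Assumption: a day with no attempted deliveries counts as meeting the target.
    success = day.delivered / day.attempted if day.attempted else 1.0
    if success < MIN_DELIVERY_SUCCESS:
        missed.append("delivery_success_rate")
    if day.p95_latency_s > MAX_P95_LATENCY_S:
        missed.append("p95_latency")
    if day.reconnects > MAX_RECONNECTS_PER_DAY:
        missed.append("reconnect_count")
    return missed
```

An empty list means the channel met all three targets for that day.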

Required metrics

  1. ingress volume per channel (5-minute windows)
  2. error rate per channel by code type
  3. session routing conflict count
  4. reconnect count
  5. last successful egress timestamp
  6. channel timeout ratio
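
As an illustration of metrics 1 and 2, the sketch below buckets ingress events into 5-minute windows and computes the error rate per channel by code. The input shapes (tuples of channel, timestamp, and error code) and the function names are assumptions for the example; adapt them to whatever your log pipeline actually emits.

```python
from collections import Counter, defaultdict

WINDOW_S = 300  # 5-minute windows, as in metric 1


def window_start(ts: float) -> int:
    """Round a Unix timestamp down to the start of its 5-minute window."""
    return int(ts // WINDOW_S) * WINDOW_S


def ingress_volume(events):
    """events: iterable of (channel, unix_ts).

    Returns a Counter keyed by (channel, window_start).
    """
    counts = Counter()
    for channel, ts in events:
        counts[(channel, window_start(ts))] += 1
    return counts


def error_rate_by_code(results):
    """results: iterable of (channel, error_code), with None meaning success.

    Returns {channel: {error_code: fraction of that channel's results}}.
    """
    totals = Counter()
    errors = defaultdict(Counter)
    for channel, code in results:
        totals[channel] += 1
        if code is not None:
            errors[channel][code] += 1
    return {
        ch: {code: n / totals[ch] for code, n in errors[ch].items()}
        for ch in totals
    }
```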

Alert thresholds (starter set)

  • P1: no successful egress for 5 minutes
  • P1: error rate above 10 percent for 10 minutes
  • P2: p95 latency above 12 seconds for 15 minutes
  • P2: reconnect count above 5 per hour
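
The starter thresholds can be written as one evaluation function. This is a sketch under stated assumptions: `evaluate_alerts` and `sustained_above` are hypothetical helpers, and "above X for N minutes" is interpreted as every sample in the window exceeding X, with the caller supplying samples that cover the full window (for example, one per minute).

```python
def sustained_above(samples, threshold):
    """True if there is at least one sample and every sample exceeds threshold."""
    return bool(samples) and all(s > threshold for s in samples)


def evaluate_alerts(seconds_since_last_egress, error_rate_10m,
                    p95_latency_15m, reconnects_last_hour):
    """Return a list of (severity, name) alerts for one channel.

    error_rate_10m: error-rate samples (0.0 to 1.0) covering the last 10 minutes.
    p95_latency_15m: p95 latency samples in seconds covering the last 15 minutes.
    """
    alerts = []
    if seconds_since_last_egress >= 5 * 60:
        alerts.append(("P1", "no_successful_egress"))
    if sustained_above(error_rate_10m, 0.10):
        alerts.append(("P1", "error_rate"))
    if sustained_above(p95_latency_15m, 12.0):
        alerts.append(("P2", "p95_latency"))
    if reconnects_last_hour > 5:
        alerts.append(("P2", "reconnect_count"))
    return alerts
```

Requiring every sample in the window to breach the threshold avoids paging on a single spike; if your alerting system has its own "for" duration setting, that would replace `sustained_above`.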

First-10-minute SOP

  1. ingress check: are messages arriving?
  2. routing check: is the session key correct?
  3. execution check: did the model or tool call fail?
  4. egress check: did the platform API time out or reject the message?

Do not change multiple layers at once.
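
The four checks are meant to run in order and stop at the first failure, so that only one layer is touched. A minimal sketch of that flow follows; `first_failing_layer` and the check callables are hypothetical, and each callable is assumed to return True when its layer looks healthy.

```python
from typing import Callable, Optional

# Order taken from the first-10-minute SOP.
LAYER_ORDER = ("ingress", "routing", "execution", "egress")


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the layer checks in SOP order and return the first unhealthy layer.

    Later checks are not run once a failure is found, which keeps the
    investigation (and the fix) confined to a single layer.
    Returns None if every layer passes.
    """
    for layer in LAYER_ORDER:
        if not checks[layer]():
            return layer
    return None
```

For example, if messages are arriving but the routing check fails, the function returns "routing" and the execution and egress checks are never invoked.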

High-frequency incident mappings

  1. Telegram out-of-order messages
    Action: verify timestamp ordering and session-key grouping
    Source: #45596

  2. Discord event timeout
    Action: inspect webhook delay, then downstream call timeout
    Source: #45589

  3. WhatsApp listener stopped after ~24h
    Action: verify keepalive/reconnect policy, then auth status
    Source: #45581
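
For the first mapping, the ordering check can be sketched as below. This is an illustration only, not the fix from the referenced issue: the message tuple shape and the `out_of_order` helper are assumptions, and it presumes each message carries a session key and a sender-side timestamp.

```python
def out_of_order(messages):
    """Find messages that arrived after a newer message in the same session.

    messages: iterable of (session_key, message_id, sent_ts) in arrival order.
    Returns a list of (session_key, message_id) for the late arrivals.
    """
    latest = {}  # session_key -> newest sent_ts seen so far
    late = []
    for key, message_id, sent_ts in messages:
        if key in latest and sent_ts < latest[key]:
            late.append((key, message_id))
        else:
            latest[key] = sent_ts
    return late
```

Grouping by session key matters here: messages from different sessions can interleave freely without being out of order.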

Bottom line

If you keep only one channel operations document, keep this page.
It gives a repeatable incident flow that a new on-call engineer can execute without guessing.

Next step: Ops monitoring baseline.