Ops implementation: Cron + Heartbeat + Auth Monitoring

Concrete starter setup with schedules, thresholds, and daily operator checklist.

What this page does

This page is for day-one operations hardening. You can use it to move from "it runs" to "it is operable."

Today’s minimum deliverable

Ship these three jobs first:

  1. health check every 5 minutes
  2. auth expiry check every hour
  3. daily execution report at 09:00

Starter schedule (copy and adapt)

*/5 * * * * /opt/openclaw/ops/health_check.sh
0 * * * * /opt/openclaw/ops/auth_expiry_check.sh
0 9 * * * /opt/openclaw/ops/daily_job_report.sh

If you do not use cron, keep an equivalent cadence in whatever scheduler you run.
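
For a non-cron scheduler, the three cadences can be sketched as next-run calculations. This is a minimal sketch in Python, not part of OpenClaw; the function names are hypothetical, and it assumes naive local wall-clock times with no DST handling.

```python
from datetime import datetime, timedelta


def next_health_check(now: datetime) -> datetime:
    """Next 5-minute boundary strictly after `now` (mirrors */5 * * * *)."""
    base = now.replace(second=0, microsecond=0)
    return base + timedelta(minutes=5 - base.minute % 5)


def next_auth_check(now: datetime) -> datetime:
    """Next top of the hour strictly after `now` (mirrors 0 * * * *)."""
    base = now.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(hours=1)


def next_daily_report(now: datetime) -> datetime:
    """Next 09:00 strictly after `now` (mirrors 0 9 * * *)."""
    candidate = now.replace(hour=9, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

A scheduler loop would sleep until the earliest of the three times, run that job, and recompute.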

Required output fields

For each job, log:

  • taskId
  • channel
  • status
  • durationMs
  • retryCount
  • errorCode (when failed)

Without these fields, incident replay becomes slow, because each run has to be reconstructed by hand from unstructured logs.
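
One way to emit the fields above is a single JSON line per run. The following is a minimal sketch in Python, not OpenClaw's own logging; the helper names, the "ok"/"failed" status values, and the use of the exception class name as errorCode are assumptions for illustration.

```python
import json
import time
from typing import Callable, Optional


def build_job_record(task_id: str, channel: str, status: str,
                     duration_ms: int, retry_count: int,
                     error_code: Optional[str] = None) -> dict:
    """Build one log record carrying the required fields."""
    record = {
        "taskId": task_id,
        "channel": channel,
        "status": status,
        "durationMs": duration_ms,
        "retryCount": retry_count,
    }
    # errorCode is only attached to failed runs.
    if status == "failed":
        record["errorCode"] = error_code or "UNKNOWN"
    return record


def run_job(task_id: str, channel: str, job: Callable[[], None],
            max_retries: int = 2) -> dict:
    """Run `job`, retrying on any exception, then log one JSON line."""
    start = time.monotonic()
    status, error_code, retry_count = "failed", None, 0
    for attempt in range(max_retries + 1):
        retry_count = attempt  # number of retries used so far
        try:
            job()
            status, error_code = "ok", None
            break
        except Exception as exc:  # assumption: exception class name as code
            error_code = type(exc).__name__
    duration_ms = int((time.monotonic() - start) * 1000)
    record = build_job_record(task_id, channel, status,
                              duration_ms, retry_count, error_code)
    print(json.dumps(record))
    return record
```

One line per run keeps the output easy to filter by taskId or channel during a replay.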

Alert thresholds

  • P1: 3 consecutive health-check failures within 10 minutes (three runs at the 5-minute cadence)
  • P1: auth invalid on any critical channel
  • P2: hourly failure rate above 5%
  • P2: p95 runtime more than 2 times baseline

Daily 10-minute operator checklist

  1. check whether yesterday's success rate dropped below 98%
  2. inspect the top 5 failed jobs
  3. check the retry trend
  4. check for tokens expiring within 24h
  5. replay at least one incident chain
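
Checklist items 1, 2, and 4 can be partly automated. This is a minimal sketch in Python, assuming job records are dicts with the taskId and status fields listed earlier and that token expiry times are available per channel; the function names and input shapes are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


def success_rate(records: List[dict]) -> Optional[float]:
    """Fraction of records with status "ok", or None when there are none."""
    if not records:
        return None
    ok = sum(1 for r in records if r["status"] == "ok")
    return ok / len(records)


def top_failed_jobs(records: List[dict], n: int = 5) -> List[Tuple[str, int]]:
    """The n task IDs with the most failures, most frequent first."""
    failures = Counter(r["taskId"] for r in records if r["status"] == "failed")
    return failures.most_common(n)


def tokens_expiring_soon(expiry_by_channel: Dict[str, datetime],
                         now: datetime,
                         horizon: timedelta = timedelta(hours=24)) -> List[str]:
    """Channels whose token expires within the horizon.

    Already-expired tokens are included, since they also need action.
    """
    return sorted(ch for ch, expires_at in expiry_by_channel.items()
                  if expires_at - now <= horizon)
```

A daily report script could print these three results and flag a success rate below 0.98.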

Bottom line

Cron answers trigger timing, Heartbeat answers liveness, Auth Monitoring answers callability.
You need all three to run OpenClaw as an operated system.

Next step: Channel reliability runbook.