PaaS Incident Response Runbook
Last updated: 2026-02-17
Scope: operational response for `si paas` incidents
Severity Model
| Severity | Definition | Example |
|---|---|---|
| Sev-1 | Active production outage or data-loss risk | all targets failed deploy/rollback, repeated health failures with no known-good rollback |
| Sev-2 | Partial degradation with service impact | one target unhealthy, alert routing broken, webhook ingest failing |
| Sev-3 | Non-critical defect or tooling regression | stale events, warning-only drift reports, documentation mismatch |
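The table can be read as a small decision rule. The sketch below is one possible encoding, not part of the `si` tooling: the `IncidentSignals` fields are hypothetical names for what a responder observes, and treating "any failed target with no known-good rollback" as Sev-1 is an interpretation of the Sev-1 example.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical summary of responder observations (not an si paas type)."""
    targets_total: int
    targets_failed: int            # targets failing deploy/rollback or health checks
    known_good_rollback: bool      # a known-good release exists to roll back to
    data_loss_risk: bool = False
    service_impact: bool = False   # e.g. alert routing broken, webhook ingest failing


def classify_severity(s: IncidentSignals) -> str:
    """Map observed signals onto the Sev-1/2/3 definitions in the table."""
    all_failed = s.targets_total > 0 and s.targets_failed == s.targets_total
    no_way_back = s.targets_failed > 0 and not s.known_good_rollback
    if s.data_loss_risk or all_failed or no_way_back:
        return "Sev-1"
    if s.targets_failed > 0 or s.service_impact:
        return "Sev-2"
    return "Sev-3"
```

For example, one unhealthy target out of three with a known-good rollback available classifies as Sev-2 under these assumptions.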
Response Workflow
- Detect and classify:
  - Check `si paas events list --json` for recent `failed` or `critical` records.
  - Review `si paas alert history --json` for channel delivery and callback hints.
- Stabilize:
  - Halt risky rollouts (`--continue-on-error=false`, avoid parallel fanout during the incident).
  - If needed, execute a controlled rollback: `si paas rollback --app <app> --apply`.
- Diagnose:
  - Inspect per-target logs: `si paas logs --app <app> --target <id> --tail 400`.
  - Reconcile runtime drift: `si paas deploy reconcile --app <app> --json`.
  - Validate target/runtime baseline: `si paas target check --all --json`.
- Recover:
  - Re-deploy the fixed release or roll back to a known-good release.
  - Acknowledge the incident action trail: `si paas alert acknowledge --id <alert_id>`.
- Post-incident:
  - Capture root cause, timeline, and follow-up items.
  - Add or update failure drill coverage if a new failure mode was discovered.
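The detect step can be partly automated by filtering the JSON event output for `failed` or `critical` records. This is a minimal sketch that assumes the output is a JSON array of objects with optional `status` and `severity` string fields; that schema is an assumption, so adjust the field names to what `si paas events list --json` actually emits.

```python
import json


def urgent_events(events_json: str) -> list[dict]:
    """Return records whose status or severity is 'failed' or 'critical'.

    Assumed schema (hypothetical): a JSON array of objects, each with
    optional "status" and "severity" string fields.
    """
    records = json.loads(events_json)
    wanted = {"failed", "critical"}
    return [
        r for r in records
        if str(r.get("status", "")).lower() in wanted
        or str(r.get("severity", "")).lower() in wanted
    ]
```

In practice the input string would be the captured stdout of the events command, saved alongside the incident transcript.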
Scenario Playbooks
Deploy failure (PAAS_REMOTE_*, PAAS_HEALTH_CHECK_FAILED)
- Run `si paas deploy reconcile --app <app> --json` to identify target states.
- If a health-gated deploy failed, execute rollback:
  - `si paas rollback --app <app> --target <id|all> --apply --json`
- Validate recovered runtime:
  - `si paas target check --all --json`
  - `si paas logs --app <app> --target <id> --tail 200`
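To avoid typos under pressure, the playbook commands can be generated in order and reviewed before anything is run. The helper below is a hypothetical dry-run sketch: it only builds argument lists from the flags shown in this playbook and executes nothing. The app and target names in the usage lines are placeholders.

```python
import shlex


def deploy_failure_commands(app: str, rollback_target: str, log_target: str,
                            tail: int = 200) -> list[list[str]]:
    """Return the deploy-failure playbook commands, in order, as argv lists.

    The rollback entry applies only if a health-gated deploy failed;
    the responder decides whether to run it.
    """
    return [
        ["si", "paas", "deploy", "reconcile", "--app", app, "--json"],
        ["si", "paas", "rollback", "--app", app,
         "--target", rollback_target, "--apply", "--json"],
        ["si", "paas", "target", "check", "--all", "--json"],
        ["si", "paas", "logs", "--app", app,
         "--target", log_target, "--tail", str(tail)],
    ]


# Print the sequence for review; "billing-api" and "edge-1" are placeholders.
for argv in deploy_failure_commands("billing-api", "all", "edge-1"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists rather than one shell string means they can later be passed to a process runner without shell quoting issues.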
Blue/green cutover failure
- Inspect active/previous slots from deploy output fields.
- Confirm rollback status in deploy failure envelope (`rolled_back_targets`, `target_statuses`).
- Re-run deploy only after health root cause is fixed:
  - `si paas deploy bluegreen --app <app> --target <id> --apply --json`
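Confirming rollback status amounts to checking that every target that did not end up healthy appears among the rolled-back targets. The sketch below assumes `rolled_back_targets` is a list of target IDs and `target_statuses` is an object mapping target ID to a status string, with `"healthy"` as the good value. Only the two field names come from this runbook; their shapes and the status value are assumptions.

```python
import json


def rollback_gaps(envelope_json: str, healthy: str = "healthy") -> list[str]:
    """Return target IDs that are not healthy and were not rolled back.

    Assumed shapes (hypothetical): "rolled_back_targets" is a list of IDs,
    "target_statuses" is a dict of ID -> status string.
    """
    envelope = json.loads(envelope_json)
    rolled_back = set(envelope.get("rolled_back_targets", []))
    statuses = envelope.get("target_statuses", {})
    return sorted(
        target for target, status in statuses.items()
        if status != healthy and target not in rolled_back
    )
```

An empty result suggests every failed target was rolled back; any listed target needs manual attention before the blue/green deploy is re-run.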
Webhook trigger failure
- Validate webhook signature and mapping:
  - `si paas deploy webhook map list --json`
- Reproduce ingest with the recorded payload/signature if available.
- If there is an auth mismatch, rotate the secret and update the sender config.
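When reproducing ingest from a recorded payload, recomputing the signature locally helps separate an auth mismatch from a mapping problem. This runbook does not state the signing scheme, so the sketch below assumes HMAC-SHA256 over the raw request body with a hex digest, optionally prefixed with `sha256=`. Treat that scheme as an assumption and confirm it against the sender's configuration.

```python
import hashlib
import hmac


def signature_matches(secret: bytes, payload: bytes, received_sig: str) -> bool:
    """Recompute the digest and compare it to the received signature.

    Assumption: HMAC-SHA256 over the raw body, hex-encoded, with an
    optional "sha256=" prefix. The real scheme may differ.
    """
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    received = received_sig.strip().lower().removeprefix("sha256=")
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected.encode(), received.encode())
```

If the recorded signature fails against the current secret but the payload is intact, that points to the secret rotation step rather than the mapping.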
Vault trust/secret guardrail failure
- Validate trust recipients:
  - `si vault check --file <vault_file>`
- Re-establish trust:
  - `si vault trust accept ...`
- Retry the deploy without the unsafe override unless incident policy explicitly allows it.
Incident Artifact Checklist
- Timestamped command transcript.
- Relevant `events list` and `alert history` payload snapshots.
- Target log excerpts used for diagnosis.
- Final remediation action and verification command outputs.
- Follow-up ticket IDs for preventive hardening.
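A timestamped command transcript is easier to produce if commands are run through a small wrapper instead of being copied from a terminal afterwards. The helper below is a hypothetical sketch, not part of the `si` tooling: it runs an argument list, records a UTC timestamp, the exit code, and captured output, and appends them to an in-memory list. The demo uses the Python interpreter as a stand-in command.

```python
import shlex
import subprocess
import sys
from datetime import datetime, timezone


def run_logged(argv: list[str], transcript: list[str]) -> subprocess.CompletedProcess:
    """Run a command and append a timestamped record to the transcript."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    result = subprocess.run(argv, capture_output=True, text=True)
    transcript.append(f"[{started}] $ {shlex.join(argv)}")
    transcript.append(f"exit={result.returncode}")
    if result.stdout:
        transcript.append(result.stdout.rstrip())
    if result.stderr:
        transcript.append("stderr: " + result.stderr.rstrip())
    return result


# Demo with a stand-in command; replace with the real incident commands.
transcript: list[str] = []
run_logged([sys.executable, "-c", "print('ok')"], transcript)
print("\n".join(transcript))
```

Writing the joined transcript to a file at the end of the incident gives the first checklist item, and the captured outputs cover the verification item as well.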
