Operations Runbook
[Intention: Amanah]
This is the production operations runbook for BinKindly.
It exists to close the gap between:
- a pipeline that works
- a pipeline that can be operated truthfully under failure
Canonical operational anchors:
Operating principles
- Fail closed when certainty drops.
- Preserve last known good before attempting repair.
- Prefer bounded rollback over speculative forward repair.
- Every intervention must leave evidence.
- Manual rescue does not count as maturity.
Incident classes
1. Source outage
Definition:
- official upstream source becomes unavailable, malformed, blocked, or stale enough to prevent trustworthy refresh
Expected effect:
- refresh blocks or remains on last known good
2. Publish failure
Definition:
- candidate pack cannot be published because verifier, attestation, write, or promotion step fails
Expected effect:
- publish_blocked
- last known good remains active
3. Stale-pack threshold breach
Definition:
- active locality or baseline manifest is older than the declared freshness threshold for that scope
Expected effect:
- locality is treated as degraded operationally
- product truth may require caution or refusal
4. Fallback-path spike
Definition:
- sorter fallback, deterministic fallback, or rescue path usage rises materially above normal baseline
Expected effect:
- trust weakens even if publishes still succeed
5. Bad publish
Definition:
- a newly promoted pack is wrong, unsafe, malformed, incomplete, or operationally inconsistent despite passing earlier gates
Expected effect:
- immediate scope containment
- rollback to last known good
6. AI lane instability
Definition:
- Workers AI path becomes unreliable, malformed, too slow, or causes workflow instability
Expected effect:
- AI-assisted sorting may need to be disabled or narrowed without breaking truthful service
Severity model
SEV-1
Use when:
- active published truth is wrong for users
- rollback path is failing
- manifest resolution is broadly unavailable
Required response:
- contain immediately
- activate kill switch or rollback
- preserve evidence
- suspend risky forward change
SEV-2
Use when:
- refresh pipeline is degraded
- stale-pack threshold is breached for critical scope
- fallback/rescue path spikes materially
Required response:
- halt expansion
- investigate within same operating window
- decide contain vs repair
SEV-3
Use when:
- non-critical scope is degraded
- early warning thresholds are crossed
- there is no current user-facing wrong active truth
Required response:
- record
- investigate
- do not ignore repeated recurrence
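The "Use when" conditions above can be sketched as a small classifier. This is a minimal sketch, not BinKindly's implementation: the `IncidentSignals` shape and its field names are hypothetical, and resolving overlaps by letting the highest severity win is an assumption.

```typescript
// Hypothetical signal shape; field names are illustrative only.
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

interface IncidentSignals {
  activeTruthWrongForUsers: boolean;
  rollbackPathFailing: boolean;
  manifestResolutionBroadlyUnavailable: boolean;
  refreshPipelineDegraded: boolean;
  staleThresholdBreachedCriticalScope: boolean;
  fallbackOrRescueSpike: boolean;
  nonCriticalScopeDegraded: boolean;
  earlyWarningThresholdCrossed: boolean;
}

// Returns the highest severity whose conditions match, or null if none match.
// Assumption: when several levels match, the most severe one wins.
function classifySeverity(s: IncidentSignals): Severity | null {
  if (
    s.activeTruthWrongForUsers ||
    s.rollbackPathFailing ||
    s.manifestResolutionBroadlyUnavailable
  ) {
    return "SEV-1";
  }
  if (
    s.refreshPipelineDegraded ||
    s.staleThresholdBreachedCriticalScope ||
    s.fallbackOrRescueSpike
  ) {
    return "SEV-2";
  }
  if (s.nonCriticalScopeDegraded || s.earlyWarningThresholdCrossed) {
    return "SEV-3";
  }
  return null;
}
```

Because SEV-1 is checked first, a fallback spike that coincides with a failing rollback path is classified as SEV-1 rather than SEV-2.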
Alert matrix
| Alert | Trigger | Severity | First action |
|---|---|---|---|
| Publish blocked | repeated publish_blocked or verifier/publisher hard failure | SEV-2 | inspect failed gate and preserve active last-known-good |
| Stale pack | active manifest age exceeds threshold for supported scope | SEV-2 | pause confidence claims for affected scope and investigate refresh path |
| Fallback spike | fallback or rescue path exceeds normal operational baseline | SEV-2 | halt lane graduation and inspect locality/country policy state |
| Resolver health | manifest resolution or rules API path fails probes | SEV-1 or SEV-2 | validate publisher read path and rebind/rollback if needed |
| Workflow watchdog | orchestrator harvest/discovery/onboarding watchdog breach | SEV-2 | inspect active run, lock state, and downstream backlog |
| Kill-switch activation | AI or bad-pack kill switch enabled | SEV-2 | confirm containment, scope, and follow-up evidence |
| Manual trigger dedupe spike | abnormal burst or operator contention pattern | SEV-3 | inspect duplicate ingress cause and active lock ownership |
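The matrix above can also be kept as a typed lookup so that routing code and the runbook do not drift apart. This is a sketch under assumptions: the `AlertKind` identifiers are hypothetical names chosen for illustration; only the severities and first actions are taken from the table.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

// Hypothetical alert identifiers, one per table row.
type AlertKind =
  | "publish_blocked"
  | "stale_pack"
  | "fallback_spike"
  | "resolver_health"
  | "workflow_watchdog"
  | "kill_switch_activation"
  | "manual_trigger_dedupe_spike";

interface AlertRoute {
  severities: Severity[]; // more than one entry means the operator must choose
  firstAction: string;
}

const ALERT_MATRIX: Record<AlertKind, AlertRoute> = {
  publish_blocked: {
    severities: ["SEV-2"],
    firstAction: "inspect failed gate and preserve active last-known-good",
  },
  stale_pack: {
    severities: ["SEV-2"],
    firstAction: "pause confidence claims for affected scope and investigate refresh path",
  },
  fallback_spike: {
    severities: ["SEV-2"],
    firstAction: "halt lane graduation and inspect locality/country policy state",
  },
  resolver_health: {
    severities: ["SEV-1", "SEV-2"],
    firstAction: "validate publisher read path and rebind/rollback if needed",
  },
  workflow_watchdog: {
    severities: ["SEV-2"],
    firstAction: "inspect active run, lock state, and downstream backlog",
  },
  kill_switch_activation: {
    severities: ["SEV-2"],
    firstAction: "confirm containment, scope, and follow-up evidence",
  },
  manual_trigger_dedupe_spike: {
    severities: ["SEV-3"],
    firstAction: "inspect duplicate ingress cause and active lock ownership",
  },
};

// Looks up the candidate severities and first action for an alert.
function routeAlert(kind: AlertKind): AlertRoute {
  return ALERT_MATRIX[kind];
}
```

Typing the table as `Record<AlertKind, AlertRoute>` makes the compiler reject a matrix that is missing a row.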
Roles
On-call operator
Responsible for:
- triage
- containment
- evidence capture
- rollback or kill-switch activation
Incident owner
Responsible for:
- decision log
- repair path
- closure criteria
- post-incident record
Release owner
Responsible for:
- launch-day validation
- production config verification
- checksum archive and sign-off evidence
Core evidence requirements
Every meaningful incident or intervention must capture:
- timestamp
- scope
- trigger or alert source
- observed failure mode
- containment action
- rollback or kill-switch action if used
- verification result
- operator or owner
- follow-up requirement
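A completeness check for the fields listed above might look like the following. It is a minimal sketch: the `EvidenceRecord` type and its property names are hypothetical, and treating whitespace-only values as missing is an assumption.

```typescript
// Hypothetical record shape mirroring the evidence list above.
interface EvidenceRecord {
  timestamp?: string;
  scope?: string;
  triggerOrAlertSource?: string;
  observedFailureMode?: string;
  containmentAction?: string;
  rollbackOrKillSwitchAction?: string; // required only if such an action was used
  verificationResult?: string;
  operatorOrOwner?: string;
  followUpRequirement?: string;
}

// Fields that must always be present.
const ALWAYS_REQUIRED = [
  "timestamp",
  "scope",
  "triggerOrAlertSource",
  "observedFailureMode",
  "containmentAction",
  "verificationResult",
  "operatorOrOwner",
  "followUpRequirement",
] as const;

// Returns the names of missing fields; an empty array means the record is complete.
function missingEvidenceFields(
  record: EvidenceRecord,
  rollbackOrKillSwitchUsed: boolean
): string[] {
  const missing: string[] = ALWAYS_REQUIRED.filter((field) => !record[field]?.trim());
  if (rollbackOrKillSwitchUsed && !record.rollbackOrKillSwitchAction?.trim()) {
    missing.push("rollbackOrKillSwitchAction");
  }
  return missing;
}
```

An incident would be closable only when `missingEvidenceFields` returns an empty array.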
Runbook: source outage
Use when:
- official source becomes unavailable, blocked, empty, malformed, or stale
Steps:
- Confirm affected locality or country scope.
- Check whether last known good remains active.
- Confirm whether the issue is:
  - upstream outage
  - source format drift
  - admission/probeability drift
  - temporary rate-limit or anti-bot posture
- If active truth is still safe:
  - keep last known good active
  - do not force publish
- If supported-scope freshness promise is now at risk:
  - mark incident severity
  - suspend maturity or expansion decisions for that scope
- Record evidence and next review time.
Closure condition:
- upstream source recovers or scope is explicitly downgraded truthfully
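The two conditional steps in this runbook can be sketched as a function that returns the actions the operator should take. The `SourceOutageAssessment` type is hypothetical; the action strings are copied from the steps above. The runbook does not say what to do when active truth is not safe, so the sketch adds nothing for that case.

```typescript
// Hypothetical assessment produced by the operator during triage.
interface SourceOutageAssessment {
  activeTruthSafe: boolean;
  freshnessPromiseAtRisk: boolean;
}

// Maps the assessment to the runbook's actions, in runbook order.
function sourceOutageActions(a: SourceOutageAssessment): string[] {
  const actions: string[] = [];
  if (a.activeTruthSafe) {
    actions.push("keep last known good active", "do not force publish");
  }
  if (a.freshnessPromiseAtRisk) {
    actions.push(
      "mark incident severity",
      "suspend maturity or expansion decisions for that scope"
    );
  }
  // Evidence is recorded regardless of the branch taken.
  actions.push("record evidence and next review time");
  return actions;
}
```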
Runbook: publish failure
Use when:
- verifier or publisher blocks promotion
Steps:
- Identify failing stage:
  - verifier gate
  - attestation
  - R2 write
  - manifest promotion
  - canary readback
- Confirm active manifest still points to valid last known good.
- Do not bypass publish gates.
- If failure is isolated and active truth remains safe:
  - leave active manifest unchanged
  - investigate cause
- If repeated failures suggest unstable lane behavior:
  - pause graduation for affected scope
  - consider AI kill switch or scope narrowing
- Capture audit evidence and operator notes.
Closure condition:
- publish path passes again without weakening gates
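Identifying the failing stage can be sketched as a scan over per-stage results. The stage identifiers and the `StageResults` shape are hypothetical, and the sketch assumes the stages run in the order listed in the steps above.

```typescript
// Assumed pipeline order, taken from the list in the runbook steps.
const PUBLISH_STAGES = [
  "verifier_gate",
  "attestation",
  "r2_write",
  "manifest_promotion",
  "canary_readback",
] as const;

type PublishStage = (typeof PUBLISH_STAGES)[number];

// true = passed, false = failed, absent = stage did not run.
type StageResults = Partial<Record<PublishStage, boolean>>;

// Returns the earliest stage that reported failure, or null if none failed.
function firstFailingStage(results: StageResults): PublishStage | null {
  for (const stage of PUBLISH_STAGES) {
    if (results[stage] === false) {
      return stage;
    }
  }
  return null;
}
```

Stages that never ran are skipped rather than counted as failures, so a run that stopped at `r2_write` reports `r2_write`, not `manifest_promotion`.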
Runbook: stale-pack breach
Use when:
- active manifest age or source freshness exceeds declared threshold
Steps:
- Confirm affected scope and threshold breached.
- Check whether refresh is blocked, delayed, or silently not due.
- Determine whether the stale state is:
  - localized
  - country-wide
  - pipeline-wide
- If freshness claim is no longer defensible:
  - treat scope as degraded operationally
  - stop any expansion or graduation decisions relying on that scope
- Restore refresh path or adjust public truth claims before closure.
Closure condition:
- fresh publish succeeds and active scope returns within threshold
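The breach condition reduces to comparing manifest age with the declared threshold for the scope. A minimal sketch, assuming a hypothetical `FreshnessCheck` shape and a threshold expressed in hours; the runbook does not specify units or field names.

```typescript
// Hypothetical input; thresholdHours is the declared freshness threshold for the scope.
interface FreshnessCheck {
  scopeId: string;
  publishedAt: Date;
  thresholdHours: number;
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Age of the active manifest in hours at the given instant.
function manifestAgeHours(check: FreshnessCheck, now: Date): number {
  return (now.getTime() - check.publishedAt.getTime()) / MS_PER_HOUR;
}

// A scope is stale only when the age strictly exceeds the threshold.
function isStale(check: FreshnessCheck, now: Date): boolean {
  return manifestAgeHours(check, now) > check.thresholdHours;
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to replay against incident timestamps.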
Runbook: fallback-path spike
Use when:
- fallback or rescue path rate rises materially
Steps:
- Confirm whether spike is:
  - locality-specific
  - country-lane specific
  - model-wide
- Check runtime policy state:
  - sorter_mode
  - risk_state
  - allowed_error_codes
  - country-lane graduation counters
- Stop treating affected scope as mature.
- If AI instability is involved:
  - narrow scope or activate AI kill switch
- If source quality drift is involved:
  - treat as source-quality incident, not model win
- Preserve evidence for lane-governance review.
Closure condition:
- scheduled clean primary success returns to normal thresholds
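"Rises materially" is not quantified in this runbook. The sketch below assumes one possible reading: the fallback rate in the current window exceeds a multiple of the normal baseline rate. The `WindowCounts` shape and the default multiplier of 2 are assumptions for illustration, not declared thresholds.

```typescript
// Hypothetical counters for one observation window.
interface WindowCounts {
  fallbackRuns: number;
  totalRuns: number;
}

// Share of runs that used a fallback or rescue path; 0 when there were no runs.
function fallbackRate(w: WindowCounts): number {
  return w.totalRuns === 0 ? 0 : w.fallbackRuns / w.totalRuns;
}

// Assumed rule: a spike is a rate above baselineRate * multiplier.
function isFallbackSpike(
  w: WindowCounts,
  baselineRate: number,
  multiplier: number = 2
): boolean {
  return fallbackRate(w) > baselineRate * multiplier;
}
```

A real threshold would likely also require a minimum sample size so that a single fallback in a quiet window does not raise an alert.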
Runbook: AI kill switch
Purpose:
- contain model-lane instability without breaking truthful service
Use when:
- malformed AI output spikes
- AI invocation instability threatens workflow completion
- fallback/rescue path becomes dominant
- newly widened lane causes publish-path regression
Expected effect:
- AI-assisted or widened AI lanes are disabled or narrowed
- service remains fail-closed or last-known-good
Actions:
- Freeze graduation decisions for affected scope.
- Disable or narrow the relevant AI lane.
- Confirm no unsupported silent degrade has been introduced.
- Run targeted validation on a representative locality set.
- Record:
  - reason
  - scope
  - activation time
  - validation result
Closure condition:
- lane is re-enabled only after clean scheduled evidence and rollback confidence
Canonical AI kill-switch evidence path:
- /Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/02-ai-kill-switch-activation/
- generate template record with
npm run ops:ai-kill-switch-record -- --scope=locality --scope-id=<locality_id> --validation-localities=<comma-separated-localities>
Runbook: bad-pack emergency kill switch
Purpose:
- contain a wrong active publish fast
Use when:
- active pack is wrong, unsafe, or not defensible
Primary actions:
- Roll back active manifest to last known good pack.
- If needed, disable affected locality overlay and serve baseline-only where truthful.
- Confirm pack checksum and manifest integrity.
- Probe manifest resolution and pack read path after containment.
- Preserve incident evidence before attempting forward repair.
Closure condition:
- active user-facing truth is restored to a safe state within the rollback objective
Canonical bad-pack emergency evidence path:
- /Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/03-bad-pack-kill-switch-manifest-rebind/
- generate template record with
npm run ops:bad-pack-rebind-record -- --scope=locality --scope-id=<locality_id>
Rollback objective
Rollback target:
- restore a safe active manifest state within 15 minutes of confirmed bad publish
- confirm manifest and pack integrity within 30 minutes
- complete operator evidence record within the same incident window
This is the current operational objective. It is not considered proven until drill evidence exists.
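Drill or incident timestamps can be scored against these targets with a small helper. This is a sketch under assumptions: the `RollbackTimeline` shape is hypothetical, and both the 15-minute and the 30-minute targets are measured from the moment the bad publish was confirmed, which the runbook states only for the first target.

```typescript
const RESTORE_OBJECTIVE_MINUTES = 15;
const INTEGRITY_OBJECTIVE_MINUTES = 30;

// Hypothetical timestamps captured during a drill or incident.
interface RollbackTimeline {
  badPublishConfirmedAt: Date;
  safeManifestRestoredAt: Date;
  integrityConfirmedAt: Date;
}

interface RollbackOutcome {
  restoreMinutes: number;
  integrityMinutes: number;
  restoreObjectiveMet: boolean;
  integrityObjectiveMet: boolean;
}

function minutesBetween(start: Date, end: Date): number {
  return (end.getTime() - start.getTime()) / 60000;
}

// Assumption: both clocks start at badPublishConfirmedAt.
function evaluateRollback(t: RollbackTimeline): RollbackOutcome {
  const restoreMinutes = minutesBetween(t.badPublishConfirmedAt, t.safeManifestRestoredAt);
  const integrityMinutes = minutesBetween(t.badPublishConfirmedAt, t.integrityConfirmedAt);
  return {
    restoreMinutes,
    integrityMinutes,
    restoreObjectiveMet: restoreMinutes <= RESTORE_OBJECTIVE_MINUTES,
    integrityObjectiveMet: integrityMinutes <= INTEGRITY_OBJECTIVE_MINUTES,
  };
}
```

Recording the measured minutes alongside the pass/fail flags gives the drill record something concrete to compare across runs.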
Canonical rollback evidence path:
- /Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/01-rollback-drill/
- generate template record with
npm run ops:rollback-record -- --scope=locality --scope-id=<locality_id>
Rollback checklist
| Step | Action | Evidence |
|---|---|---|
| 1 | Identify affected locality or baseline scope | incident record |
| 2 | Confirm current active manifest and last known good target | manifest probe |
| 3 | Rebind or restore active manifest to last known good | publisher response and audit event |
| 4 | Confirm manifest checksum and pack checksum alignment | readback verification |
| 5 | Probe resolver and rules API for affected scope | post-rollback probe log |
| 6 | Capture timestamps and outcome against rollback objective | drill or incident record |
Launch-day production validation
Launch day is not complete until all pass:
- final release build checksum archived
- production config verified:
  - API endpoints
  - auth keys
  - environment flags
  - kill-switch defaults
- first publish job completes and is validated
- live app fetch resolves manifest and pack correctly
- sign-off captured:
  - product
  - engineering
  - QA
Launch-day validation sequence
- Verify production config snapshot.
- Trigger or observe first production publish.
- Probe:
  - active manifest
  - active pack
  - resolver endpoint
  - rules endpoint where applicable
- Verify live app fetch on target build.
- Archive evidence.
Canonical launch-day validation evidence path:
- /Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/04-launch-day-production-validation/
- generate template record with
npm run ops:launch-day-record -- --scope=locality --scope-id=<locality_id> --app-build=<build-id>
No-go rules
Do not proceed with launch or expansion if any are true:
- rollback objective is undocumented or unproven
- kill switches are undefined or untested
- stale-pack alerting is absent
- publish failures are not routed visibly
- active supported scope cannot be validated in production
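The no-go rules above lend themselves to a gate that returns every blocking reason instead of a single boolean. A minimal sketch: the `LaunchReadiness` fields are hypothetical, and the reason strings are copied from the rules.

```typescript
// Hypothetical readiness snapshot filled in by the release owner.
interface LaunchReadiness {
  rollbackObjectiveDocumented: boolean;
  rollbackObjectiveProven: boolean;
  killSwitchesDefined: boolean;
  killSwitchesTested: boolean;
  stalePackAlertingPresent: boolean;
  publishFailuresRoutedVisibly: boolean;
  activeScopeValidatedInProduction: boolean;
}

// Returns every no-go rule that currently applies; empty means no rule blocks.
function noGoReasons(r: LaunchReadiness): string[] {
  const reasons: string[] = [];
  if (!r.rollbackObjectiveDocumented || !r.rollbackObjectiveProven) {
    reasons.push("rollback objective is undocumented or unproven");
  }
  if (!r.killSwitchesDefined || !r.killSwitchesTested) {
    reasons.push("kill switches are undefined or untested");
  }
  if (!r.stalePackAlertingPresent) {
    reasons.push("stale-pack alerting is absent");
  }
  if (!r.publishFailuresRoutedVisibly) {
    reasons.push("publish failures are not routed visibly");
  }
  if (!r.activeScopeValidatedInProduction) {
    reasons.push("active supported scope cannot be validated in production");
  }
  return reasons;
}

// Launch or expansion may proceed only when no rule applies.
function mayProceed(r: LaunchReadiness): boolean {
  return noGoReasons(r).length === 0;
}
```

Returning the full list of reasons keeps the gate fail-closed while still telling the operator exactly what is missing.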
Whisper-audit wording
[Intention: Amanah] Publish remains paused for this scope until last-known-good safety, rollback evidence, and truth checks are restored.