Operations Runbook

[Intention: Amanah]

This is the production operations runbook for BinKindly.

It exists to close the gap between:

a pipeline that works
a pipeline that can be operated truthfully under failure

Canonical operational anchors:

Operating principles

Fail closed when certainty drops.
Preserve last known good before attempting repair.
Prefer bounded rollback over speculative forward repair.
Every intervention must leave evidence.
Manual rescue does not count as maturity.

Incident classes

1. Source outage

Definition:

official upstream source becomes unavailable, malformed, blocked, or stale enough to prevent trustworthy refresh

Expected effect:

refresh blocks or remains on last known good

2. Publish failure

Definition:

candidate pack cannot be published because verifier, attestation, write, or promotion step fails

Expected effect:

publish_blocked
last known good remains active

3. Stale-pack threshold breach

Definition:

active locality or baseline manifest is older than the declared freshness threshold for that scope

Expected effect:

locality is treated as degraded operationally
product truth may require caution or refusal

4. Fallback-path spike

Definition:

sorter fallback, deterministic fallback, or rescue path usage rises materially above normal baseline

Expected effect:

trust weakens even if publishes still succeed

5. Bad publish

Definition:

a newly promoted pack is wrong, unsafe, malformed, incomplete, or operationally inconsistent despite passing earlier gates

Expected effect:

immediate scope containment
rollback to last known good

6. AI lane instability

Definition:

Workers AI path becomes unreliable, malformed, too slow, or causes workflow instability

Expected effect:

AI-assisted sorting may need to be disabled or narrowed without breaking truthful service

Severity model

SEV-1

Use when:

active published truth is wrong for users
rollback path is failing
manifest resolution is broadly unavailable

Required response:

contain immediately
activate kill switch or rollback
preserve evidence
suspend risky forward change

SEV-2

Use when:

refresh pipeline is degraded
stale-pack threshold is breached for critical scope
fallback/rescue path spikes materially

Required response:

halt expansion
investigate within the same operating window
decide contain vs repair

SEV-3

Use when:

non-critical scope is degraded
early warning thresholds are crossed
no wrong active truth is currently user-facing

Required response:

record
investigate
do not ignore repeated recurrence
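
The severity conditions above can be sketched as a small classifier. This is an illustration only: the signal field names are hypothetical and not part of any BinKindly interface, and SEV-3 is assumed to be the floor for anything worth recording.

```typescript
// Hypothetical incident signals; field names are illustrative assumptions.
interface IncidentSignals {
  wrongActiveTruth: boolean;           // active published truth is wrong for users
  rollbackFailing: boolean;            // rollback path is failing
  resolverBroadlyUnavailable: boolean; // manifest resolution is broadly unavailable
  refreshDegraded: boolean;            // refresh pipeline is degraded
  staleCriticalScope: boolean;         // stale-pack threshold breached for critical scope
  fallbackSpike: boolean;              // fallback/rescue path spikes materially
}

type Severity = "SEV-1" | "SEV-2" | "SEV-3";

// The highest matching severity wins; SEV-1 conditions are checked first.
function classifySeverity(s: IncidentSignals): Severity {
  if (s.wrongActiveTruth || s.rollbackFailing || s.resolverBroadlyUnavailable) {
    return "SEV-1";
  }
  if (s.refreshDegraded || s.staleCriticalScope || s.fallbackSpike) {
    return "SEV-2";
  }
  // Assumption: everything else that reaches triage is recorded as SEV-3.
  return "SEV-3";
}
```

A classifier like this could only suggest a starting severity; the on-call operator still decides.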

Alert matrix

Alert	Trigger	Severity	First action
Publish blocked	repeated `publish_blocked` or verifier/publisher hard failure	SEV-2	inspect failed gate and preserve active last-known-good
Stale pack	active manifest age exceeds threshold for supported scope	SEV-2	pause confidence claims for affected scope and investigate refresh path
Fallback spike	fallback or rescue path exceeds normal operational baseline	SEV-2	halt lane graduation and inspect locality/country policy state
Resolver health	manifest resolution or rules API path fails probes	SEV-1 if resolution is broadly unavailable, otherwise SEV-2	validate publisher read path and rebind/rollback if needed
Workflow watchdog	orchestrator harvest/discovery/onboarding watchdog breach	SEV-2	inspect active run, lock state, and downstream backlog
Kill-switch activation	AI or bad-pack kill switch enabled	SEV-2	confirm containment, scope, and follow-up evidence
Manual trigger dedupe spike	abnormal burst or operator contention pattern	SEV-3	inspect duplicate ingress cause and active lock ownership

Roles

On-call operator

Responsible for:

triage
containment
evidence capture
rollback or kill-switch activation

Incident owner

Responsible for:

decision log
repair path
closure criteria
post-incident record

Release owner

Responsible for:

launch-day validation
production config verification
checksum archive and sign-off evidence

Core evidence requirements

Every meaningful incident or intervention must capture:

timestamp
scope
trigger or alert source
observed failure mode
containment action
rollback or kill-switch action if used
verification result
operator or owner
follow-up requirement
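
One way to enforce the list above is a typed record plus a completeness check. This is a minimal sketch: the field names mirror the list, but the shape and the `missingEvidence` helper are assumptions, not an existing BinKindly schema.

```typescript
// Assumed record shape; field names follow the evidence list above.
interface EvidenceRecord {
  timestamp: string;                   // e.g. ISO 8601
  scope: string;
  trigger: string;                     // trigger or alert source
  observedFailureMode: string;
  containmentAction: string;
  rollbackOrKillSwitchAction?: string; // optional: only if one was used
  verificationResult: string;
  operator: string;                    // operator or owner
  followUp: string;
}

const REQUIRED_FIELDS: (keyof EvidenceRecord)[] = [
  "timestamp",
  "scope",
  "trigger",
  "observedFailureMode",
  "containmentAction",
  "verificationResult",
  "operator",
  "followUp",
];

// Returns the names of required fields that are missing or blank.
function missingEvidence(record: Partial<EvidenceRecord>): string[] {
  return REQUIRED_FIELDS.filter((key) => {
    const value = record[key];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

An empty result means the record is complete enough to archive. Any listed field should block closure.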

Runbook: source outage

Use when:

official source becomes unavailable, blocked, empty, malformed, or stale

Steps:

Confirm affected locality or country scope.
Check whether last known good remains active.
Confirm whether the issue is:
- upstream outage
- source format drift
- admission/probeability drift
- temporary rate-limit or anti-bot posture
If active truth is still safe:
- keep last known good active
- do not force publish
If supported-scope freshness promise is now at risk:
- assign incident severity per the severity model
- suspend maturity or expansion decisions for that scope
Record evidence and next review time.

Closure condition:

upstream source recovers or scope is explicitly downgraded truthfully

Runbook: publish failure

Use when:

verifier or publisher blocks promotion

Steps:

Identify failing stage:
- verifier gate
- attestation
- R2 write
- manifest promotion
- canary readback
Confirm active manifest still points to valid last known good.
Do not bypass publish gates.
If failure is isolated and active truth remains safe:
- leave active manifest unchanged
- investigate cause
If repeated failures suggest unstable lane behavior:
- pause graduation for affected scope
- consider AI kill switch or scope narrowing
Capture audit evidence and operator notes.

Closure condition:

publish path passes again without weakening gates
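
Identifying the failing stage can be sketched as a walk over the stages in order. The stage identifiers and the `PublishOutcome` shape below are hypothetical, and the ordering is assumed from the list above. Only the `publish_blocked` status comes from this runbook.

```typescript
// Hypothetical stage identifiers, in the order listed in the steps above.
type Stage =
  | "verifier_gate"
  | "attestation"
  | "r2_write"
  | "manifest_promotion"
  | "canary_readback";

const STAGE_ORDER: Stage[] = [
  "verifier_gate",
  "attestation",
  "r2_write",
  "manifest_promotion",
  "canary_readback",
];

type PublishOutcome =
  | { status: "published" }
  | { status: "publish_blocked"; failedStage: Stage };

// `results` maps each stage to pass/fail. A missing entry counts as a
// failure, so an incomplete run can never be reported as published.
function firstFailingStage(results: Partial<Record<Stage, boolean>>): PublishOutcome {
  for (const stage of STAGE_ORDER) {
    if (results[stage] !== true) {
      return { status: "publish_blocked", failedStage: stage };
    }
  }
  return { status: "published" };
}
```

The function only reports. It never touches the active manifest, which matches the rule that a blocked publish leaves last known good in place.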

Runbook: stale-pack breach

Use when:

active manifest age or source freshness exceeds declared threshold

Steps:

Confirm affected scope and threshold breached.
Check whether refresh is blocked, delayed, or silently skipped as not due.
Determine whether the stale state is:
- localized
- country-wide
- pipeline-wide
If freshness claim is no longer defensible:
- treat scope as degraded operationally
- stop any expansion or graduation decisions relying on that scope
Restore refresh path or adjust public truth claims before closure.

Closure condition:

fresh publish succeeds and active scope returns within threshold
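
The threshold check itself is simple arithmetic on manifest age. This is a minimal sketch, assuming the manifest exposes a publish timestamp and that thresholds are expressed in hours. Both are assumptions, since this runbook does not state the actual values or field names.

```typescript
// Returns true when the active manifest is older than the declared threshold.
// An unparseable timestamp is treated as stale (fail closed).
function isStale(publishedAtIso: string, thresholdHours: number, now: Date): boolean {
  const publishedAt = Date.parse(publishedAtIso);
  if (Number.isNaN(publishedAt)) {
    return true;
  }
  const ageHours = (now.getTime() - publishedAt) / 3600000; // ms per hour
  return ageHours > thresholdHours;
}
```

For example, a manifest published 48 hours before `now` breaches a 24-hour threshold but not a 72-hour one.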

Runbook: fallback-path spike

Use when:

fallback or rescue path rate rises materially

Steps:

Confirm whether spike is:
- locality-specific
- country-lane specific
- model-wide
Check runtime policy state:
- sorter_mode
- risk_state
- allowed_error_codes
- country-lane graduation counters
Stop treating affected scope as mature.
If AI instability is involved:
- narrow scope or activate AI kill switch
If source quality drift is involved:
- treat as source-quality incident, not model win
Preserve evidence for lane-governance review.

Closure condition:

clean primary-path success rate on scheduled runs returns to normal thresholds
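
"Materially above baseline" needs a concrete test before it can drive an alert. The sketch below shows one possible form. The `factor` and `minRuns` defaults are placeholder assumptions, not documented BinKindly thresholds, and `LaneWindow` is a hypothetical shape.

```typescript
// Hypothetical counters for one observation window of a lane or locality.
interface LaneWindow {
  totalRuns: number;
  fallbackRuns: number; // runs that used a fallback or rescue path
}

// Flags a spike when the fallback rate exceeds baselineRate * factor.
// `factor` and `minRuns` are placeholder assumptions to be tuned per lane.
function isFallbackSpike(
  current: LaneWindow,
  baselineRate: number,
  factor: number = 2,
  minRuns: number = 20,
): boolean {
  if (current.totalRuns < minRuns) {
    return false; // too few runs to judge a rate
  }
  const rate = current.fallbackRuns / current.totalRuns;
  return rate > baselineRate * factor;
}
```

With a baseline of 0.1, a window of 30 fallbacks in 100 runs is flagged, while 15 in 100 is not.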

Runbook: AI kill switch

Purpose:

contain model-lane instability without breaking truthful service

Use when:

malformed AI output spikes
AI invocation instability threatens workflow completion
fallback/rescue path becomes dominant
newly widened lane causes publish-path regression

Expected effect:

AI-assisted or widened AI lanes are disabled or narrowed
service remains fail-closed or last-known-good

Actions:

Freeze graduation decisions for affected scope.
Disable or narrow the relevant AI lane.
Confirm no unsupported silent degradation has been introduced.
Run targeted validation on a representative locality set.
Record:
- reason
- scope
- activation time
- validation result

Closure condition:

lane is re-enabled only after clean scheduled evidence and rollback confidence

Canonical AI kill-switch evidence path:

/Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/02-ai-kill-switch-activation/
Generate a template record with: npm run ops:ai-kill-switch-record -- --scope=locality --scope-id=<locality_id> --validation-localities=<comma-separated-localities>

Runbook: bad-pack emergency kill switch

Purpose:

contain a wrong active publish fast

Use when:

active pack is wrong, unsafe, or not defensible

Primary actions:

Roll back active manifest to last known good pack.
If needed, disable affected locality overlay and serve baseline-only where truthful.
Confirm pack checksum and manifest integrity.
Probe manifest resolution and pack read path after containment.
Preserve incident evidence before attempting forward repair.

Closure condition:

active user-facing truth is restored to a safe state within the rollback objective

Canonical bad-pack emergency evidence path:

/Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/03-bad-pack-kill-switch-manifest-rebind/
Generate a template record with: npm run ops:bad-pack-rebind-record -- --scope=locality --scope-id=<locality_id>

Rollback objective

Rollback target:

restore a safe active manifest state within 15 minutes of confirmed bad publish
confirm manifest and pack integrity within 30 minutes
complete operator evidence record within the same incident window

This is the current operational objective. It is not considered proven until drill evidence exists.
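
A drill or incident record can be checked against these targets mechanically. This is a minimal sketch. The `RollbackTimeline` fields are hypothetical, and the 30-minute integrity window is assumed to run from the same confirmation time as the 15-minute restore window, which the objective does not state explicitly.

```typescript
// Hypothetical timeline captured in a drill or incident record.
interface RollbackTimeline {
  confirmedBadPublishAt: Date;
  safeManifestRestoredAt: Date;
  integrityConfirmedAt: Date;
}

const RESTORE_LIMIT_MIN = 15;   // from the rollback target above
const INTEGRITY_LIMIT_MIN = 30; // assumed to be measured from confirmation

function minutesBetween(start: Date, end: Date): number {
  return (end.getTime() - start.getTime()) / 60000;
}

// Reports each target separately so a partial miss stays visible.
function checkRollbackObjective(t: RollbackTimeline): { restoreOk: boolean; integrityOk: boolean } {
  return {
    restoreOk: minutesBetween(t.confirmedBadPublishAt, t.safeManifestRestoredAt) <= RESTORE_LIMIT_MIN,
    integrityOk: minutesBetween(t.confirmedBadPublishAt, t.integrityConfirmedAt) <= INTEGRITY_LIMIT_MIN,
  };
}
```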

Canonical rollback evidence path:

/Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/01-rollback-drill/
Generate a template record with: npm run ops:rollback-record -- --scope=locality --scope-id=<locality_id>

Rollback checklist

Step	Action	Evidence
1	Identify affected locality or baseline scope	incident record
2	Confirm current active manifest and last known good target	manifest probe
3	Rebind or restore active manifest to last known good	publisher response and audit event
4	Confirm manifest checksum and pack checksum alignment	readback verification
5	Probe resolver and rules API for affected scope	post-rollback probe log
6	Capture timestamps and outcome against rollback objective	drill or incident record
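
Steps 2 to 4 reduce to one alignment check on readback data. The sketch below assumes hypothetical `ManifestReadback` and `PackReadback` shapes and hex-encoded checksums compared case-insensitively. The real manifest and pack formats are not described in this runbook.

```typescript
// Hypothetical readback shapes; real manifest and pack fields may differ.
interface ManifestReadback {
  packId: string;
  packChecksum: string; // checksum the manifest declares for its pack
}

interface PackReadback {
  packId: string;
  checksum: string; // checksum computed from the pack bytes actually read
}

// True only when the active manifest points at the last known good pack
// and the pack read back matches the checksum the manifest declares.
function rollbackAligned(
  manifest: ManifestReadback,
  pack: PackReadback,
  lastKnownGoodPackId: string,
): boolean {
  const pointsAtLastKnownGood = manifest.packId === lastKnownGoodPackId;
  const samePack = pack.packId === manifest.packId;
  const checksumMatch = manifest.packChecksum.toLowerCase() === pack.checksum.toLowerCase();
  return pointsAtLastKnownGood && samePack && checksumMatch;
}
```

A `false` result means the rollback is not yet verified. Treat containment as incomplete until all three conditions hold.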

Launch-day production validation

Launch day is not complete until all pass:

final release build checksum archived
production config verified:
- API endpoints
- auth keys
- environment flags
- kill-switch defaults
first publish job completes and is validated
live app fetch resolves manifest and pack correctly
sign-off captured:
- product
- engineering
- QA

Launch-day validation sequence

Verify production config snapshot.
Trigger or observe first production publish.
Probe:
- active manifest
- active pack
- resolver endpoint
- rules endpoint where applicable
Verify live app fetch on target build.
Archive evidence.

Canonical launch-day validation evidence path:

/Users/mturous/Muzinezz/Projects/Apple/BinKindly/reports/operations/closure-evidence/04-launch-day-production-validation/
Generate a template record with: npm run ops:launch-day-record -- --scope=locality --scope-id=<locality_id> --app-build=<build-id>

No-go rules

Do not proceed with launch or expansion if any are true:

rollback objective is undocumented or unproven
kill switches are undefined or untested
stale-pack alerting is absent
publish failures are not routed visibly
active supported scope cannot be validated in production
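
The no-go rules can be expressed as a gate that returns every blocking reason rather than a single yes or no. This is a sketch under assumptions: the `LaunchReadiness` flags are hypothetical names, and how each flag gets its value (drill evidence, alert config review) is outside this example.

```typescript
// Hypothetical readiness flags, one per no-go rule above.
interface LaunchReadiness {
  rollbackObjectiveProven: boolean;
  killSwitchesTested: boolean;
  stalePackAlertingPresent: boolean;
  publishFailuresRoutedVisibly: boolean;
  activeScopeValidatedInProduction: boolean;
}

// Returns every rule that currently blocks launch or expansion.
function noGoReasons(r: LaunchReadiness): string[] {
  const reasons: string[] = [];
  if (!r.rollbackObjectiveProven) reasons.push("rollback objective is undocumented or unproven");
  if (!r.killSwitchesTested) reasons.push("kill switches are undefined or untested");
  if (!r.stalePackAlertingPresent) reasons.push("stale-pack alerting is absent");
  if (!r.publishFailuresRoutedVisibly) reasons.push("publish failures are not routed visibly");
  if (!r.activeScopeValidatedInProduction) reasons.push("active supported scope cannot be validated in production");
  return reasons;
}
```

Proceed only when the returned list is empty. A non-empty list doubles as the wording for the decision log.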

Whisper-audit wording

[Intention: Amanah] Publish remains paused for this scope until last-known-good safety, rollback evidence, and truth checks are restored.