BinKindly Docs

Plain language docs for humans and AI systems.

Operations Runbook

[Intention: Amanah]

This is the production operations runbook for BinKindly.

It exists to close the gap between:

Canonical operational anchors:

Operating principles

  1. Fail closed when certainty drops.
  2. Preserve last known good before attempting repair.
  3. Prefer bounded rollback over speculative forward repair.
  4. Every intervention must leave evidence.
  5. Manual rescue does not count as maturity.
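
Principles 1 and 2 can be illustrated with a minimal sketch. It assumes a pack is identified by a string id and that verification yields True, False, or None for "unknown". The function name and signature are hypothetical and are not part of BinKindly.

```python
from typing import Optional


def select_active_pack(candidate: str, last_known_good: str,
                       verified: Optional[bool]) -> str:
    """Return the pack id that should be active.

    Fail closed: anything other than an explicit True verification
    (False, or None for "unknown") keeps the last known good pack.
    """
    if verified is True:
        return candidate
    return last_known_good
```

An unknown verification result is treated the same as a failed one, so uncertainty never promotes a new pack.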

Incident classes

1. Source outage

Definition:

Expected effect:

2. Publish failure

Definition:

Expected effect:

3. Stale-pack threshold breach

Definition:

Expected effect:

4. Fallback-path spike

Definition:

Expected effect:

5. Bad publish

Definition:

Expected effect:

6. AI lane instability

Definition:

Expected effect:

Severity model

SEV-1

Use when:

Required response:

SEV-2

Use when:

Required response:

SEV-3

Use when:

Required response:

Alert matrix

Alert | Trigger | Severity | First action
Publish blocked | repeated publish_blocked or verifier/publisher hard failure | SEV-2 | inspect failed gate and preserve active last-known-good
Stale pack | active manifest age exceeds threshold for supported scope | SEV-2 | pause confidence claims for affected scope and investigate refresh path
Fallback spike | fallback or rescue path exceeds normal operational baseline | SEV-2 | halt lane graduation and inspect locality/country policy state
Resolver health | manifest resolution or rules API path fails probes | SEV-1 or SEV-2 | validate publisher read path and rebind/rollback if needed
Workflow watchdog | orchestrator harvest/discovery/onboarding watchdog breach | SEV-2 | inspect active run, lock state, and downstream backlog
Kill-switch activation | AI or bad-pack kill switch enabled | SEV-2 | confirm containment, scope, and follow-up evidence
Manual trigger dedupe spike | abnormal burst or operator contention pattern | SEV-3 | inspect duplicate ingress cause and active lock ownership
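
The alert matrix above could be held as a lookup table so that triage tooling and the runbook stay in step. This is a sketch only: the alert keys and the function name are hypothetical identifiers, while the severities and first actions are copied from the matrix.

```python
from typing import Dict, Tuple

# alert key -> (possible severities, first action).
# Keys are hypothetical; values mirror the alert matrix.
ALERT_MATRIX: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "publish_blocked": (("SEV-2",), "inspect failed gate and preserve active last-known-good"),
    "stale_pack": (("SEV-2",), "pause confidence claims for affected scope and investigate refresh path"),
    "fallback_spike": (("SEV-2",), "halt lane graduation and inspect locality/country policy state"),
    "resolver_health": (("SEV-1", "SEV-2"), "validate publisher read path and rebind/rollback if needed"),
    "workflow_watchdog": (("SEV-2",), "inspect active run, lock state, and downstream backlog"),
    "kill_switch_activation": (("SEV-2",), "confirm containment, scope, and follow-up evidence"),
    "manual_trigger_dedupe_spike": (("SEV-3",), "inspect duplicate ingress cause and active lock ownership"),
}


def first_response(alert: str) -> Tuple[Tuple[str, ...], str]:
    """Return (severities, first action); raises KeyError for unknown alerts."""
    return ALERT_MATRIX[alert]
```

Resolver health carries two possible severities, so the operator still makes the final call between SEV-1 and SEV-2.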

Roles

On-call operator

Responsible for:

Incident owner

Responsible for:

Release owner

Responsible for:

Core evidence requirements

Every meaningful incident or intervention must capture:

  1. timestamp
  2. scope
  3. trigger or alert source
  4. observed failure mode
  5. containment action
  6. rollback or kill-switch action if used
  7. verification result
  8. operator or owner
  9. follow-up requirement
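
One way to make the nine evidence items hard to skip is a small record type with a completeness check. The sketch below assumes string fields and treats item 6 as optional, because the list says "if used". The class and field names are illustrative, not an existing BinKindly schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvidenceRecord:
    timestamp: str = ""
    scope: str = ""
    trigger: str = ""
    failure_mode: str = ""
    containment_action: str = ""
    rollback_or_kill_switch: Optional[str] = None  # only when one was used
    verification_result: str = ""
    owner: str = ""
    follow_up: str = ""


def missing_fields(record: EvidenceRecord) -> List[str]:
    """List required fields that are still empty, in declaration order."""
    optional = {"rollback_or_kill_switch"}
    return [f.name for f in fields(record)
            if f.name not in optional and not getattr(record, f.name)]
```

An incident would be closable only when `missing_fields` returns an empty list.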

Runbook: source outage

Use when:

Steps:

  1. Confirm affected locality or country scope.
  2. Check whether last known good remains active.
  3. Confirm whether the issue is:
    • upstream outage
    • source format drift
    • admission/probeability drift
    • temporary rate-limit or anti-bot posture
  4. If active truth is still safe:
    • keep last known good active
    • do not force publish
  5. If supported-scope freshness promise is now at risk:
    • mark incident severity
    • suspend maturity or expansion decisions for that scope
  6. Record evidence and next review time.
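
Steps 4 to 6 above reduce to a small decision function. The sketch assumes the operator has already answered two yes/no questions. The function name and the action strings are illustrative only.

```python
from typing import List


def source_outage_actions(active_truth_safe: bool,
                          freshness_at_risk: bool) -> List[str]:
    """Map the two operator findings to the runbook actions, in order."""
    actions: List[str] = []
    if active_truth_safe:
        actions += ["keep last known good active", "do not force publish"]
    if freshness_at_risk:
        actions += ["mark incident severity",
                    "suspend maturity or expansion decisions for scope"]
    # Evidence is recorded on every path, whatever the findings were.
    actions.append("record evidence and next review time")
    return actions
```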

Closure condition:

Runbook: publish failure

Use when:

Steps:

  1. Identify failing stage:
    • verifier gate
    • attestation
    • R2 write
    • manifest promotion
    • canary readback
  2. Confirm active manifest still points to valid last known good.
  3. Do not bypass publish gates.
  4. If failure is isolated and active truth remains safe:
    • leave active manifest unchanged
    • investigate cause
  5. If repeated failures suggest unstable lane behavior:
    • pause graduation for affected scope
    • consider AI kill switch or scope narrowing
  6. Capture audit evidence and operator notes.
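
Step 1 can be sketched as a walk over the publish stages in the order listed above. The stage identifiers are hypothetical. A stage with no recorded result is treated as failing, in line with the fail-closed principle.

```python
from typing import Dict, Optional

# Order follows the runbook step list; identifiers are hypothetical.
PUBLISH_STAGES = [
    "verifier_gate",
    "attestation",
    "r2_write",
    "manifest_promotion",
    "canary_readback",
]


def first_failing_stage(results: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage without an explicit pass, or None."""
    for stage in PUBLISH_STAGES:
        if results.get(stage) is not True:
            return stage
    return None
```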

Closure condition:

Runbook: stale-pack breach

Use when:

Steps:

  1. Confirm affected scope and threshold breached.
  2. Check whether refresh is blocked, delayed, or silently not due.
  3. Determine whether the stale state is:
    • localized
    • country-wide
    • pipeline-wide
  4. If freshness claim is no longer defensible:
    • treat scope as degraded operationally
    • stop any expansion or graduation decisions relying on that scope
  5. Restore refresh path or adjust public truth claims before closure.
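
The threshold check in step 1 might look like the following. It assumes the manifest exposes a publish time and that each supported scope has a maximum allowed age. Both are assumptions, as is the function name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(published_at: Optional[datetime], max_age: timedelta,
             now: datetime) -> bool:
    """True when the active pack is older than the allowed age.

    A missing publish time counts as stale (fail closed).
    """
    if published_at is None:
        return True
    return (now - published_at) > max_age


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
published = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_stale(published, timedelta(days=7), now))  # True: 9 days > 7 days
```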

Closure condition:

Runbook: fallback-path spike

Use when:

Steps:

  1. Confirm whether spike is:
    • locality-specific
    • country-lane specific
    • model-wide
  2. Check runtime policy state:
    • sorter_mode
    • risk_state
    • allowed_error_codes
    • country-lane graduation counters
  3. Stop treating affected scope as mature.
  4. If AI instability is involved:
    • narrow scope or activate AI kill switch
  5. If source quality drift is involved:
    • treat as source-quality incident, not model win
  6. Preserve evidence for lane-governance review.
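
A spike only has meaning relative to a baseline. The sketch below compares the observed fallback rate with a baseline rate multiplied by a tolerance factor. The factor of 2.0 is an arbitrary illustrative default and is not a BinKindly threshold.

```python
def is_fallback_spike(fallback_count: int, total_count: int,
                      baseline_rate: float, tolerance: float = 2.0) -> bool:
    """True when the observed fallback rate exceeds baseline * tolerance."""
    if total_count <= 0:
        # No traffic in the window, so there is nothing to compare.
        return False
    observed_rate = fallback_count / total_count
    return observed_rate > baseline_rate * tolerance
```

Running the same check per locality, per country lane, and across the whole model is one way to answer the three scope questions in step 1.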

Closure condition:

Runbook: AI kill switch

Purpose:

Use when:

Expected effect:

Actions:

  1. Freeze graduation decisions for affected scope.
  2. Disable or narrow the relevant AI lane.
  3. Confirm that no unsupported silent degradation has been introduced.
  4. Run targeted validation on a representative locality set.
  5. Record:
    • reason
    • scope
    • activation time
    • validation result

Closure condition:

Canonical AI kill-switch evidence path:

Runbook: bad-pack emergency kill switch

Purpose:

Use when:

Primary actions:

  1. Roll back active manifest to last known good pack.
  2. If needed, disable affected locality overlay and serve baseline-only where truthful.
  3. Confirm pack checksum and manifest integrity.
  4. Probe manifest resolution and pack read path after containment.
  5. Preserve incident evidence before attempting forward repair.
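
Step 3 can be sketched as a digest comparison. The runbook does not name the checksum algorithm. SHA-256 with a hex digest is assumed here purely for illustration, and the function name is hypothetical.

```python
import hashlib
import hmac


def pack_matches_manifest(pack_bytes: bytes, manifest_checksum: str) -> bool:
    """Compare the pack's SHA-256 hex digest with the manifest's value.

    SHA-256 is an assumption; substitute the algorithm actually in use.
    """
    actual = hashlib.sha256(pack_bytes).hexdigest()
    # compare_digest avoids early exit on the first differing character.
    return hmac.compare_digest(actual, manifest_checksum.lower())
```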

Closure condition:

Canonical bad-pack emergency evidence path:

Rollback objective

Rollback target:

  1. restore a safe active manifest state within 15 minutes of confirmed bad publish
  2. confirm manifest and pack integrity within 30 minutes
  3. complete operator evidence record within the same incident window

This is the current operational objective. It is not considered proven until drill evidence exists.
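
Drill or incident timestamps can be scored against the first two targets with a short check. The sketch assumes both limits are measured from the moment the bad publish was confirmed. The text does not state the start point for the 30-minute limit, so that reading is an assumption, and the names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict

RESTORE_LIMIT = timedelta(minutes=15)    # safe active manifest restored
INTEGRITY_LIMIT = timedelta(minutes=30)  # manifest and pack integrity confirmed


def rollback_within_objective(confirmed_at: datetime, restored_at: datetime,
                              integrity_confirmed_at: datetime) -> Dict[str, bool]:
    """Report whether each timed rollback target was met."""
    return {
        "restore_ok": restored_at - confirmed_at <= RESTORE_LIMIT,
        "integrity_ok": integrity_confirmed_at - confirmed_at <= INTEGRITY_LIMIT,
    }
```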

Canonical rollback evidence path:

Rollback checklist

Step | Action | Evidence
1 | Identify affected locality or baseline scope | incident record
2 | Confirm current active manifest and last known good target | manifest probe
3 | Rebind or restore active manifest to last known good | publisher response and audit event
4 | Confirm manifest checksum and pack checksum alignment | readback verification
5 | Probe resolver and rules API for affected scope | post-rollback probe log
6 | Capture timestamps and outcome against rollback objective | drill or incident record

Launch-day production validation

Launch day is not complete until all pass:

  1. final release build checksum archived
  2. production config verified:
    • API endpoints
    • auth keys
    • environment flags
    • kill-switch defaults
  3. first publish job completes and is validated
  4. live app fetch resolves manifest and pack correctly
  5. sign-off captured:
    • product
    • engineering
    • QA

Launch-day validation sequence

  1. Verify production config snapshot.
  2. Trigger or observe first production publish.
  3. Probe:
    • active manifest
    • active pack
    • resolver endpoint
    • rules endpoint where applicable
  4. Verify live app fetch on target build.
  5. Archive evidence.

Canonical launch-day validation evidence path:

No-go rules

Do not proceed with launch or expansion if any are true:

  1. rollback objective is undocumented or unproven
  2. kill switches are undefined or untested
  3. stale-pack alerting is absent
  4. publish failures are not routed visibly
  5. active supported scope cannot be validated in production
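
The no-go rules lend themselves to a gate that fails closed. In this sketch each rule has a readiness flag, and a flag that is missing or not explicitly True blocks the launch. The flag names are hypothetical, and the reasons are the rule texts above.

```python
from typing import Dict, List

# readiness flag (hypothetical name) -> no-go reason from the list above
NO_GO_CHECKS: Dict[str, str] = {
    "rollback_objective_proven": "rollback objective is undocumented or unproven",
    "kill_switches_tested": "kill switches are undefined or untested",
    "stale_pack_alerting_present": "stale-pack alerting is absent",
    "publish_failures_routed": "publish failures are not routed visibly",
    "supported_scope_validated": "active supported scope cannot be validated in production",
}


def no_go_reasons(status: Dict[str, bool]) -> List[str]:
    """Return every no-go reason that applies; an empty list means go."""
    return [reason for flag, reason in NO_GO_CHECKS.items()
            if status.get(flag) is not True]
```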

Whisper-audit wording

[Intention: Amanah] Publish remains paused for this scope until last-known-good safety, rollback evidence, and truth checks are restored.