Incident Response Playbooks

One playbook per class of incident. Every playbook follows the same shape: Detect → Triage → Mitigate → RCA template.

Pair with runbook.md for day-2 ops and observability.md for the underlying metrics.

Severity definitions:

| Sev  | Impact | Response |
|------|--------|----------|
| SEV1 | Money is moving incorrectly, or player experience is broken for > 5% of sessions | Page oncall immediately; war-room until mitigated |
| SEV2 | Single operator affected; money risk quantifiable and bounded | Oncall during business hours; war-room if unresolved > 4h |
| SEV3 | Degraded observability or non-player-impacting error rate | Ticket; triage next business day |

Use the closest applicable playbook; if nothing fits, page the engineering lead and file a new playbook post-RCA.


1. Stuck rollbacks (> 5 min in PendingWalletJob)

Severity: SEV2 per operator; SEV1 if > 1% of all pending jobs are affected across operators.

Detect

  • Alert: pending_wallet_jobs{endpoint="ROLLBACK"} > 0 for > 5m.
  • Dashboard: the "Rollback queue depth" panel in the SLO dashboard.
  • Operator complaints: players reporting that their stake was debited but the round was voided / the refund never arrived.

Triage

  1. Query: SELECT operator_id, count(*), max(age(now(), created_at)) FROM pending_wallet_jobs WHERE endpoint='ROLLBACK' AND completed_at IS NULL GROUP BY 1 ORDER BY 2 DESC;
  2. Is the queue growing or stable? SELECT date_trunc('minute', created_at), count(*) FROM pending_wallet_jobs WHERE endpoint='ROLLBACK' AND completed_at IS NULL GROUP BY 1 ORDER BY 1 DESC LIMIT 30;
  3. What's the last_error? SELECT last_error, count(*) FROM pending_wallet_jobs WHERE endpoint='ROLLBACK' AND completed_at IS NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

Mitigate

  • Common root cause: operator wallet is rejecting the rollback. Check their WalletCall error rows: SELECT created_at, status, response_body FROM wallet_calls WHERE operator_id=<X> AND endpoint='ROLLBACK' AND succeeded=false ORDER BY created_at DESC LIMIT 20;
  • If RS_ERROR_TRANSACTION_DOES_NOT_EXIST: the operator has forgotten the original bet. Mark the pending job completed with a manual override (the stake never actually left their wallet); see the sketch after this list.
  • If RS_ERROR_TRANSACTION_ALREADY_ROLLED_BACK / duplicate: same; mark completed.
  • If 5xx or timeout: the operator wallet is down. Notify them; keep retrying.
  • If the RGS circuit breaker has opened for that operator: see §5.
  • If it's all operators at once: likely a bug in the retry path. Revert to the prior version and file a SEV1.
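
A minimal sketch of the manual override, assuming pending_wallet_jobs has an id primary key and that setting completed_at is how jobs are retired (neither is confirmed by this playbook; check the schema first):

```sql
-- Manual override for one stuck rollback job. <job_id> and <ticket> are
-- placeholders; run inside a transaction and confirm before retiring.
BEGIN;

-- Confirm this is the job you mean and that it is still pending.
SELECT id, operator_id, endpoint, attempts, last_error, created_at
FROM pending_wallet_jobs
WHERE id = '<job_id>' AND completed_at IS NULL;

-- Retire it, leaving a breadcrumb in last_error for the RCA.
UPDATE pending_wallet_jobs
SET completed_at = now(),
    last_error   = 'manual override: RS_ERROR_TRANSACTION_DOES_NOT_EXIST (SEV <ticket>)'
WHERE id = '<job_id>' AND completed_at IS NULL;

COMMIT;
```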

RCA template

  • What was the operator's failure mode?
  • Did our classifier correctly treat an unknown response as retryable?
  • How long did it take to detect? (Alert latency + human response.)
  • What monitor would reduce detection time?

2. RTP drift alert

Severity: SEV2 when the 24h window drifts beyond ±3σ; SEV1 when the rolling 1h window exceeds ±5σ.

Detect

  • Alert: abs(rtp_actual_rolling_24h{game_code="ketapola-dice"} - <theoretical>) ≥ 3σ (see observability.md). (Matching labels fire for other game_code values as new plugins ship.)

Triage

  1. Reproduce: run the per-game 10M-round regression. bun test tests/games/<code>/rtp-regression.spec.ts (e.g. tests/games/ketapola-dice/rtp-regression.spec.ts). Does it still pass with the current config?
  2. Check config audit: SELECT * FROM operator_config_audit_log WHERE field_name IN ('lowWeight','highWeight','commissionMicro') AND created_at > now() - interval '24h' ORDER BY created_at DESC;
  3. Check round counts per operator; a single outlier operator can move the mean.
  4. Sample rounds: pull 50 random rounds in the alert window and re-verify the proof end-to-end (the public-facing verifier). If any fail: stop everything and page SEV1. A sampling query sketch follows this list.
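
The sampling query for step 4 might look like this; the rounds table name and its columns are assumptions, not confirmed schema:

```sql
-- Pull 50 random rounds from the alert window for end-to-end proof
-- re-verification. Table and column names are assumed; adjust to the schema.
SELECT id, game_code, created_at
FROM rounds
WHERE game_code = 'ketapola-dice'
  AND created_at >= '<window_start>'
  AND created_at <  '<window_end>'
ORDER BY random()
LIMIT 50;
```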

Mitigate

  • Weight config changed unexpectedly: revert the config via the portal (the audit log shows who changed it). Verify the engine picks up the reverted weights on the next round.
  • Single outlier operator: quarantine them by setting OperatorGameConfig.enabled = false (see the sketch after this list). Their in-flight rounds are voided; bets currently held are refunded.
  • RNG behaviour actually drifted: this is SEV1. Roll back the deploy. Any deploy that touched rng.ts without a CERT-ATTEST means the CI gate failed; investigate how it reached prod.
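
A quarantine sketch for the outlier-operator case, assuming the Prisma model OperatorGameConfig maps to an operator_game_configs table with a game_code column (both assumptions; verify before running):

```sql
-- Disable one game for one operator. In-flight rounds void and held bets
-- refund, per the bullet above.
UPDATE operator_game_configs
SET enabled = false
WHERE operator_id = '<X>'
  AND game_code   = 'ketapola-dice';
```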

RCA template

  • Root: config change, operator-induced, statistical variance, or RNG bug?
  • Effective RTP observed; player impact in micro-units.
  • Was any player overpaid or underpaid? Manual reconciliation required?

3. Operator credential leak (suspected or confirmed)

Severity: SEV1.

Detect

  • Operator notifies us of a leak.
  • Anomalous request pattern (requests from a new IP; signed with an old kid thought to be retired; high request volume).
  • Public disclosure (HIBP, paste site, exposed repo).

Triage

  1. Identify the credential: SELECT id, kid, created_at, revoked_at FROM operator_credentials WHERE operator_id=<X> AND kid=<Y>;
  2. Has the credential signed anything in the last 24h? SELECT created_at, count(*) FROM wallet_calls WHERE operator_id=<X> AND created_at > now() - interval '24h' GROUP BY 1 ORDER BY 1 DESC;
  3. Is there evidence of misuse? Anomalous session creation pattern? Unexplained InboundIdempotency rows?

Mitigate

  • Immediate revocation: UPDATE operator_credentials SET revoked_at=now() WHERE id=<credential_id>; or revoke via the portal.
  • Issue a replacement: coordinate the new (kid, secret) delivery through a secure channel (1Password / Vault / PGP). See runbook.md §6.1.
  • Audit: scan the last 7 days of WalletCall and InboundIdempotency for rows signed by the leaked credential (see the sketch after this list). Any request that doesn't match a known-good session pattern is a finding.
  • Notify the operator's compliance contact. Every operator has one; if not, this is also an RCA finding.
  • Alert players only if confirmed misuse with player impact.
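
A sketch of the 7-day audit scan, assuming wallet_calls records the signing kid; if it doesn't, join through whatever table ties requests to operator_credentials instead:

```sql
-- Hourly traffic profile for the leaked credential over the audit window.
-- Spikes at odd hours or unfamiliar endpoints are findings.
SELECT date_trunc('hour', created_at) AS hour,
       endpoint,
       count(*) AS calls
FROM wallet_calls
WHERE operator_id = '<X>'
  AND kid = '<Y>'                      -- assumed column; see lead-in
  AND created_at > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;
```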

RCA template

  • Source of leak (internal, external, third-party).
  • Dwell time (first misuse → revocation).
  • Any transactions to roll back?
  • Would MFA / an IP allow-list have prevented this? (Gap analysis for future mitigations.)

4. Clock skew spike (> 10s sustained)

Severity: SEV2 for the affected operator.

Detect

  • Alert: signature_rejection_reason="clock_skew_outside_window" count rising.
  • Operator reports intermittent 401s on otherwise-valid requests.

Triage

  1. Whose clock? Compare the Timestamp header in rejected requests against our server clock. Is the drift one-sided (operator ahead/behind) or bidirectional?
  2. Is it a specific host or all? Many operators run multiple pods.
  3. Is our own clock drifting? Check host NTP status.

Mitigate

  • Operator-side drift: notify their ops; recommend Chrony or native cloud NTP. Short-term, widen SIGNATURE_WINDOW_SECONDS per-env (not per-operator; the env var is global). Do not exceed 60s.
  • Our drift: restart ntpd/chronyd on the affected host; page infra.
  • Clock tampering by an attacker: if drift is attacker-influenced (one request far in the past/future), treat as replay/spoof attempt; see §3.

RCA template

  • Source of drift (upstream NTP, hypervisor, attacker).
  • Duration.
  • Any money-path rejections? Did they retry successfully?

5. Circuit breaker tripped for an operator

Severity: SEV2 for that operator.

Detect

  • Log: circuit_breaker_opened for operatorId=<X>.
  • Metric: circuit_breaker_state{operator="<X>"} = 1.
  • Symptom: new bets for that operator are immediately rejected.

Triage

  1. Why did it open? Check the last N failed WalletCall rows for the operator: SELECT created_at, endpoint, response_status, response_body FROM wallet_calls WHERE operator_id=<X> AND succeeded=false ORDER BY created_at DESC LIMIT 20;
  2. Is the operator wallet up? Try a synthetic balance call manually.
  3. Is it a specific endpoint failing (/win only) or all?
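
The per-endpoint breakdown for step 3 can come straight from wallet_calls (columns as used elsewhere in this playbook):

```sql
-- Which endpoints are failing for the tripped operator, and with what status.
SELECT endpoint,
       response_status,
       count(*)        AS failures,
       max(created_at) AS latest
FROM wallet_calls
WHERE operator_id = '<X>'
  AND succeeded = false
  AND created_at > now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY failures DESC;
```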

Mitigate

  • Operator wallet is actually down: leave the breaker open. Notify them. When they're healthy, the breaker auto-closes after the cooldown + successful probe.
  • Our retry classifier is mis-treating their responses: if a valid operator-side code is being classified as a failure, that's a bug; patch the classifier and ship a hotfix per change-management.md §4.
  • False-positive flap: adjust the CircuitBreaker thresholds in the config (this is a regulated change; see change-management.md).

RCA template

  • Dwell time (first failure → breaker open → first recovery).
  • Number of bets rejected during the outage.
  • Was the operator able to detect the downtime from their side?

6. Pending job queue growth (non-rollback)

Severity: SEV2 if queue depth > 1000 for > 10 min.

Detect

  • Alert: pending_wallet_jobs > 1000 for > 10m.
  • Dashboard: queue depth panel.

Triage

  1. SELECT endpoint, operator_id, count(*) FROM pending_wallet_jobs WHERE completed_at IS NULL GROUP BY 1,2 ORDER BY 3 DESC LIMIT 20;
  2. Are attempts incrementing? If not, the runner is stopped; check process health. (See the probe query after this list.)
  3. Is any single operator swamping the queue?
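
A probe for step 2: run this twice, a minute apart. If attempts and last_error on the oldest jobs never change, the runner is not making progress.

```sql
-- Oldest pending jobs. attempts should increment between runs if the
-- PendingJobRunner is alive and retrying.
SELECT id, operator_id, endpoint, attempts, last_error, created_at
FROM pending_wallet_jobs
WHERE completed_at IS NULL
ORDER BY created_at ASC
LIMIT 10;
```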

Mitigate

  • Runner stopped: restart the RGS. Graceful shutdown drains the queue; forced kill leaves jobs for the next process (safe, idempotent).
  • Single-operator swamp: treat as an operator outage (§10); the circuit breaker (§5) should kick in and prevent new inflow.
  • Genuine backlog: scale horizontally (add rgs-server pods) or temporarily increase PendingJobRunner concurrency.

RCA template

  • Depth-over-time curve.
  • Drain rate when mitigation applied.
  • What monitor would catch this earlier?

7. DB connection exhaustion

Severity: SEV1 (rgs-server becomes unavailable).

Detect

  • Alert: prisma_pool_wait_time_ms p99 > 500.
  • Logs: Error: getaddrinfo ENOTFOUND / connection pool timeout.
  • /readyz returns 503 across all pods.

Triage

  1. DB-side: SELECT count(*) FROM pg_stat_activity WHERE datname='yantra'; Is it at max_connections?
  2. SELECT state, count(*) FROM pg_stat_activity WHERE datname='yantra' GROUP BY 1; Any sessions idle in transaction?
  3. App-side: are any pods stuck holding connections (infinite loop, slow query)?
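
To pin down the idle-in-transaction holders from step 2 (every column below is a standard pg_stat_activity field):

```sql
-- Sessions holding transactions open, oldest first. Candidates for
-- pg_terminate_backend(pid) if they're the ones starving the pool.
SELECT pid,
       usename,
       now() - xact_start AS xact_age,
       left(query, 80)    AS last_query
FROM pg_stat_activity
WHERE datname = 'yantra'
  AND state = 'idle in transaction'
ORDER BY xact_age DESC;
```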

Mitigate

  • Short-term: restart the RGS pods (drops connections, forces reconnect with a fresh pool).
  • Medium-term: raise Postgres max_connections or put PgBouncer in front.
  • Long-term: identify the query causing idle-in-transaction sessions; Prisma logs queries at debug level.

RCA template

  • Saturation curve (connection count over time).
  • Specific query that held transactions open.
  • Whether PgBouncer would have absorbed it.

8. Reconciliation mismatch (MAJOR_DRIFT status)

Severity: SEV2 if one operator; SEV1 if multiple.

Detect

  • Daily reconcile cron exits with code 2 (MAJOR_DRIFT).
  • Operator reports a settlement discrepancy.

Triage

  1. Run the reconcile CLI manually with verbose output: bun scripts/reconcile.ts --file settlement.csv.
  2. Which direction does the drift go? (Ours high → we processed txns they didn't ack. Ours low → they sent settlements we didn't process.)
  3. Compare row-by-row: SELECT transaction_uuid, amount_micro, created_at, succeeded FROM wallet_calls WHERE operator_id=<X> AND currency=<Y> AND created_at >= '<D>' AND created_at < '<D+1>' ORDER BY created_at; and diff against the operator's file (one approach sketched below).
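
One way to do the row-by-row diff in psql, a sketch assuming you've extracted the operator file's UUID column into settlement_uuids.csv (a hypothetical intermediate file):

```sql
-- Load the operator's transaction UUIDs, then set-difference both ways.
CREATE TEMP TABLE operator_file (transaction_uuid text PRIMARY KEY);
\copy operator_file FROM 'settlement_uuids.csv' WITH (FORMAT csv)

-- In our ledger but not in their file ("ours high"):
SELECT transaction_uuid
FROM wallet_calls
WHERE operator_id = '<X>' AND currency = '<Y>'
  AND created_at >= '<D>' AND created_at < '<D+1>'
EXCEPT
SELECT transaction_uuid FROM operator_file;

-- In their file but not in our ledger ("ours low"):
SELECT transaction_uuid FROM operator_file
EXCEPT
SELECT transaction_uuid
FROM wallet_calls
WHERE operator_id = '<X>' AND currency = '<Y>'
  AND created_at >= '<D>' AND created_at < '<D+1>';
```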

Mitigate

  • Ours high by N transactions: those are probably stuck in pending_wallet_jobs (not yet settled on their side). Cross-check.
  • Ours low: we may have missed settlements; look for WalletCall rows with succeeded=false that we did not retry.
  • If drift can't be explained: halt new settlements for that operator, escalate to their compliance contact.

RCA template

  • Drift value in micro-units and percentage.
  • Attribution: which transactions account for the drift?
  • Whether they'll reconcile naturally on the next day's close or need a manual correction entry.

9. RNG change landed without re-cert

Severity: SEV1.

Detect

  • Post-deploy: RTP regression test fails.
  • rng-change-gate CI was somehow bypassed (manual force-merge, admin override).
  • Test-vectors spec fails on a freshly-deployed build.

Mitigate

  • Immediate: roll back the deploy. Every second of uptime at a post-change RTP is a certification exposure.
  • Freeze further deploys until root cause is identified.
  • Notify compliance / legal; a re-cert conversation may be necessary.

RCA template

  • How did the change reach main?
  • What gate failed? (Reviewer asleep? Admin override? CI bypass?)
  • How to prevent: required status checks, branch protection, CODEOWNERS.

10. Operator wallet down > 10 min

Severity: SEV2 per operator.

Detect

  • Circuit breaker open (§5).
  • wallet_call_latency_ms{endpoint="bet", operator="<X>"} p99 → timeout.

Triage

  1. Synthetic /wallet/balance from our side: does it succeed?
  2. Check operator's status page if they have one.
  3. How many bets currently rejected? How many winners pending?

Mitigate

  • Keep the circuit breaker open; no new bets for that operator while it's down.
  • Pending winners accumulate in pending_wallet_jobs; they will drain on recovery.
  • Communicate: post a status update to the operator-portal's incident banner so support can forward to players.
  • If downtime crosses 1h and the operator has not acknowledged, escalate to their compliance contact.

RCA template

  • Duration of outage.
  • Queue depth at reopen.
  • Drain time.
  • Any bets void-refunded due to the outage?

11. Post-incident review

Every SEV1 and every unexpected SEV2 gets a written post-mortem within 5 business days. Template:

Title: <system> <incident>, <date>
Severity: SEV<N>
Detected at: <UTC>
Mitigated at: <UTC>
Duration: <hh:mm>

Summary:
  Two-sentence synopsis.

Impact:
  - N players affected
  - M micro-units of stakes in uncertain state
  - $ value if quantifiable

Timeline:
  HH:MM UTC, event
  HH:MM UTC, detection
  HH:MM UTC, ...
  HH:MM UTC, resolved

Root cause:
  Where the system failed, and why. No blame, focus on mechanisms.

What worked:
  Monitors / runbooks / people that did their job.

What did not work:
  Gaps, confusion, mis-alerts, missing runbook entries.

Follow-up:
  - [ ] Action item (owner, due date)
  - [ ] Action item
  - [ ] Action item

Artifacts:
  - Link to SEV ticket
  - Link to any code changes
  - Link to any updated monitor / alert / runbook section

If the incident touched a regulated area (RNG, wallet, schema), the post-mortem is also shared with the compliance owner per change-management.md.


12. Playbook version

| Field | Value |
|-------|-------|
| Playbook version | 1.0.0 |
| Last reviewed | 2026-04-23 |
| Next review trigger | After any incident not in the list above |