# Operator Runbook
Day-2 operations for Yantra Gaming. What to do on first deploy, what to do when things break, how to rotate keys, how to back up money-path data.
Audience: SRE / ops engineer / infra lead at a licensee.
Companions:

- architecture.md: what the system is
- incidents.md: what to do when it's on fire
- security.md: threat model + published SLOs
- change-management.md: RNG / math change gating
- observability.md: metrics, alerts, RTP drift monitor
## 1. Deployment topologies

### 1.1 Local dev (reference)

`docker compose up -d` starts Postgres on `:5434`. `bun run dev` runs all four apps. See README.md Quickstart.
### 1.2 Single-region production

Minimum viable prod:

```
                     ┌──── TLS/HSTS ────┐
Internet ──▶ WAF ──▶ │  Load balancer   │ ──▶ rgs-server × N (3+)
                     └──────────────────┘              │
                                                       ▼
                                      Postgres 16 (primary + 2 replicas)
                                                       │
                                                       ▼
                                      Daily pg_dump → object storage
```
Stateful components:
- Postgres 16+ with logical replication to at least one hot replica.
- Persistent volume for pg_data.
- Encrypted object storage for pg_dump output (retention per §5).
Stateless components:
- rgs-server: horizontally scalable, sharded on operatorId if >10k
concurrent sessions. See §1.4 for sharding notes.
- operator-portal, game-client: static builds served from CDN.
### 1.3 Multi-region

Cert-lab jurisdictions (Germany, Malta) require in-country data residency. Pattern:
- Primary region with writer Postgres.
- Per-jurisdiction read replica + local rgs-server cluster.
- Round writes go to the primary; proof reads serve from the nearest replica.
- Session JWTs carry `jurisdiction`; the engine honours the claim when selecting which replica to read from.
Full multi-region is on the v1.1 roadmap (see B2B_ROADMAP.md §17); the schema already supports it (no region-local auto-increment keys; every ID is a UUID).
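At the load-balancer or gateway layer, the `jurisdiction` claim can be read without verifying the signature (the RGS still verifies HS256 before trusting anything). A minimal sketch in shell, assuming the claim is a flat string field as described above:

```shell
# Extract a claim from a JWT payload without verifying the signature.
# Routing-layer use only; the engine still verifies HS256 before trusting it.
jwt_claim() {
  token="$1"; claim="$2"
  payload="${token#*.}"      # drop header
  payload="${payload%%.*}"   # drop signature
  # base64url -> base64, re-pad to a multiple of 4
  b64=$(printf '%s' "$payload" | tr '_-' '/+')
  case $(( ${#b64} % 4 )) in 2) b64="$b64==" ;; 3) b64="$b64=" ;; esac
  printf '%s' "$b64" | openssl base64 -d -A |
    grep -o "\"$claim\":\"[^\"]*\"" | cut -d'"' -f4
}

# Usage: pick the replica for the session's jurisdiction
# jwt_claim "$SESSION_JWT" jurisdiction   # e.g. prints MT
```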
### 1.4 Horizontal scale

A single rgs-server process owns the game-engine loop for one `(operatorId, gameCode, currency)` tuple at a time. To scale past one node:

- Shard on `operatorId` hash → assign each shard range to a distinct `rgs-server` pod.
- Use a leader lock (advisory lock in Postgres or Redis) to guarantee a single writer per shard.
- Keep socket connections sticky to the owning shard (load balancer: consistent-hash on `operatorId` from the session JWT).
Not implemented in this repo. `EngineRegistry.startAllEnabled()` currently starts every enabled engine on boot, which is correct for single-node deployments and the dev loop.
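For illustration, the shard-selection math and the leader lock can be sketched as follows; `cksum` is a stand-in hash (any stable hash shared by the LB and the pods works) and the lock-namespace key `42` is an arbitrary assumption:

```shell
# Map an operatorId to a shard index with a stable hash.
shard_for() {
  op="$1"; n_shards="$2"
  h=$(printf '%s' "$op" | cksum | cut -d' ' -f1)
  echo $(( h % n_shards ))
}

# Each pod then takes a Postgres advisory lock on its shard before starting
# engines, guaranteeing a single writer per shard, e.g.:
#   SELECT pg_try_advisory_lock(42, <shard_index>);
```

The same hash must drive the load balancer's consistent-hash routing so sockets land on the pod that holds the lock.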
## 2. Docker

The repo ships docker-compose.yml for Postgres only. A production Dockerfile for rgs-server is a licensee-specific artefact, but a minimal reference:

```dockerfile
FROM oven/bun:1.3.10-alpine AS build
WORKDIR /app
COPY package.json bun.lock ./
COPY apps/rgs-server ./apps/rgs-server
COPY packages ./packages
RUN bun install --frozen-lockfile
RUN cd apps/rgs-server && bun run build

FROM oven/bun:1.3.10-alpine AS runtime
WORKDIR /app
COPY --from=build /app /app
WORKDIR /app/apps/rgs-server
EXPOSE 4500
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:4500/readyz || exit 1
CMD ["bun", "dist/index.js"]
```
Pin Bun to the version in `.tool-versions` (1.3.10 at time of writing). Never use `oven/bun:latest` in a production build.
## 3. Environment variables

Every variable the RGS reads. Source: `apps/rgs-server/src/config.ts`.
### 3.1 Required

| Var | Type | Purpose | Failure mode |
|---|---|---|---|
| `DATABASE_URL` | Postgres connection string | DB connection for Prisma | Boot fails |
| `SESSION_JWT_SECRET` | ≥ 16-char string | HS256 signing key for player session JWTs | Boot fails |
| `PORTAL_JWT_SECRET` | ≥ 16-char string | HS256 signing key for operator-portal login cookies | Boot fails |
| `SECRETS_MASTER_KEY_B64` | 32-byte base64 | AES-GCM key for encrypting `OperatorCredential.cipherBlob` | Boot fails; rotating this requires re-encrypting every credential (§6) |
### 3.2 Optional (with defaults)

| Var | Default | Purpose |
|---|---|---|
| `PORT` | `4500` | HTTP + Socket.IO bind port |
| `CORS_ORIGIN` | `http://localhost:3100,http://localhost:3101,http://localhost:3102` | Comma-separated allowed origins. Set to production domains in prod. |
| `SIGNATURE_WINDOW_SECONDS` | `30` | Max clock skew on inbound HMAC requests. Do not exceed 60. |
| `GAME_CLIENT_BASE_URL` | `http://localhost:3100` | Base URL used in launch responses. |
| `WALLET_CALL_TIMEOUT_MS` | `5000` | Outbound wallet-call timeout. Beyond this → synthetic rollback. |
| `NODE_ENV` | `development` | `development` / `production` / `test` |
| `TURNSTILE_SECRET_KEY` | (unset → disabled) | Cloudflare Turnstile secret. When set, `/v1/session` requires a valid token. |
| `TURNSTILE_SITEVERIFY_URL` | `https://challenges.cloudflare.com/turnstile/v0/siteverify` | Override for self-hosted siteverify. |
### 3.3 Observability (optional)

| Var | Default | Purpose |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | (unset → disabled) | OTLP collector endpoint |
| `OTEL_SERVICE_NAME` | `yantra-rgs` | Service name for traces |
| `OTEL_LOG_LEVEL` | `info` | OTel SDK log verbosity |
### 3.4 Secret-strength guidance

- `SESSION_JWT_SECRET` / `PORTAL_JWT_SECRET`: ≥ 32 random bytes, base64- or hex-encoded.
- `SECRETS_MASTER_KEY_B64`: exactly 32 bytes, base64-encoded.

In production, all three of these should come from a managed secret store (AWS KMS, HashiCorp Vault, GCP Secret Manager) injected at boot, never from a static env file. The dev `.env.example` placeholders are all-zero bytes; rotate them before any non-dev use.
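A sketch of generating conforming values with `openssl` (any CSPRNG of equivalent strength works); `gen_b64_key` is a hypothetical helper for illustration, not part of the repo:

```shell
# n random bytes, base64-encoded
gen_b64_key() { openssl rand -base64 "$1"; }

SESSION_JWT_SECRET=$(gen_b64_key 32)
PORTAL_JWT_SECRET=$(gen_b64_key 32)
SECRETS_MASTER_KEY_B64=$(gen_b64_key 32)

# Sanity check: the master key must decode to exactly 32 bytes
test "$(printf '%s' "$SECRETS_MASTER_KEY_B64" | openssl base64 -d -A | wc -c)" -eq 32
```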
## 4. Health checks

The RGS exposes three HTTP paths for liveness/readiness:

| Path | Purpose | Fails when |
|---|---|---|
| `/healthz` | Liveness: process can answer HTTP | Never fails once listening |
| `/readyz` | Readiness: DB reachable, seed operators started | DB unreachable, engine registry not initialised |
| `/metrics` | Prometheus scrape endpoint | Returns empty metrics if OTel disabled |
Kubernetes probe recommendations:
```yaml
readinessProbe:
  httpGet: { path: /readyz, port: 4500 }
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 3
livenessProbe:
  httpGet: { path: /healthz, port: 4500 }
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 5
```
`/readyz` should fail closed during graceful shutdown so the load balancer drains traffic before the process exits. The shipping shutdown handler in `index.ts` does this.
## 5. Backup and restore

### 5.1 What to back up

Every money-path table. Specifically:

- `Operator`, `OperatorCredential` (secrets are encrypted in the blob; still back up)
- `OperatorGameConfig`, `OperatorConfigAuditLog`
- `GameSession`, `Round`, `Bet`, `PendingRoundBet`
- `WalletCall`, `PendingWalletJob`, `InboundIdempotency`
- `OperatorUser`
`inbound_idempotency` can in principle be pruned after 24 h; in production, retain at least 30 days for dispute investigation.
### 5.2 Daily dump

```shell
pg_dump "$DATABASE_URL" \
  --format=custom \
  --no-owner \
  --no-privileges \
  --file="yantra_$(date -u +%Y%m%d).dump"

# encrypt before upload
age -r age1... -o "yantra_$(date -u +%Y%m%d).dump.age" "yantra_$(date -u +%Y%m%d).dump"
```
Upload the encrypted blob to object storage with object lock enabled so it cannot be tampered with before retention expires.
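The dump-encrypt-upload steps can be wrapped into one cron-safe script. A sketch, assuming an `age` recipient in `AGE_RECIPIENT` and an S3-compatible uploader; `backup_name`, `AGE_RECIPIENT`, and `BACKUP_BUCKET` are illustrative names, not repo conventions:

```shell
#!/bin/sh
set -eu

backup_name() { echo "yantra_$(date -u +%Y%m%d).dump"; }

run_backup() {
  f=$(backup_name)
  pg_dump "$DATABASE_URL" --format=custom --no-owner --no-privileges --file="$f"
  age -r "$AGE_RECIPIENT" -o "$f.age" "$f"
  rm "$f"                                            # never leave plaintext dumps on disk
  aws s3 cp "$f.age" "s3://$BACKUP_BUCKET/$f.age"    # bucket has object lock enabled
}
```

`set -eu` makes cron surface a failed dump or upload as a non-zero exit rather than silently uploading nothing.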
### 5.3 Retention
| Jurisdiction | Minimum retention |
|---|---|
| Unregulated / pre-launch | 90 days |
| Curaçao LOK | 5 years (for money-path and audit tables) |
| MGA Malta | 7 years |
| UKGC | 7 years |
Purging is governed by the longest applicable retention: in practice, a money-path row becomes purgeable only once it is older than the longest retention period among onboarded operators.
### 5.4 Restore drill

Run a restore drill every quarter. Untested backups are worse than no backups: you'll find out they're corrupted exactly when you need them.
```shell
# 1. Fresh DB
createdb yantra_restore_drill
pg_restore --dbname=yantra_restore_drill --no-owner yantra_YYYYMMDD.dump

# 2. Schema sanity
psql yantra_restore_drill -c 'SELECT count(*) FROM wallet_calls;'
psql yantra_restore_drill -c 'SELECT count(*) FROM rounds WHERE settled=false;'

# 3. Application smoke: boot the rgs-server against this DB, hit /readyz
DATABASE_URL=postgres://localhost:5432/yantra_restore_drill bun run dev:rgs
curl http://localhost:4500/readyz  # must be 200
```
Document the drill outcome. Failed restores are incidents; file them per incidents.md.
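The step-2 sanity checks can be made mechanical by comparing restored counts against counts recorded at dump time (recording them is an assumed extra step, not something the dump script does today). A sketch:

```shell
# Compare a restored row count against the count captured at dump time.
# Returns 0 on match, 1 on mismatch (treat a mismatch as an incident).
check_count() {
  table="$1"; expected="$2"; actual="$3"
  if [ "$actual" -ne "$expected" ]; then
    echo "DRILL FAIL: $table expected=$expected actual=$actual" >&2
    return 1
  fi
  echo "DRILL OK: $table count=$actual"
}

# Usage with psql (-A unaligned, -t tuples only, -c command):
#   actual=$(psql yantra_restore_drill -Atc 'SELECT count(*) FROM wallet_calls;')
#   check_count wallet_calls "$expected_wallet_calls" "$actual"
```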
## 6. Key rotation

### 6.1 Operator credentials

Each `OperatorCredential` row carries `notBefore`, `notAfter`, and `revokedAt`. The inbound-auth middleware accepts any credential for which `notBefore ≤ now < notAfter` and `revokedAt IS NULL`.
To rotate:
1. Portal → Operator Settings → Credentials → "Add new credential"
2. Overlap window: new credential `notBefore = now`; old credential `notAfter = now + 7d`
3. Distribute the new `(kid, secret)` to the operator via a secure channel (1Password, Vault, PGP)
4. After the operator confirms they're signing with the new credential, set the old credential's `revokedAt = now`
5. From that point, any inbound request signed with the old credential is rejected
Emergency rotation (suspected compromise): skip the overlap window. Revoke immediately, alert the operator, expect a brief integration outage.
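Steps 2 and 4 reduce to two timestamp updates. A sketch, assuming Prisma-default quoted table and column names (verify against `prisma/schema.prisma` before running); `OLD_KID` is an illustrative variable:

```shell
# Step 2's overlap deadline, computed client-side (GNU date):
overlap_end() { date -u -d '+7 days' '+%Y-%m-%dT%H:%M:%SZ'; }

# Step 2: open the 7-day overlap window on the old credential
#   psql "$DATABASE_URL" -c "UPDATE \"OperatorCredential\"
#       SET \"notAfter\" = now() + interval '7 days' WHERE \"kid\" = '$OLD_KID';"
# Step 4: after the operator confirms cutover, revoke it
#   psql "$DATABASE_URL" -c "UPDATE \"OperatorCredential\"
#       SET \"revokedAt\" = now() WHERE \"kid\" = '$OLD_KID';"
```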
### 6.2 SECRETS_MASTER_KEY_B64

This key encrypts every `OperatorCredential.cipherBlob`. Rotating it requires re-encrypting every row:
1. Generate new key: `openssl rand -base64 32`
2. Write a one-shot script:
- Boot with OLD key
- SELECT every OperatorCredential → decrypt to plaintext
- Switch to NEW key
- Re-encrypt → UPDATE OperatorCredential.cipherBlob
3. Deploy with new key
4. Old key retained for read-only access to ciphertext older than the
rotation (backups, archived rows) for the retention window (§5.3)
A script skeleton lives in `scripts/rotate-master-key.ts` (not yet implemented; add it on first production rotation). Until then, this is a manual procedure performed with a compliance lead present.
### 6.3 Session / portal JWT secrets

Rotating `SESSION_JWT_SECRET` invalidates every live player session; players will be kicked out mid-round. Mitigate by:

- Keeping the JWT lifetime short (≤ 60 min, already the default).
- Deploying the rotation in a low-traffic window.
- Optionally supporting dual-key verification during a rollover window by patching the session-auth middleware (not shipped by default).
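The dual-key idea is simply "accept a signature under either secret". A sketch of the check over a raw HS256 token, in shell for illustration; in the real middleware this logic would live in TypeScript next to the existing verify call:

```shell
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

# HS256 signature over the header.payload signing input with the given key
hs256() { printf '%s' "$1" | openssl dgst -sha256 -hmac "$2" -binary | b64url; }

# Accept a token signed with either the new or the old secret.
verify_dual() {
  token="$1"; new_key="$2"; old_key="$3"
  signing_input="${token%.*}"
  sig="${token##*.}"
  [ "$(hs256 "$signing_input" "$new_key")" = "$sig" ] ||
  [ "$(hs256 "$signing_input" "$old_key")" = "$sig" ]
}
```

Once every live session predates the rollover window's end, drop the old key and revert to single-key verification.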
`PORTAL_JWT_SECRET` rotation forces operator-portal users to log in again: a brief, acceptable impact.
## 7. Database maintenance

### 7.1 Migrations

```shell
# Dev:
bun run db:migrate

# Prod (apply only, never `dev`):
cd apps/rgs-server && bunx --bun prisma migrate deploy
```
Migrations that change money-path tables are gated; see change-management.md §1.2.
### 7.2 Index health

Watch these indexes:

- `wallet_calls (operator_id, direction, endpoint, request_uuid)`: the dedupe unique index; if bloated, dedupe performance drops.
- `rounds (operator_id, settled, created_at)`: crash-recovery scan.
- `pending_wallet_jobs (completed_at, next_attempt_at)`: retry queue.

`REINDEX CONCURRENTLY` these quarterly, or when `pg_stat_user_indexes.idx_scan` reveals a slowdown.
### 7.3 Vacuum

Autovacuum defaults are adequate for volumes up to ~10M bets/day. Above that, tune `autovacuum_vacuum_scale_factor` on the high-churn tables (`wallet_calls`, `rounds`, `bets`).
## 8. Graceful shutdown

The RGS handles SIGTERM / SIGINT:

1. Stop accepting new sessions (`/readyz` → 503).
2. `PendingJobRunner.stop()`: the current iteration finishes; no new jobs are pulled.
3. `reconciliationJob.stop()`.
4. `EngineRegistry.stopAll()`: every engine finishes its current round to SETTLED or VOIDED; new rounds are blocked.
5. `httpServer.close()` + `io.close()`: wait for active HTTP and WebSocket connections to drain.
6. `prisma.$disconnect()`.
7. Exit.
Do not `kill -9` a running RGS. Crash recovery is safe, but unnecessarily voided rounds are a bad player experience; the graceful path respects in-flight rounds.
Drain SLO: graceful shutdown completes in ≤ 60 seconds under normal load. If it doesn't, set the orchestrator's kill-after timeout accordingly (e.g. Kubernetes `terminationGracePeriodSeconds: 75`).
## 9. Daily reconciliation

Run `scripts/reconcile.ts` against the operator's daily settlement file:

```shell
bun scripts/reconcile.ts \
  --operator op_abc123 \
  --currency LKR \
  --date 2026-04-22 \
  --file operator-settlement-2026-04-22.csv
```
CSV format, one row per `(operatorId, date, currency)`:

```csv
operatorId,date,currency,betsAmountMicro,winsAmountMicro,rollbacksAmountMicro
op_abc123,2026-04-22,LKR,12500000000000,11675000000000,0
```
Exit codes:

- `0`: MATCH (drift == 0)
- `1`: MINOR_DRIFT (< 0.1 %; log warning, do not page)
- `2`: MAJOR_DRIFT (≥ 0.1 %; page oncall immediately)
- `3`: input parse error
Wire this into cron at 01:00 UTC daily, and route the exit code into your alerting.
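For ad-hoc checks, the exit-code thresholds can be reproduced outside the script. A sketch that assumes drift is measured as |ledger − file| / ledger on a single amount column; treat `reconcile.ts` as authoritative for the exact formula:

```shell
# 0 = MATCH, 1 = MINOR_DRIFT (< 0.1 %), 2 = MAJOR_DRIFT (>= 0.1 %)
classify_drift() {
  ledger="$1"; file="$2"
  awk -v l="$ledger" -v f="$file" 'BEGIN {
    d = l - f; if (d < 0) d = -d      # absolute drift in micro-units
    if (d == 0)        exit 0
    if (d / l < 0.001) exit 1
    exit 2
  }'
}

# Usage: classify_drift <ledger betsAmountMicro> <file betsAmountMicro>
```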
## 10. Monitoring dashboards

See observability.md for the full metrics catalogue. The minimum set you must have a dashboard on:

- `wallet_call_latency_ms{endpoint="bet"}`: p50 / p99 / p99.9
- `wallet_call_errors_total{endpoint="win", rs_status != "RS_OK"}`: any non-zero rate pages oncall
- `pending_wallet_jobs` gauge: alert if any row is older than 5 min
- `rtp_actual_rolling_24h{operator, currency}`: alert at |drift| > 3σ
- `round_state_transitions_total{to_state="VOIDED"}`: a spike indicates crash-recovery or operator-terminate activity
- `/readyz` probe success rate (Kubernetes readiness)
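As a starting point, the paging items above translate into Prometheus alerting rules roughly like this. The expressions are sketches: exact label sets come from observability.md, and the `_bucket` histogram suffix and the 500 ms threshold are assumptions, not published SLOs:

```yaml
groups:
  - name: yantra-rgs-minimum
    rules:
      - alert: WalletWinErrors
        # Any non-zero error rate on win calls pages oncall.
        expr: rate(wallet_call_errors_total{endpoint="win", rs_status!="RS_OK"}[5m]) > 0
        for: 1m
        labels: { severity: page }
      - alert: PendingWalletJobsStuck
        # Approximation of "any row older than 5 min": gauge non-zero for 5m.
        expr: pending_wallet_jobs > 0
        for: 5m
        labels: { severity: page }
      - alert: WalletBetLatencyP99
        # Assumes the latency metric is exported as a histogram.
        expr: histogram_quantile(0.99, rate(wallet_call_latency_ms_bucket{endpoint="bet"}[5m])) > 500
        for: 5m
        labels: { severity: warn }
```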
## 11. Runbook version
| Field | Value |
|---|---|
| Runbook version | 1.0.0 |
| Last reviewed | 2026-04-23 |
| Review cadence | Every major release, or after any incident that exposes a gap |