Observability¶
The RGS is instrumented end to end with OpenTelemetry (metrics + traces) and structured JSON logging. Everything exports over OTLP; no vendor lock-in. This document describes what is instrumented, how it is queried, and which alerts are wired up.
Cross-reference security.md for the SLO targets that these metrics serve.
OpenTelemetry setup¶
The RGS uses @opentelemetry/sdk-node with auto-instrumentation for HTTP, Express,
Socket.IO, and Prisma. OTEL_EXPORTER_OTLP_ENDPOINT switches the export target; the
other two variables identify the service and its deployment:
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example:4317
OTEL_SERVICE_NAME=yantra-rgs
OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=1.4.2"
Instrumented subsystems out of the box:
- @opentelemetry/instrumentation-http: every inbound and outbound HTTP request gets a span.
- @opentelemetry/instrumentation-express: Express middleware and route spans.
- @opentelemetry/instrumentation-socket.io: WebSocket handshakes and event handlers.
- @prisma/instrumentation: every database query.
Custom instrumentation is layered on top in apps/rgs-server/src/telemetry/
(index.ts wires the OTel SDK; metrics.ts declares the custom counters /
histograms / gauges): game-engine transitions, wallet calls, RNG determinism
hashes.
Prometheus scrape endpoint: GET /metrics exports the OpenMetrics format for any
Prometheus-compatible collector (Prometheus, Grafana Agent, VictoriaMetrics).
Custom metrics¶
The seven metrics that matter most. Labels are always scoped by operator_id so
dashboards and alerts can be per-tenant.
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| wallet_call_latency_ms | Histogram | operator_id, endpoint, status | Outbound wallet-call latency distribution |
| wallet_call_errors_total | Counter | operator_id, endpoint, rs_status | Non-OK responses; alert on non-zero rate for endpoint="win" |
| bet_to_settlement_ms | Histogram | operator_id, game_code | End-to-end time from bet accepted to settlement confirmed |
| rtp_actual_rolling_24h | Gauge | operator_id, game_code, currency | Rolling 24h actual RTP from the ledger |
| pending_wallet_jobs | Gauge | operator_id, endpoint | Count of rows in PendingWalletJob not yet completed |
| session_active | Gauge | operator_id | Concurrent live sessions per operator |
| round_state_transitions_total | Counter | operator_id, from_state, to_state | Round lifecycle transitions |
Supporting metrics (also exported)¶
| Metric | Purpose |
|---|---|
| wallet_call_total | Counter; denominator for error-rate queries |
| http_server_duration_seconds | Standard OpenTelemetry HTTP duration |
| db_query_duration_ms | Prisma query latency per model and operation |
| circuit_state | 0=closed, 1=half-open, 2=open, per (operator, endpoint) |
| idempotency_cache_hit_total | Counter; divide by request total for hit rate |
Example PromQL queries¶
Wallet bet outbound p99¶
histogram_quantile(
0.99,
sum by (le, operator_id) (
rate(wallet_call_latency_ms_bucket{endpoint="bet"}[5m])
)
)
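To make the p99 query above less opaque, here is a minimal sketch of the kind of estimate histogram_quantile produces from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. The `Bucket` shape and function name are hypothetical and only for illustration; this is not Prometheus's source code, and edge-case handling is simplified.

```typescript
// Hypothetical bucket shape: cumulative counts keyed by upper bound ("le").
interface Bucket {
  le: number;    // upper bound in ms; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

// Estimate quantile q (0..1) from cumulative buckets sorted by ascending le.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;

  // Linear interpolation inside the bucket that contains the rank.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a p99 can never be reported more precisely than the width of the bucket it lands in.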
Wallet call error rate per operator¶
sum by (operator_id, endpoint) (
rate(wallet_call_errors_total[5m])
)
/
sum by (operator_id, endpoint) (
rate(wallet_call_total[5m])
)
Bet-to-settlement p99 per game¶
RTP drift: absolute deviation from theoretical¶
Stuck retry queue¶
max by (operator_id, endpoint) (
pending_wallet_jobs
)
> 0
and on() (time() - pending_wallet_jobs_oldest_seconds > 300)
Active sessions right now¶
Circuit breaker open¶
RTP drift monitor¶
A scheduled job computes the rolling 24h actual RTP per (operator_id, game_code,
currency) and exports it as rtp_actual_rolling_24h. The computation:
SELECT
operator_id,
game_code,
currency,
SUM(total_payouts_micro)::numeric / NULLIF(SUM(total_bets_micro), 0)::numeric AS rtp
FROM rounds
WHERE settled = true
AND settled_at > now() - interval '24 hours'
GROUP BY operator_id, game_code, currency;
The alert fires at ±3σ from theoretical. σ is per-game: it depends on the per-round payout variance declared in the PAR sheet, not a global constant.
For a round whose per-round payout has standard deviation σ₁ (in RTP units),
the observed 24h RTP over N rounds has standard deviation
σ = σ₁ / sqrt(N). The PAR sheet for each game publishes σ₁:
| Game | σ₁ (per-round payout stdev) | σ at N=10k rounds | 3σ gate |
|---|---|---|---|
| Ketapola Dice (2× symmetric, 98% RTP) | ~1.00 | 0.01 (1.0pp) | ±3.0pp |
| Crash Minimal (heavy-tailed, 99% RTP) | ~3.6 (cashout-dependent) | 0.036 (3.6pp) | ±10.8pp |
Slot-style games with jackpot / free-spin features have materially higher σ₁
and the gate is derived from the per-game PAR sheet at cert submission, not
hardcoded. Per-operator overrides are read from OperatorGameConfig.configJson;
heavy-volume operators typically tighten.
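The gate arithmetic in the table can be sketched as follows. The function names are hypothetical; the only inputs are the per-round σ₁ from the PAR sheet, the 24h round count, and the theoretical RTP, all expressed in RTP units (0.01 = 1pp).

```typescript
// Standard deviation of the observed RTP over `rounds` rounds, given the
// per-round payout stdev sigmaPerRound (sigma = sigma1 / sqrt(N)).
function rtpSigma(sigmaPerRound: number, rounds: number): number {
  if (rounds <= 0) return Infinity; // no rounds: the gate is unbounded
  return sigmaPerRound / Math.sqrt(rounds);
}

// True when the observed RTP deviates from theoretical by more than k sigma.
function rtpDriftExceeded(
  actual: number,
  theoretical: number,
  sigmaPerRound: number,
  rounds: number,
  k: number = 3,
): boolean {
  return Math.abs(actual - theoretical) > k * rtpSigma(sigmaPerRound, rounds);
}
```

For example, with σ₁ = 1.00 and 10,000 rounds the gate is ±3.0pp, so an observed RTP of 94% against a theoretical 98% trips it, while 97% does not. Note that the gate widens as the round count drops, which is why low-volume operators see a looser band.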
Alert query:
abs(
  rtp_actual_rolling_24h
  - on(operator_id, game_code, currency) rtp_theoretical
)
> on(operator_id, game_code, currency) (3 * rtp_sigma_24h)
(rtp_theoretical and rtp_sigma_24h are Prometheus recording rules populated
per (operator_id, game_code, currency) from operator_game_configs and the
per-game PAR sheet; rtp_sigma_24h = sigma_per_round / sqrt(rounds_24h).)
What the alert catches, in decreasing order of frequency:
- Bot / collusion attacks: large RTP swings from abnormal player win rates.
- Weight-config errors after a change: someone saves (lowWeight=50, highWeight=30) with commission unchanged.
- RNG regressions: rare but catastrophic.
- Statistical noise: rare at a 24h window; fires a warning, not a page.
Suggested Grafana dashboard¶
Single-pane-of-glass layout for an on-call shift. Sections, top to bottom:
Top row: "is anything on fire"¶
- Big number: current RGS uptime % (last 30 days).
- Big number: wallet-call error rate (last 5 min).
- Big number: pending retry-queue depth.
- Big number: active sessions across all operators.
Second row: latency¶
- Time series: /wallet/bet p50 / p95 / p99, stacked by operator_id.
- Time series: bet-to-settlement p50 / p95 / p99.
- Time series: /v1/session p50 / p95 / p99.
Third row: error budget¶
- Burn-down: /wallet/bet 30-day error budget remaining (1% - current rate × time).
- Heatmap: errors per operator × endpoint × hour.
Fourth row: RTP drift¶
- Time series: rtp_actual_rolling_24h per operator × game × currency with a horizontal line at theoretical RTP and a band at ±3σ.
- Table: worst-performing operators (largest drift) with links to their wallet-call logs.
Fifth row: round loop¶
- Counter: round_state_transitions_total as a sankey or stacked-bar.
- Gauge: circuit-breaker states per operator.
- Counter: RNG determinism check pass/fail (should always be 100% pass).
Repo: a dashboard JSON skeleton lives at ops/grafana/rgs-overview.json (if present);
otherwise import from the Grafana gallery using these queries.
SIEM integration¶
SOC 2 Type II and MGA / AGCO / SPA supervisory reviews require audit trails to flow into a SIEM with a documented retention + retention-integrity story. The RGS emits every security-relevant event to stdout as structured JSON; shippers (Vector, Filebeat, Fluent Bit) route to the operator's SIEM.
What ships to the SIEM¶
| Event class | Source | Retention guidance | Field mapping |
|---|---|---|---|
| Wallet calls (audit ledger) | WalletCall rows → CDC stream via Debezium / logical replication | 7 years (GLI-19 minimum; UKGC LCCP requires 5) | operator_id, player_ref, endpoint, requestUuid, transactionUuid, amountMicro, status, latency_ms, error, http_status, ts |
| Session lifecycle | GameSession rows on INSERT/UPDATE | 7 years | session_id, operator_id, player_ref, game_code, currency, jurisdiction, mode, created_at, terminated_at, terminated_reason |
| Round settlement | Round rows on settlement | 7 years | round_id, session_id, operator_id, game_code, outcome_type, outcome_data, total_bets_micro, total_payouts_micro, server_seed_hash, rng_version, math_version |
| Admin-portal actions | OperatorConfigAuditLog / AdminAuditLog | 7 years | portal_user_id, operator_id, action, target_field, old_value, new_value, ip, ts |
| Authentication events | Structured log category auth | 1 year | operator_id, kid, event (login_success, login_fail, mfa_success, mfa_fail, ip_rejected, signature_invalid, replay_window_exceeded), ip, ts |
| RG limit trips | Structured log category rg; also mirrored as a webhook | 7 years | session_id, operator_id, player_ref, limit_type, stake_micro, remaining_micro, ts |
| Circuit breaker transitions | Structured log category circuit | 90 days (operational) | operator_id, state_from, state_to, failure_count, ts |
| Kill-switch activations | GlobalKillSwitch rows + log category killswitch | 7 years | operator_id, scope, activated_by, activated_at, deactivated_at, reason |
Field mapping conventions¶
- Elastic Common Schema (ECS): the JSON logs already follow ECS where possible: @timestamp, event.category, event.action, event.outcome, user.id, client.ip, http.request.method, http.response.status_code. The Fluent Bit / Vector shipper does not need a custom schema.
- Splunk CIM: a Vector remap ruleset at ops/siem/vector-splunk.vrl (shipped per-deployment, not in-repo, operator-specific index / sourcetype) projects ECS onto Splunk's Authentication / Change / Network Sessions models.
- Datadog / Sumo Logic: ingest ECS JSON directly; no projection needed.
Retention-integrity¶
The WalletCall, Round, GameSession, and OperatorConfigAuditLog tables
are append-only: no UPDATE to outcome-bearing columns, no DELETE
except via the per-jurisdiction retention purge job (off by default; must be
explicitly enabled per-operator with a documented retention policy).
The AuditChain service (services/AuditChain.ts) hash-chains every
WalletCall row per-operator and signs a daily anchor, so a SIEM that
ingests the WalletCall stream can recompute the chain and detect any
upstream tampering. The daily anchor signature is logged to SIEM under
category auditchain.anchor.
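A hash chain of this kind can be recomputed on the SIEM side with very little code. The sketch below is an assumption-heavy illustration, not the contents of services/AuditChain.ts: the row shape, the SHA-256 choice, the all-zero genesis value, and the `prevHash + payload` link rule are all hypothetical, and the daily anchor signature is left out.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real WalletCall serialization is not shown here.
interface ChainedRow {
  payload: string;  // canonical serialization of one WalletCall row
  prevHash: string; // hash of the previous row in this operator's chain
  hash: string;     // SHA-256 over prevHash followed by payload, hex-encoded
}

// Assumed genesis value for the first row of a chain.
const GENESIS = "0".repeat(64);

function linkHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns a new chain with one more row linked to the current tail.
function appendRow(chain: ChainedRow[], payload: string): ChainedRow[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  return [...chain, { payload, prevHash, hash: linkHash(prevHash, payload) }];
}

// Returns the index of the first row that fails verification, or -1 if intact.
function firstBrokenLink(chain: ChainedRow[]): number {
  let prev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const row = chain[i];
    if (row.prevHash !== prev || row.hash !== linkHash(prev, row.payload)) {
      return i;
    }
    prev = row.hash;
  }
  return -1;
}
```

Editing, removing, or reordering any row changes every hash from that point on, so a verifier only needs the stream itself plus a trusted anchor to detect tampering.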
What does NOT ship to the SIEM¶
/metricsPrometheus scrape is separate, that's metrics, not events.- Player PII, the RGS does not have any.
playerRefis opaque. - HMAC secrets, session JWT secrets, master keys, never logged; attempts
to log them are redacted by
telemetry/logger.ts::redactor. - Request bodies beyond the idempotency key and status, bodies can carry commercially sensitive config; if a licensee needs full-body logging for a specific incident, it's a per-operator opt-in with a 30-day TTL.
SLO burn-rate alerts¶
Following the multi-window, multi-burn-rate pattern from the Google SRE workbook. For each SLO, two parallel alerts at different windows fire at different urgency.
Example: /wallet/bet error budget (0.1% over 30 days)¶
Fast burn (page immediately):
Burn rate: 14.4× (consumes 2% of the 30-day budget in 1 hour)
Windows: 1h long × 5m short
Slow burn (ticket for daytime fix):
Burn rate: 1× (would exhaust full budget in 30 days)
Windows: 6h long × 30m short
Severity: SEV-3 ticket
PromQL sketch (fast):
(
sum(rate(wallet_call_errors_total{endpoint="bet"}[1h]))
/
sum(rate(wallet_call_total{endpoint="bet"}[1h]))
) > (14.4 * 0.001)
AND
(
sum(rate(wallet_call_errors_total{endpoint="bet"}[5m]))
/
sum(rate(wallet_call_total{endpoint="bet"}[5m]))
) > (14.4 * 0.001)
Both windows must cross the threshold simultaneously; this sharply reduces false positives from short, noisy spikes.
Apply the same pattern to latency SLOs: replace the error-rate ratio with the
fraction of requests slower than the latency target.
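The two-window condition reduces to a small predicate. This is a sketch with hypothetical names; the error ratios are assumed to be computed elsewhere (for example by the PromQL above) and passed in as fractions.

```typescript
// One burn-rate rule: how many times faster than "exactly on budget"
// errors may arrive before alerting, and the SLO's error budget.
interface BurnRateRule {
  burnRate: number;    // e.g. 14.4 for the fast-burn page
  errorBudget: number; // e.g. 0.001 for a 0.1% budget
}

// Fires only when BOTH the long and the short window exceed the threshold.
function burnRateAlert(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  rule: BurnRateRule,
): boolean {
  const threshold = rule.burnRate * rule.errorBudget;
  return longWindowErrorRatio > threshold && shortWindowErrorRatio > threshold;
}

// Hours until the whole budget is gone if the burn rate is sustained.
function hoursToExhaustion(burnRate: number, sloWindowDays: number = 30): number {
  return (sloWindowDays * 24) / burnRate;
}
```

With a 14.4× rule and a 0.1% budget the threshold is a 1.44% error ratio. A spike that shows up in the 5m window but not the 1h window does not fire, and neither does an old incident that still inflates the 1h window after the 5m window has recovered.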
Log structure¶
All logs are JSON lines. No free-form strings. Every record carries the correlation fields that let you join logs, traces, and metrics together.
{
"ts": "2026-04-23T14:32:05.148Z",
"level": "info",
"msg": "wallet call committed",
"operator_id": "op_abc",
"session_id": "5bb8...",
"round_id": "rnd_8a2c...",
"bet_id": "bet_9f3a...",
"trace_id": "4bf92f35...",
"span_id": "00f067aa...",
"endpoint": "bet",
"rs_status": "RS_OK",
"latency_ms": 42
}
Scrubbing¶
Three classes of field never appear in logs:
- Raw secrets (apiSecret, walletSecret, JWT tokens).
- Signature header values.
- PII. playerRef is logged because it is opaque by definition; real PII never reaches the RGS. If an operator accidentally sends a raw email as playerRef, the log scrubber detects it heuristically and masks it.
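A scrubber along these lines might look like the sketch below. The key list, the placeholder strings, and the email heuristic are assumptions made for illustration; the actual rules live in telemetry/logger.ts and may differ.

```typescript
// Hypothetical list of field names that must never be logged in the clear.
const SECRET_KEYS = new Set(["apiSecret", "walletSecret", "token", "signature"]);

// Deliberately loose heuristic: something@something.tld with no whitespace.
const EMAIL_LIKE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Returns a copy of the log record with secrets redacted and
// email-looking playerRef values masked. The input is not mutated.
function scrub(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (SECRET_KEYS.has(key)) {
      out[key] = "[REDACTED]";
    } else if (key === "playerRef" && typeof value === "string" && EMAIL_LIKE.test(value)) {
      out[key] = "[MASKED]";
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Running the scrubber on a copy rather than mutating the record keeps the original object usable by the request handler, at the cost of one shallow allocation per log line.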
Retention¶
Application logs: 30 days hot, 1 year cold (S3 / equivalent).
Audit logs (WalletCall, OperatorConfigAuditLog): 7 years, never purged.
Trace structure¶
Every request produces a trace. Important spans and their attributes:
| Span name | Attributes | Description |
|---|---|---|
| HTTP POST /v1/session | operator_id, http.status_code, http.route | Inbound session creation |
| session.create | operator_id, session_id, player_ref, currency | Session service layer |
| socket.handshake | operator_id, session_id, player_ref | WebSocket auth |
| round.start | operator_id, round_id, session_id, nonce | Round state machine entry |
| bet.place | operator_id, bet_id, round_id, amount_micro, side | Player bet intake |
| wallet.call | operator_id, endpoint, attempt, rs_status, latency_ms | Outbound wallet HTTP |
| round.roll | round_id, outcome_side, outcome_sum | RNG computation |
| round.settle | round_id, bets_won, bets_lost, total_payouts_micro | Settlement loop |
| rng.verify | round_id, match | Determinism self-check |
| prisma.query | db.statement, db.operation, db.model | Every DB query |
Correlation¶
trace_id is attached to:
- Every log line emitted during the trace.
- Every outbound wallet HTTP call, as the traceparent header (W3C Trace Context).
- Every Prisma query, via the Prisma OTel integration.
- Every metric recorded during the span, as an exemplar pointing to the trace.
This means a single click on a trace exemplar in the "slow bet latency" metric leads straight to the Prisma query, the wallet HTTP call, and the application log for that one bet.
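For wallet integrators who want to join their own logs to these traces, the traceparent header has a fixed shape defined by W3C Trace Context: version, 32-hex trace id, 16-hex span id, and 2-hex flags, separated by hyphens. The helper names below are hypothetical; in the RGS itself the OpenTelemetry SDK sets the header, so this sketch is only for reading it on the receiving side.

```typescript
// Version 00 traceparent: 00-<trace-id>-<parent-id>-<flags>, lowercase hex.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for malformed headers or all-zero ids, which the spec forbids.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (m === null) return null;
  const traceId = m[1];
  const spanId = m[2];
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(m[3], 16) & 0x01) === 0x01;
  return { traceId, spanId, sampled };
}
```

A wallet backend that logs the parsed traceId next to its own request id gives both sides a shared key when investigating a slow or failed call.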
See also¶
- security.md: the SLO targets these metrics defend.
- wallet-api.md: retry semantics that drive pending_wallet_jobs.
- Per-game PAR sheet (e.g. games/ketapola-dice/docs/par-sheet.md, games/crash-minimal/docs/par-sheet.md): theoretical RTP values used by the drift alert.
- apps/rgs-server/src/telemetry/: the instrumentation source.