Model Observability & Evaluation
Mock-up · synthetic data | Evaluation Store · live
Performance, training-loop health, and AI-output consumption across all 27 CAT Synapse models. Five tiers — drill from a single event to the whole fleet, then read the AI-generated insight layer.
T1 Per-Event Scorecard
T2 Per-Model Drill-Down
T3 Health Check
T4 Training-Loop Monitor
AI CAT AI Model Insights
Peril / Severity
Wildfire · HIGH
Event class: Likely
Policyholders served
992
100% autonomous · no human gate
Versions tracked
v1 → v3
3 NWS reissues · C-4
Models engaged
9
of 27 · wildfire model chain
Prediction vs. observed outcome — per model
| Model | Predicted | Observed | Confidence | Error |
|---|---|---|---|---|
Reading this: error is the gap between prediction and ground truth from field reports & claims. Green = within tolerance, amber = notable, red = material miss requiring review.
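The colour rule above can be sketched as a small classifier. This is a minimal illustration only: the mock-up does not state numeric tolerances, so `TOLERANCE` and `MATERIAL_MISS` are hypothetical values, as is the function name `error_band`.

```python
# Hypothetical thresholds: the mock-up gives no numeric tolerances.
TOLERANCE = 0.5       # absolute error at or below this is "green"
MATERIAL_MISS = 1.5   # absolute error above this is "red"


def error_band(predicted: float, observed: float,
               tolerance: float = TOLERANCE,
               material: float = MATERIAL_MISS) -> str:
    """Map the gap between prediction and ground truth to a scorecard colour."""
    error = abs(predicted - observed)
    if error <= tolerance:
        return "green"   # within tolerance
    if error <= material:
        return "amber"   # notable
    return "red"         # material miss requiring review


# The event calibration point shown on the scorecard (CRS 7.4 vs. impact 7.1).
print(error_band(7.4, 7.1))  # green
```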
Event calibration point
Predicted CRS 7.4 vs. observed impact 7.1 — within tolerance
Consumption — confidence gate
0.91
mean confidence
HIGH confidence band
Full autonomous response executed — alerts + adjuster pre-positioning, no qualification.
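One way to picture the confidence gate is a lookup from mean confidence to a response tier. In this sketch the band edges (`HIGH_MIN`, `MEDIUM_MIN`) and the MEDIUM and LOW responses are assumptions; the mock-up only shows that a mean of 0.91 lands in the HIGH band and triggers the full autonomous response.

```python
from statistics import mean

# Hypothetical band edges; only "0.91 is HIGH" comes from the mock-up.
HIGH_MIN = 0.85
MEDIUM_MIN = 0.60

RESPONSES = {
    "HIGH": "full autonomous response: alerts + adjuster pre-positioning",
    "MEDIUM": "alerts issued with qualification (assumed)",
    "LOW": "escalate to human review (assumed)",
}


def confidence_band(confidences):
    """Return (mean confidence, band) for the models engaged on an event."""
    m = mean(confidences)
    if m >= HIGH_MIN:
        band = "HIGH"
    elif m >= MEDIUM_MIN:
        band = "MEDIUM"
    else:
        band = "LOW"
    return m, band


# Illustrative per-model confidences (synthetic).
m, band = confidence_band([0.95, 0.90, 0.88])
print(band, "->", RESPONSES[band])
```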
Models healthy
23/27
green status
Needs attention
3
amber · drift or accuracy dip
Circuit-breaker open
1
M-12 · autonomy suspended
Pending rollbacks
1
awaiting confirmation
Peril models ranked by false-negative rate — not overall accuracy (spec §3.2)
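Ranking by false-negative rate means ordering models by FN / (FN + TP), the share of real events a model missed, rather than by overall accuracy. A minimal sketch follows; the model labels and counts are synthetic placeholders, and `PerilCounts` is a hypothetical record type.

```python
from typing import NamedTuple


class PerilCounts(NamedTuple):
    model: str
    true_pos: int    # real events the model flagged
    false_neg: int   # real events the model missed


def false_negative_rate(c: PerilCounts) -> float:
    """FNR = FN / (FN + TP); 0.0 when no positive events were observed."""
    positives = c.true_pos + c.false_neg
    return c.false_neg / positives if positives else 0.0


def rank_by_fnr(counts):
    """Worst (highest miss rate) first."""
    return sorted(counts, key=false_negative_rate, reverse=True)


# Synthetic counts for three placeholder models.
sample = [
    PerilCounts("M-A", true_pos=90, false_neg=10),
    PerilCounts("M-B", true_pos=45, false_neg=15),
    PerilCounts("M-C", true_pos=99, false_neg=1),
]
for c in rank_by_fnr(sample):
    print(c.model, round(false_negative_rate(c), 3))
```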
Fleet status grid — 27 models by category
Legend: Healthy · Attention · Circuit-breaker open (click a model to open its drill-down)
Updates · last 30d
41
per-event · asynchronous
Improved next event
37
90% positive trajectory
Auto-rolled back
3
degraded → reverted
Held — low ground truth
1
thin claims evidence
Weight-delta magnitude per update
Stability check: deltas should stay small and flat. A growing series signals an unstable loop — oscillation alert fires above the dashed line.
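The stability check can be sketched as follows. It assumes "magnitude" means the L2 norm of the weight change, which the mock-up does not specify; `alert_line` stands in for the dashed line, and the three-update window for detecting growth is likewise an assumption.

```python
import math


def delta_magnitude(old_weights, new_weights) -> float:
    """L2 norm of the weight change (assumed definition of 'magnitude')."""
    return math.sqrt(sum((n - o) ** 2 for o, n in zip(old_weights, new_weights)))


def stability_flags(magnitudes, alert_line: float, window: int = 3):
    """Return (oscillation_alert, growing) for a series of update magnitudes.

    oscillation_alert: any magnitude is above the alert line.
    growing: the last `window` magnitudes are strictly increasing.
    """
    oscillation_alert = any(m > alert_line for m in magnitudes)
    recent = magnitudes[-window:]
    growing = len(recent) == window and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    return oscillation_alert, growing


print(delta_magnitude([0.0, 0.0], [3.0, 4.0]))          # 5.0
print(stability_flags([0.1, 0.1, 0.1], alert_line=0.5))  # (False, False)
```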
Update log — did it help?
| Update | Model | Δ mag. | GT quality | Next-event verdict | State |
|---|---|---|---|---|---|
Post-update performance trajectory
For each update: model performance on the next comparable event. The decisive training-health test — a dip after an update triggers automatic rollback.
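A minimal sketch of how an update might resolve to the states tallied on this tier (applied, held, rolled back). The quality floor `MIN_GT_QUALITY`, the score scale, and the function name are hypothetical; only the rules themselves (hold on thin ground truth, roll back on a next-event dip) come from the dashboard.

```python
from enum import Enum
from typing import Optional


class UpdateState(Enum):
    APPLIED = "applied"
    HELD = "held"
    ROLLED_BACK = "rolled back"


MIN_GT_QUALITY = 0.5  # hypothetical floor for ground-truth quality


def resolve_update(gt_quality: float,
                   score_before: float,
                   score_next: Optional[float],
                   dip_tolerance: float = 0.0) -> UpdateState:
    """Decide an update's state from ground-truth quality and next-event score."""
    if gt_quality < MIN_GT_QUALITY:
        return UpdateState.HELD          # thin evidence: never applied
    if score_next is not None and score_next < score_before - dip_tolerance:
        return UpdateState.ROLLED_BACK   # degraded on the next comparable event
    return UpdateState.APPLIED           # improved, unchanged, or verdict pending


print(resolve_update(0.9, score_before=0.90, score_next=0.86).value)  # rolled back
```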
Governance actions — recent
Auto-rollback executed M-08
2026-05-19 14:22 · update U-7731
Heat Wave DSTCE accuracy fell 4.1% on the next event after a CIL update. Prior validated weights restored automatically.
Circuit-breaker tripped M-12
2026-05-17 09:05 · drift threshold
Special Events DSTCE input drift exceeded limit. Autonomy suspended — predictions still recorded, alerts widened, escalated to human review. Fail-safe, not fail-silent.
Update held M-18
2026-05-16 18:40 · low GT quality
ProxDelta update derived from an event with sparse field reports. Held pending more ground truth rather than applied on thin evidence.
Re-validation passed M-01
2026-05-15 11:00 · periodic benchmark
Wildfire DSTCE re-validated against held-out benchmark independent of the CIL. Slow-drift check clear.
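The "fail-safe, not fail-silent" behaviour in the circuit-breaker entry above can be illustrated with a small state holder. This is a sketch under stated assumptions: the class, its field names, and the drift values are hypothetical, and alert widening is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class CircuitBreaker:
    drift_limit: float
    is_open: bool = False
    recorded: list = field(default_factory=list)     # every prediction, always
    escalations: list = field(default_factory=list)  # notes for human review

    def observe_drift(self, drift: float) -> None:
        """Trip the breaker the first time input drift exceeds the limit."""
        if drift > self.drift_limit and not self.is_open:
            self.is_open = True
            self.escalations.append(
                f"drift {drift:.2f} exceeded limit {self.drift_limit:.2f}"
            )

    def handle(self, prediction: float) -> str:
        """Record the prediction regardless; route it by breaker state."""
        self.recorded.append(prediction)
        return "human_review" if self.is_open else "autonomous"


breaker = CircuitBreaker(drift_limit=0.30)
print(breaker.handle(6.8))   # autonomous
breaker.observe_drift(0.45)
print(breaker.handle(7.2))   # human_review
```

Predictions keep flowing into `recorded` while the breaker is open, so the Evaluation Store still sees the suspended model's output.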
CAT AI Model Insights
An LLM-generated analyst layer that reads the Evaluation Store and explains, in plain language, what the observability data shows for a selected model. Every insight is grounded — it cites the specific metrics behind it. This layer is read-only: it interprets and explains, it does not make decisions or retrain models.
This Model
Health Check
How to read this layer. Insights are generated by a language model from synthetic Evaluation Store data and are illustrative only. The LLM does not gate autonomous actions and is not part of the CIL retraining loop — it is an interpretation aid. Every claim links to its source records; always confirm against the underlying tier before acting.
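The grounding rule (every claim links to its source records) could be enforced with a check like the one below. The `Claim` type, the record IDs, and the store layout are hypothetical; the sketch only shows the idea of rejecting claims that cite nothing or cite records absent from the Evaluation Store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    source_ids: tuple  # IDs of the Evaluation Store records behind the claim


def ungrounded_claims(claims, store):
    """Return claims with no citation or with a citation missing from the store."""
    return [
        c for c in claims
        if not c.source_ids or any(sid not in store for sid in c.source_ids)
    ]


# Synthetic store and claims for illustration.
store = {"rec-1": {"metric": "accuracy"}, "rec-2": {"metric": "drift"}}
claims = [
    Claim("Accuracy dipped after the last update.", ("rec-1",)),
    Claim("Input drift is rising.", ("rec-2", "rec-9")),  # rec-9 not in store
    Claim("The model looks fine.", ()),                   # no citation
]
for c in ungrounded_claims(claims, store):
    print("ungrounded:", c.text)
```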
CAT Sentinel — Model Observability & Evaluation.
Mock-up integrated to CAT Sentinel UI standards. All figures are synthetic, for design illustration only.
Five tiers: Per-Event Scorecard · Per-Model Drill-Down · Health Check · Training-Loop Monitor · CAT AI Model Insights.
Confidential · CAT Sentinel · May 2026.