Nervous Machine ran on ibm_fez for 106 days from 2026-02-12 to 2026-05-28, registering 61,708 predictions — each one timestamped before the next-day calibration telemetry it targets — and scoring them against actuals as they arrived. The clusters were discovered from raw telemetry alone, against neutral priors, with no coupling-map input and no vendor-side hints. The scorecard, the benchmark against standard tools, and the calibration breakdown are below.
A cluster is registered when a group of qubits shows persistent within-group correlation above a noise-floor null distribution. Confirmation requires at least two distinct coordinated-motion events at least 14 days apart, each with within-cluster |r| above 0.5 in a 7-day rolling window. The two single-event clusters at the bottom of the table were registered close enough to the start of the collection window that earlier events could be present in the archive but outside what we pulled.
| Cluster | Type | Size | Members | Events | T1 co-moves | Peak \|r\| |
|---|---|---|---|---|---|---|
| 78e15a79 | chain | 2 | Q14, Q85 | 4 | 1/4 | 0.999 |
| 1cee4ce7 | chain | 2 | Q20, Q153 | 3 | 2/3 | 0.997 |
| 3625b1c4 | chain | 2 | Q58, Q109 | 3 | 1/3 | 0.984 |
| 49b87d14 | chain | 2 | Q81, Q142 | 3 | 0/3 | 0.975 |
| 8ef82ebd★ | chain | 9 | Q17, Q24, Q36, Q52, Q53, Q55, Q61, Q73, Q91 | 3 | 2/3 | 0.811 |
| d7dd4ee8 | regional | 2 | Q115, Q132 | 2 | 0/2 | 0.999 |
| f441a7fb | regional | 2 | Q6, Q8 | 2 | 0/2 | 0.994 |
| 72cad12f | chain | 3 | Q1, Q7, Q135 | 2 | 0/2 | 0.993 |
| 1fa2d19a | chain | 2 | Q22, Q136 | 2 | 1/2 | 0.988 |
| 6786bc2e | chain | 2 | Q103, Q113 | 2 | 0/2 | 0.982 |
| 51e064db | chain | 2 | Q9, Q146 | 2 | 0/2 | 0.982 |
| ca1cdf2a | chain | 4 | Q119, Q124, Q131, Q143 | 2 | 0/2 | 0.981 |
| 2f82883c | chain | 2 | Q31, Q75 | 2 | 0/2 | 0.974 |
| 4c96bbb9 | chain | 2 | Q15, Q116 | 2 | 1/2 | 0.953 |
| fe542648 | chain | 2 | Q66, Q117 | 2 | 1/2 | 0.939 |
| 16594770 | chain | 2 | Q40, Q100 | 1 | 0/1 | 0.992 |
| 6e26122e | regional | 2 | Q84, Q107 | 1 | 0/1 | 0.990 |
★ The 9-qubit cluster is the one we wrote up in detail on Substack: How uncertainty evolved around a 9-qubit cluster. T1 co-moves counts events with simultaneous T1 movement; the remainder are readout-only.
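The confirmation rule stated above the table can be sketched roughly as follows. This is a minimal illustration, not the substrate's implementation. It assumes two daily telemetry series of equal length, treats each contiguous run of above-threshold windows as one event, and measures the 14-day separation between event start days. All function names are hypothetical, and the noise-floor null test is left out.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rolling_abs_r(x, y, window=7):
    """|r| for every 7-day window; entry i covers days i through i + window - 1."""
    return [abs(pearson(x[i:i + window], y[i:i + window]))
            for i in range(len(x) - window + 1)]

def find_events(abs_r, threshold=0.5):
    """Start indices of contiguous runs where |r| exceeds the threshold (assumed event definition)."""
    starts, inside = [], False
    for i, r in enumerate(abs_r):
        if r > threshold and not inside:
            starts.append(i)
        inside = r > threshold
    return starts

def is_confirmed(event_starts, min_gap_days=14):
    """True if some pair of events starts at least min_gap_days apart."""
    return bool(event_starts) and max(event_starts) - min(event_starts) >= min_gap_days
```

Under these assumptions, a pair of qubits with a single above-threshold run is never confirmed, which is the situation of the two single-event clusters in the table.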
The substrate's event-window detector was benchmarked against five standard methods on the same data: the 9-qubit cluster's telemetry over all 106 days, with T1 and readout error pulled from MongoDB. Event-window AUC measures how cleanly each method separates inside-event days from outside-event days for the 12-day March cluster event.
| Method | What it natively produces | Event-window AUC |
|---|---|---|
| Nervous Machine (rolling bipartite) | Per-edge causal certainty + structural signal | 0.718 |
| Spectral clustering | Static partition once you choose k | diagnostic |
| Hierarchical clustering | Dendrogram + cluster threshold | diagnostic |
| PC algorithm (causal-learn) | Static DAG over variables | diagnostic |
| Per-qubit isolation forest | Per-(qubit, day) outlier flag | 0.369 |
| Per-day z-score | Per-(qubit, day) anomaly flag | 0.344 |
Spectral, hierarchical, and PC are diagnostic methods that produce partitions or graphs, not real-time event detectors — they're not directly AUC-comparable. The per-qubit detectors are. Both score below 0.5 on the event because the event manifests as correlated structure across the cluster; per-qubit detectors observe each qubit independently and miss the correlation by construction.
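For readers who want to reproduce an event-window AUC on their own detector output, here is a minimal sketch. It assumes one detector score per day and a boolean marking whether that day falls inside the event window. The function name is hypothetical, and how the substrate reduces per-edge signals to a daily score is not covered here.

```python
def event_window_auc(scores, in_event):
    """ROC AUC by pairwise comparison.

    Returns the probability that a randomly chosen inside-event day scores
    higher than a randomly chosen outside-event day; ties count as 0.5.
    """
    inside = [s for s, flag in zip(scores, in_event) if flag]
    outside = [s for s, flag in zip(scores, in_event) if not flag]
    if not inside or not outside:
        raise ValueError("need at least one inside-event and one outside-event day")
    wins = 0.0
    for a in inside:
        for b in outside:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(inside) * len(outside))
```

An AUC of 0.5 corresponds to random ranking. A value below 0.5 means the detector tends to score inside-event days lower than outside-event days.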
On the A/B bipartite split inside the 9-qubit cluster, spectral and hierarchical clustering recovered the partition exactly (Adjusted Rand Index = 1.000) when given the right 9-qubit correlation matrix — that finding isn't the substrate's differentiator. The differentiator is that the substrate flagged the two qubits the bipartite analysis would later split on (Q55 and Q61) by leaving their certainty Z stalled in the 0.40–0.65 band before any bipartite analysis ran.
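The Adjusted Rand Index quoted above compares two labelings of the same items and ignores how the groups are named. A minimal stdlib sketch of the standard pair-counting formula follows. The function name is illustrative, and the example labels in the usage note are placeholders, not the actual A/B membership of the nine qubits.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("labelings must have equal length of at least 2")
    # Pairs of items that share a group in both labelings, in A only, in B only.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_pairs - expected) / (max_index - expected)
```

For instance, `adjusted_rand_index(["A", "A", "B", "B"], [1, 1, 0, 0])` treats the two labelings as identical partitions, because only co-membership matters.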
Per-qubit failure flags typically lose precision as you extrapolate further out. The substrate's PERSISTENT_FAIL flag doesn't. Below is the breakdown of verdict precision against the base rate of each actual state at horizon, across 59,192 scored predictions on all 156 qubits. Lift = precision divided by the base rate at horizon — it's the right way to compare alerting signals, and it's what separates a real alert from a re-statement of the prior.
| Predicted verdict | 1d precision | 1d base rate | 1d lift | 7d precision | 7d base rate | 7d lift |
|---|---|---|---|---|---|---|
| PERSISTENT_FAIL | 71% | 7.5% | 9.5× | 71% | 8.2% | 8.7× |
| DEGRADING | 43% | 11.4% | 3.8× | 11% | 11.7% | 0.9× |
Three operational tiers fall out of this. PERSISTENT_FAIL is a real alert at both horizons — 9.5× lift at 1d and 8.7× lift at 7d on the same 71% precision is genuinely unusual for a per-qubit failure flag. DEGRADING is a usable short-horizon signal at 1d (3.8× lift) — useful for next-day scheduling decisions like deprioritizing flagged qubits for sensitive work — but the lift collapses to 0.9× at 7d, where the flag stops carrying signal. TRANSIENT_ANOMALY and DEGRADING extrapolated to 7d are roadmap — the substrate produces them, but their precision needs recalibration before they enter a paid alerting product.
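The lift figures can be rechecked from the precision and base-rate columns of the table. The sketch below uses only the rounded percentages printed above, so lifts computed from raw counts could differ slightly in the last digit. The names `lift` and `TABLE` are illustrative.

```python
def lift(precision_pct, base_rate_pct):
    """Lift = precision divided by the base rate at the same horizon."""
    return precision_pct / base_rate_pct

# (verdict, horizon) -> (precision %, base rate %), copied from the table above.
TABLE = {
    ("PERSISTENT_FAIL", "1d"): (71, 7.5),
    ("PERSISTENT_FAIL", "7d"): (71, 8.2),
    ("DEGRADING", "1d"): (43, 11.4),
    ("DEGRADING", "7d"): (11, 11.7),
}

for (verdict, horizon), (precision, base_rate) in TABLE.items():
    print(f"{verdict:16s} {horizon}: lift = {lift(precision, base_rate):.1f}x")
```

A lift near 1.0, as with DEGRADING at 7d, means the flag is about as informative as the base rate alone.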
WAIT_ENVIRONMENTAL is a different signal class entirely. It's rare by design — it fires when multiple qubits across the chip show signs of an environmental event simultaneously, and chip-wide events are uncommon. The May 8–16 window, in which 85 of 156 qubits lost ≥30% of T1 over a week, is the canonical example. Per-qubit horizon precision on the flag is noisy by construction because environmental events recover stochastically; the chip-wide co-fire pattern is the right diagnostic. The operational action it implies is wait, not reroute. Some of the apparent DEGRADING → HEALTHY → PERSISTENT_FAIL flapping observed in individual qubits during environmental windows is attributable to those qubits transiting an environmental event that WAIT_ENVIRONMENTAL was already tracking at the chip level.
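The chip-wide co-fire diagnostic can be illustrated with a simple count of how many qubits lost a given fraction of T1 over a window. This is a sketch under assumptions: the function names are hypothetical, the 30% loss threshold is taken from the May example above, and the rule the substrate actually uses to fire WAIT_ENVIRONMENTAL is not specified here.

```python
def t1_loss_fraction(t1_start, t1_end):
    """Fractional T1 loss over the window (positive means T1 dropped)."""
    return (t1_start - t1_end) / t1_start

def co_fire_fraction(t1_start_by_qubit, t1_end_by_qubit, min_loss=0.30):
    """Return (fraction of qubits at or above min_loss, sorted list of those qubits)."""
    affected = sorted(
        q for q in t1_start_by_qubit
        if t1_loss_fraction(t1_start_by_qubit[q], t1_end_by_qubit[q]) >= min_loss
    )
    return len(affected) / len(t1_start_by_qubit), affected
```

For scale, 85 of 156 qubits is a co-fire fraction of about 0.54. A per-qubit detector sees each of those drops in isolation, whereas this count treats them as one chip-level observation.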
The dashboard tracks two evaluation modes separately, and this page names both on purpose. State forecasts, which predict the verdict label a qubit will land in at horizon, score 57% overall state-match across all five states at 1d. That's the operational product the alerts are built on. Value forecasts, which predict the exact T1 or readout-error number, score 51% overall value-accuracy. The value bar is harder, and it serves as a residual diagnostic (Q53's +11.82σ z_residual on May 12 was the canonical example) rather than as the basis of the alerting product. Leading with state-match without naming the value-accuracy bar would obscure how the product is actually evaluated.
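The two evaluation modes can be sketched as follows. Both definitions are assumptions made for illustration: state-match is taken to be plain label agreement at horizon, and `z_residual` is taken to be the forecast error divided by the forecast's own standard deviation. The page does not spell out the exact formulas, including the tolerance behind value-accuracy.

```python
def state_match_rate(predicted, actual):
    """Fraction of predictions whose verdict label equals the actual label at horizon."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(predicted)

def z_residual(actual_value, predicted_value, predicted_sigma):
    """Standardized residual: forecast miss in units of the predicted standard deviation."""
    return (actual_value - predicted_value) / predicted_sigma
```

Under this reading, a large positive residual such as +11.82σ says the observed value landed far above what the forecast considered plausible, which is useful as a diagnostic even when the state label was predicted correctly.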
The 9-qubit cluster 8ef82ebd (Q17, Q24, Q36, Q52, Q53, Q55, Q61, Q73, Q91) was the original case. It had a 12-day March event window, a bipartite A/B split in the readout-error correlation structure, and a mechanism that localized to shared readout electronics rather than coherence. A second, larger bipartite event in May was visible in the substrate's signal before the calibration verdicts caught up to it.
The substrate's certainty for two of the nine members (Q55 at 0.413, Q61 at 0.522) sat in the stalled band 0.40–0.65 despite 50+ updates. Those are exactly the two qubits that straddle the bipartite A/B partition the cluster later split on. The framework raised its hand about the right partition before the partition was identified by other means. The full piece is on Substack: How uncertainty evolved around a 9-qubit cluster.
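The stalled-band check described above amounts to a simple filter. In this sketch the Q55 and Q61 certainty values come from the text, while the function name, the update counts, the inclusive band edges, and the third entry are placeholders invented for illustration.

```python
def stalled_members(certainty, updates, band=(0.40, 0.65), min_updates=50):
    """Members whose certainty sits inside the band after at least min_updates updates."""
    low, high = band
    return sorted(
        q for q, z in certainty.items()
        if low <= z <= high and updates.get(q, 0) >= min_updates
    )

certainty = {
    "Q55": 0.413,        # from the text
    "Q61": 0.522,        # from the text
    "Q_example": 0.90,   # hypothetical member with settled certainty
}
updates = {"Q55": 50, "Q61": 50, "Q_example": 50}  # placeholder counts ("50+" in the text)

print(stalled_members(certainty, updates))
```

The update-count guard matters: a certainty in the same band after only a handful of updates would simply be unconverged, not stalled.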
ibm_fez is the public proof. The substrate is built to run on any device that exposes a per-element calibration feed — quantum hardware, classical silicon, HPC cooling fleets, or model-based forecasting systems. The methodology stays the same. The integration is what's bespoke.
Shared-resource clusters surface from telemetry alone, against neutral priors. No coupling map, no vendor hints, no topology assumptions.
Every causal relationship carries its own evolving certainty. The substrate tells you which conclusions it stands behind and which it doesn't.
Structural signals are first-class outputs. The kinds of events per-qubit detectors are blind to are exactly what the substrate is built to surface.
Raw data stays on-site. Calibrated findings cross organizational boundaries. The production implementation is what we license.
The walk-through pulls current telemetry, runs through what the substrate is flagging right now, and goes deeper than the page on whichever findings you want to examine.