PlanningReport: first pass at operator notes #9116

jgallagher · 2025-09-29T17:49:48Z

We chatted about this briefly during today's update watercooler (recorded). Some thoughts and issues from that discussion:

Many of our "something might be wrong" notes fire normally during an upgrade, but only transiently. (E.g., we warn if a sled is missing from inventory, and that will be true for sleds that are in the process of rebooting.)
The planner report alone doesn't really have enough context to make intelligent decisions like "is this sled missing from inventory expected because we just updated it?"
We do have some notes that are of the same "this is bad" class as asserts: things that indicate some internal invariant is violated, and recovery requires support intervention.

This PR only adds an in-memory version of the notes, and it attempts to mostly not emit things that are expected during updates. I'm sure I missed some in both directions, but it's a starting point. A proposal is to bikeshed this PR a bit until we're happy enough, knowing that as of this PR, these notes are only visible via omdb. If we can land this and start using it during updates, we might get some idea of whether these are worth surfacing to the external API at all in the R17 time frame, or whether we want to get more update experience first. If we do want to surface these, we'll need to serialize them as part of the blueprint, add them to the external API, and maybe add a way to see the notes from the planner task and not just the most recent target blueprint. All of those should be straightforward.

These should be (a) serialized in the db along with the blueprint they came from and (b) available in the external API, but I wanted to get a first pass out so folks could look over the way I've pulled these out. I'm not at all sure this is the best way to generate these strings, and even if it is, they undoubtedly need some serious wordsmithing for external consumption and consistency.

smklein · 2025-09-29T18:48:32Z

dev-tools/omdb/src/bin/omdb/db/blueprints.rs


+    let operator_notes = report.operator_notes().into_notes();
+    if !operator_notes.is_empty() {
+        println!("\nnotes for customer operator:");


tiniest of nits; you think we could just say "notes for operator", here and below? "customer operator" feels like an unnecessary qualifier

smklein · 2025-09-29T18:53:58Z

nexus/types/src/deployment/planning_report/operator_notes.rs

+                    "{nsleds_error_manifest} sled{} have unexpected errors \
+                     on their boot disks",


WDYT about:

Suggested change

-                    "{nsleds_error_manifest} sled{} have unexpected errors \
-                     on their boot disks",
+                    "{nsleds_error_manifest} sled{} have unexpected errors \
+                     and are unable to access their stored configuration",

(It happens to be stored on the boot disk, but the problem is that we can't access the configuration for zones when we want it)

smklein · 2025-09-29T19:11:50Z

nexus/types/src/deployment/planning_report/operator_notes.rs

+
+#[derive(Debug, Clone)]
+pub struct PlanningOperatorNotes {
+    notes: Vec<String>,


I'm seeing a few types of notes falling out from this initial construction:

"This should never happen" (e.g., orphaned disks, zombie sleds)

"This could happen, and is blocking planning from proceeding" (e.g., no target set, we don't have a disk to place a necessary NTP service, etc.)

"This could happen, and should be transient, but may be an issue if it's recurring" (e.g., something not seen in inventory)

"This is informational, and may or may not be expected behavior" (e.g., converting sleds to known versions)

I wonder if it's worth pulling these categories apart to highlight certain higher-value messages (e.g., "you're watching update status, but never set a target!") from lower-value messages (e.g., "we did work, and things progressed as expected").

Don't have to do this yet, as we're only exposing this info to omdb, but this seems like a spot where we might add that "categorization" explicitly.

We will need categories like those in order to surface only the problems in the external API (not sure we will want to include the info notes).
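To make the categorization idea concrete, here is a minimal sketch of what explicitly categorized notes might look like. Every name here (`NoteCategory`, `OperatorNote`, `CategorizedOperatorNotes`, `problems`) is hypothetical and not taken from this PR, which stores plain `Vec<String>` notes; the four variants just mirror the four kinds of notes listed above.

```rust
/// Hypothetical categories mirroring the four kinds of notes listed above.
/// None of these names come from the PR itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NoteCategory {
    /// "This should never happen" (e.g., orphaned disks, zombie sleds).
    InvariantViolated,
    /// "This could happen, and is blocking planning from proceeding."
    PlanningBlocked,
    /// "This could happen, and should be transient."
    PossiblyTransient,
    /// "This is informational, and may or may not be expected behavior."
    Informational,
}

/// A single note tagged with its (hypothetical) category.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OperatorNote {
    pub category: NoteCategory,
    pub message: String,
}

/// Assumed replacement for a plain `Vec<String>` of notes.
#[derive(Debug, Clone, Default)]
pub struct CategorizedOperatorNotes {
    notes: Vec<OperatorNote>,
}

impl CategorizedOperatorNotes {
    /// Record a note under the given category.
    pub fn push(&mut self, category: NoteCategory, message: impl Into<String>) {
        self.notes.push(OperatorNote { category, message: message.into() });
    }

    /// Every note except the purely informational ones; this is the subset
    /// one might choose to surface in an external API.
    pub fn problems(&self) -> impl Iterator<Item = &OperatorNote> {
        self.notes
            .iter()
            .filter(|note| note.category != NoteCategory::Informational)
    }

    /// Consume the collection and return all notes in insertion order.
    pub fn into_notes(self) -> Vec<OperatorNote> {
        self.notes
    }
}
```

With something along these lines, omdb could keep printing everything via `into_notes()`, while an external endpoint could restrict itself to `problems()`.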

smklein · 2025-09-29T19:15:11Z

nexus/types/src/deployment/planning_report/operator_notes.rs

+            }
+            if npools > 0 {
+                self.notes.push(format!(
+                    "{npools} disk{} are unavailable across {nsleds} sled{}",


Suggested change

-                    "{npools} disk{} are unavailable across {nsleds} sled{}",
+                    "{npools} disk{} are missing storage service across {nsleds} sled{}",
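As an aside on the `disk{}` / `sled{}` placeholders in these strings: the sketch below shows one way such a message might be assembled so that both the noun suffix and the verb agree with the count. The helper names (`plural_suffix`, `be_verb`, `unavailable_disks_note`) are assumptions for illustration only; the PR's actual pluralization code is not shown in this thread.

```rust
/// Hypothetical helper: "" for exactly one item, "s" otherwise.
fn plural_suffix(n: usize) -> &'static str {
    if n == 1 { "" } else { "s" }
}

/// Hypothetical helper: pick the verb form that agrees with the count.
fn be_verb(n: usize) -> &'static str {
    if n == 1 { "is" } else { "are" }
}

/// Build a note along the lines of the one under review, or `None` when
/// there is nothing to report (mirroring the `if npools > 0` guard).
pub fn unavailable_disks_note(npools: usize, nsleds: usize) -> Option<String> {
    if npools == 0 {
        return None;
    }
    Some(format!(
        "{npools} disk{} {} unavailable across {nsleds} sled{}",
        plural_suffix(npools),
        be_verb(npools),
        plural_suffix(nsleds),
    ))
}
```

Whether the extra verb handling is worth it depends on how much these strings get reworded before they are exposed outside omdb.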

jgallagher requested review from davepacheco, plotnick and smklein September 29, 2025 17:49

smklein approved these changes Sep 29, 2025


karencfv mentioned this pull request Oct 1, 2025

Operator visibility for update problems #9129

Open
