Skip to content

Conversation

jgallagher
Copy link
Contributor

We chatted about this briefly during today's update watercooler (recorded). Some thoughts and issues from that discussion:

  • Many of our "something might be wrong" notes fire normally during an upgrade, but only transiently. (E.g., we warn if a sled is missing from inventory, and that will be true for sleds that are in the process of rebooting.)
  • The planner report alone doesn't really have enough context to make intelligent decisions like "is this sled missing from inventory expected because we just updated it?"
  • We do have some notes that are of the same "this is bad" class as asserts: things that indicate some internal invariant is violated, and recovery requires support intervention.

This PR only adds an in-memory version of the notes, and it attempts to mostly not emit things that are expected during updates. I'm sure I missed some in both directions, but it's a starting point. A proposal is to bikeshed this PR a bit until we're happy enough, knowing that as of this PR, these notes are only visible via omdb. If we can land this and start using it during updates, we might get some idea of whether these are worth surfacing to the external API at all in the R17 time frame, or whether we want to get more update experience first. If we do want to surface these, we'll need to serialize them as part of the blueprint, add them to the external API, and maybe add a way to see the notes from the planner task and not just the most recent target blueprint. All of those should be straightforward.

These should be (a) serialized in the db along with the blueprint they
came from and (b) be available in the external API, but I wanted to get
a first pass out so folks could look over the way I've pulled these out.
I'm not at all sure this is the best way to generate these strings, and
even if it is, they undoubtedly need some serious wordsmithing for
external consumption and consistency.

let operator_notes = report.operator_notes().into_notes();
if !operator_notes.is_empty() {
println!("\nnotes for customer operator:");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiniest of nits; you think we could just say "notes for operator", here and below? "customer operator" feels like an unnecessary qualifier

Comment on lines +179 to +180
"{nsleds_error_manifest} sled{} have unexpected errors \
on their boot disks",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about:

Suggested change
"{nsleds_error_manifest} sled{} have unexpected errors \
on their boot disks",
"{nsleds_error_manifest} sled{} have unexpected errors \
and are unable to access their stored configuration",

(It happens to be stored on the boot disk, but the problem is that we can't access the configuration for zones when we want it)


#[derive(Debug, Clone)]
pub struct PlanningOperatorNotes {
notes: Vec<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing a few types of notes falling out from this initial construction:

  • "This should never happen" (e.g., orphaned disks, zombie sleds)
  • "This could happen, and is blocking planning from proceeding" (e.g., no target set, we don't have a disk to place a necessary NTP service, etc)
  • "This could happen, and should be transient, but may be an issue if it's recurring" (e.g., something not seen in inventory)
  • "This is informational, and may or may not be expected behavior". (e.g., converting sleds to known versions)

I wonder if it's worth pulling these categories apart to highlight certain higher-value messages (e.g., "you're watching update status, but never set a target!") from lower-value messages (e.g., "we did work, and things progressed as expected").

Don't have to do this yet, as we're only exposing this info to omdb, but this seems like a spot where we might add that "categorization" explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need categories like those in order to surface only the problems in the external API (not sure we will wants to include the info notes).

}
if npools > 0 {
self.notes.push(format!(
"{npools} disk{} are unavailable across {nsleds} sled{}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"{npools} disk{} are unavailable across {nsleds} sled{}",
"{npools} disk{} are missing storage service across {nsleds} sled{}",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants