Operator visibility for update problems #9129

@karencfv

PR #9116 is already a first pass at what operator notes could/will look like. Its description raises a few questions that we think are worth keeping in an issue so we can keep track of them:

  • Many of our "something might be wrong" notes fire normally during an upgrade, but only transiently. (E.g., we warn if a sled is missing from inventory, and that will be true for sleds that are in the process of rebooting.)
  • The planner report alone doesn't really have enough context to make intelligent decisions like "is this sled missing from inventory expected because we just updated it?"
  • We do have some notes that are of the same "this is bad" class as asserts: things that indicate some internal invariant is violated, and recovery requires support intervention.
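
To make the three points above concrete, here is a minimal sketch of how a note could be classified once it carries a bit of update context. None of this is taken from #9116 or the planner: `Severity`, `Note`, `classify`, and the `reboot_grace` window are all hypothetical names chosen for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical severity classes for an operator note.
#[derive(Debug, PartialEq, Eq)]
enum Severity {
    /// Expected during an upgrade; should clear on its own.
    Transient,
    /// Not explained by a recent update; worth an operator's attention.
    NeedsAttention,
    /// An internal invariant is violated; recovery needs support.
    InvariantViolation,
}

/// Hypothetical note kinds, loosely modeled on the examples above.
#[derive(Debug)]
enum Note {
    /// `last_update_started` is the extra context the planner report
    /// alone does not have today (an assumption of this sketch).
    SledMissingFromInventory { last_update_started: Option<Instant> },
    InvariantViolated,
}

/// Classify a note, treating a missing sled as transient only if we
/// started updating it within `reboot_grace`.
fn classify(note: &Note, now: Instant, reboot_grace: Duration) -> Severity {
    match note {
        Note::SledMissingFromInventory { last_update_started: Some(t) }
            if now.saturating_duration_since(*t) <= reboot_grace =>
        {
            Severity::Transient
        }
        Note::SledMissingFromInventory { .. } => Severity::NeedsAttention,
        Note::InvariantViolated => Severity::InvariantViolation,
    }
}
```

The point of the sketch is only that the "is this expected?" decision needs the time of the last update attempt as an input; where that context would actually come from is an open question.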

Today we had our first update that was stuck due to an SP not coming back after being reset. The planner acted as expected but it was not very clear to an end user where the problem lay. Below are a few items which would have made the problem easier to understand.

  • When the last successful update happened, and for which component on which sled.
  • Whether a pending component update has failed to be planned more than n times.
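
As a rough sketch of the two items above, something like the following record would be enough to answer both. The types and method names (`ComponentId`, `UpdateProgress`, `record_unplannable`, `stuck`) are hypothetical and do not correspond to existing planner code.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Hypothetical identifier: which component on which sled.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ComponentId {
    sled: String,
    component: String,
}

/// When and where the most recent successful update happened.
#[derive(Debug, Clone, PartialEq)]
struct LastSuccess {
    target: ComponentId,
    at: SystemTime,
}

/// Hypothetical progress record kept alongside planning output.
#[derive(Debug, Default)]
struct UpdateProgress {
    last_success: Option<LastSuccess>,
    /// Number of planning passes in which a pending update for this
    /// component could not be planned.
    failed_plans: HashMap<ComponentId, u32>,
}

impl UpdateProgress {
    /// Record a completed update and clear its failure counter.
    fn record_success(&mut self, target: ComponentId, at: SystemTime) {
        self.failed_plans.remove(&target);
        self.last_success = Some(LastSuccess { target, at });
    }

    /// Record one planning pass that could not plan this pending update.
    fn record_unplannable(&mut self, target: ComponentId) {
        *self.failed_plans.entry(target).or_insert(0) += 1;
    }

    /// Components whose pending update failed to plan more than `n` times,
    /// sorted for stable output.
    fn stuck(&self, n: u32) -> Vec<&ComponentId> {
        let mut ids: Vec<&ComponentId> = self
            .failed_plans
            .iter()
            .filter(|(_, count)| **count > n)
            .map(|(id, _)| id)
            .collect();
        ids.sort();
        ids
    }
}
```

For this to help in a case like ours, the record would have to be stored somewhere that survives the loss of the Nexus driving the update; the sketch says nothing about where.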

In the case of the stuck MGS update that we had, the Nexus driving the failed update was on the same sled whose update failed. That means that when we tried to look at the logs for the last update, there was nothing to see because there was no Nexus. We should capture this information somewhere more durable than logs alone, either in planning reports or elsewhere.

Adjacent to this, both the RoT and the RoT bootloader call boot info after a reset to verify that the component has come back before reporting an update as finished. It's probably a good idea to have something similar for all components.
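
A generic version of that post-reset check could look roughly like the sketch below. It is not how the RoT path is implemented: `verify_after_reset`, `PostResetCheck`, and the `probe` closure (standing in for something like a boot info call) are hypothetical, and returning immediately on an unexpected version is an assumption of this sketch.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical outcome of checking a component after a reset.
#[derive(Debug, PartialEq, Eq)]
enum PostResetCheck {
    /// The component answered and reports the expected version.
    CameBack { attempts: u32 },
    /// The component answered but reports something else.
    WrongVersion { found: String },
    /// The component never answered within the attempt budget.
    NoResponse,
}

/// Poll `probe` up to `max_attempts` times, waiting `delay` between
/// unanswered attempts. `probe` returns `None` while the component is
/// unreachable and `Some(version)` once it responds.
fn verify_after_reset<F>(
    mut probe: F,
    expected: &str,
    max_attempts: u32,
    delay: Duration,
) -> PostResetCheck
where
    F: FnMut() -> Option<String>,
{
    for attempt in 1..=max_attempts {
        match probe() {
            Some(v) if v == expected => {
                return PostResetCheck::CameBack { attempts: attempt };
            }
            Some(v) => return PostResetCheck::WrongVersion { found: v },
            None => {
                // Still rebooting (or stuck); wait unless this was the last try.
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
        }
    }
    PostResetCheck::NoResponse
}
```

With a check like this, an SP that never comes back would surface as an explicit `NoResponse` result on the update itself rather than as an update that silently never finishes.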

Labels: Update System (Replacing old bits with newer, cooler bits)
