OTA-1637: ClusterOperators should not go Progressing only for a node reboot #30296
base: main
@hongkailiu: This pull request references OTA-1637 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
22125c5 to dbb8c76
/test e2e-aws-ovn-upgrade |
I am a bit surprised that there are no failures. For example, this job: 🤔 |
Job Failure Risk Analysis for sha: dbb8c76
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: dbb8c76
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade #30296 |
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1adde290-995e-11f0-9d57-8ba68c7fd024-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1adde290-995e-11f0-9d57-8ba68c7fd024-0 |
dbb8c76 to bad744e
The payload command above works. The triggered job shows that the code caught some violations but also missed a few (compared with the COs in the attached image).
Refining the code to find out why they were missed.
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade #30296 |
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/05382750-998d-11f0-963e-c3b711331506-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/05382750-998d-11f0-963e-c3b711331506-0 |
@hongkailiu: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Job Failure Risk Analysis for sha: bad744e
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: bad744e
|
bad744e to d415550
Same list of caught COs as before, so something is still missing. Let us try again with the total number of events:
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade #30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e26f440-9a14-11f0-8659-71352787cde2-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e26f440-9a14-11f0-8659-71352787cde2-0 |
Job Failure Risk Analysis for sha: d415550
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: d415550
|
This covers the node-reboot case of the rule [1] that was introduced recently:

```
Operators should not report Progressing only because DaemonSets owned by them are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
```

The test fails if

- `co/machine-config` never became Progressing=True during a cluster upgrade, or
- some CO left Progressing=False during the upgrade after `machine-config` became Progressing=True. This should not have happened, because `machine-config` rebooting the nodes was the only thing going on in the cluster during that time.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
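The two failure conditions above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the code in this PR: the `transition` type, the `progressingViolations` function, and the `RebootingNodes`, `Deploying`, and `AsExpected` reason strings are hypothetical. It also assumes that every CO that turns Progressing=True after the first time `machine-config` does counts as a violation.

```go
package main

import "fmt"

// transition is a hypothetical, simplified record of one observed change of a
// ClusterOperator's Progressing condition during an upgrade.
type transition struct {
	Operator    string
	Progressing bool // new status of the Progressing condition
	Reason      string
}

// progressingViolations walks the transitions in chronological order.
// It reports whether machine-config ever became Progressing=True, and which
// other operators became Progressing=True after that point (assumption: the
// window stays open once machine-config has gone Progressing=True).
func progressingViolations(events []transition) (mcoProgressed bool, violators []string) {
	seen := map[string]bool{}
	for _, e := range events {
		if e.Operator == "machine-config" {
			if e.Progressing {
				mcoProgressed = true
			}
			continue
		}
		if mcoProgressed && e.Progressing && !seen[e.Operator] {
			seen[e.Operator] = true
			violators = append(violators, e.Operator)
		}
	}
	return mcoProgressed, violators
}

func main() {
	events := []transition{
		// Before machine-config starts rebooting nodes: not a violation.
		{Operator: "network", Progressing: true, Reason: "Deploying"},
		{Operator: "network", Progressing: false, Reason: "AsExpected"},
		{Operator: "machine-config", Progressing: true, Reason: "RebootingNodes"},
		// After machine-config went Progressing=True: a violation.
		{Operator: "dns", Progressing: true, Reason: "DNSReportsProgressingIsTrue"},
	}
	mcoProgressed, violators := progressingViolations(events)
	fmt.Println(mcoProgressed, violators) // prints "true [dns]"
}
```

In this sketch the test would fail either when the first return value is false or when the list of violators is not empty.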
d415550 to 969fcc5
Let us see if https://github.com/openshift/origin/compare/d41555013e72e50e055e6fc1ecf38229560c5b35..969fcc5a0bf55a5242c3e57a302f9e2fd2a04370 helps. /payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade #30296 |
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7f025ef0-9a48-11f0-971e-3b18f6cdf426-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7f025ef0-9a48-11f0-971e-3b18f6cdf426-0 |
The last job failed too early, for an unrelated reason.
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade #30296
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/74f1e9a0-9a69-11f0-92f8-cb7320e8c78f-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/74f1e9a0-9a69-11f0-92f8-cb7320e8c78f-0 |
Job Failure Risk Analysis for sha: 969fcc5
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 969fcc5
New tests seen in this PR at sha: 969fcc5
|
Spotted one more missing CO: olm. Update: |
/cc |
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a7e9d410-9d27-11f0-82cf-a7612e5a4c23-0 |
@hongkailiu: This PR was included in a payload test run from #30296
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a7e9d410-9d27-11f0-82cf-a7612e5a4c23-0 |
/cc |
Some comments inline, probably nothing showstopping so LGTM with a /hold
/lgtm
/approve
/hold
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
except := func(co string, condition *configv1.ClusterOperatorStatusCondition) string {
	return ""
}
This is confusing for a reader when it is not used yet. Can we add a comment or a bit of (commented-out) boilerplate that helps whoever reads this to 1) understand why this exists 2) how to actually add an exception (like, what should the function return? some kind of string but what should it look like?)
My original plan is to file the OCPBUGS and then use them as exceptions here after going through code review for those violating components we have found so far:
dns
image-registry
network
node-tuning
storage
kube-storage-version-migrator
csi-snapshot-controller
ingress
service-ca
olm
The exception function will look like
except := func(co string, condition *configv1.ClusterOperatorStatusCondition) string {
	switch co {
	case "dns":
		if condition.Reason == "DNSReportsProgressingIsTrue" {
			return "https://issues.redhat.com/browse/OCPBUGS-xxx"
		}
	}
	return ""
}
Otherwise, it would cause payload failures like https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30296-openshift-origin-30296-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade/1972624949952647168 after merge.
Do I understand it correctly?
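To make the contract of the exception function concrete, here is a minimal, self-contained sketch of how a caller might consume it. It is an assumption, not the code in this PR: `condition` is a hypothetical stand-in for `configv1.ClusterOperatorStatusCondition`, `classify` is a hypothetical helper, and reporting a known bug as a tolerated issue rather than a failure is only one possible policy. The `OCPBUGS-xxx` URL is the placeholder from the comment above.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for
// configv1.ClusterOperatorStatusCondition; only Reason is needed here.
type condition struct {
	Reason string
}

// exceptionFunc returns the URL of a tracked bug when the given operator and
// condition match a known violation, or "" when there is no exception.
type exceptionFunc func(co string, c *condition) string

// classify turns one observed violation into a message. Violations with a
// matching exception are reported as known issues (assumed policy); all
// others are reported as failures.
func classify(co string, c *condition, except exceptionFunc) string {
	if bug := except(co, c); bug != "" {
		return fmt.Sprintf("known issue for %s: %s", co, bug)
	}
	return fmt.Sprintf("failure: %s went Progressing=True (reason %q)", co, c.Reason)
}

func main() {
	// Same shape as the exception function proposed in the comment above.
	except := func(co string, c *condition) string {
		switch co {
		case "dns":
			if c.Reason == "DNSReportsProgressingIsTrue" {
				return "https://issues.redhat.com/browse/OCPBUGS-xxx"
			}
		}
		return ""
	}

	fmt.Println(classify("dns", &condition{Reason: "DNSReportsProgressingIsTrue"}, except))
	fmt.Println(classify("olm", &condition{Reason: "SomeOtherReason"}, except))
}
```

Returning the bug URL rather than a boolean keeps the reason for each exception visible in the test output, which may help a reader understand why a given violation is tolerated.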
Ah okay, so this PR is essentially still WIP and we need to fill in these exceptions? Then I guess it is fine. My worry was that we'd merge this empty and then an uninformed reader would need to read the whole function to understand why `except` is there.
pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hongkailiu, petr-muller The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |