
test-cmd flake: quota.sh:32: executing 'oc describe appliedclusterresourcequota/for-deads-by-annotation -n bar --as deads' #11560

Closed
csrwng opened this issue Oct 25, 2016 · 16 comments · Fixed by #11595
Assignees: mfojtik
Labels: area/tests, kind/bug, kind/test-flake, priority/P1
Milestone: 1.4.0

Comments

csrwng commented Oct 25, 2016

github.com/openshift/origin/test/cmd/quota/clusterquota

test/cmd/quota.sh:32: executing 'oc describe appliedclusterresourcequota/for-deads-by-annotation -n bar --as deads' expecting any result and text 'secrets.*1[0-9]'; re-trying every 0.2s until completion or 60.000s

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/7682/

csrwng added the priority/P2, area/tests, and kind/test-flake labels on Oct 25, 2016

csrwng commented Oct 25, 2016

@deads2k that's a very incriminating test, if I ever saw one :)

bparees commented Oct 25, 2016

yeah, never put your own name in a test case :) (I just hit this also)

bparees commented Oct 26, 2016

@stevekuznetsov this is starting to look like a good candidate for the "if it's not fixed in a day, back it out" philosophy.

stevekuznetsov commented Oct 26, 2016

@bparees not sure if we got buy-in on that or if I'm the one to do it, but yes I agree with you. @deads2k any updates?

bparees commented Oct 26, 2016

@stevekuznetsov not yet, the leads call hasn't happened yet. i'm just building the case :)

deads2k assigned mfojtik and unassigned deads2k on Oct 26, 2016

deads2k commented Oct 26, 2016

> @stevekuznetsov this is starting to look like a good candidate for the "if it's not fixed in a day, back it out" philosophy.

Depends on what you mean by "back it out". I think that commenting out the test (which @stevekuznetsov has proposed in the past) would be the wrong thing to do. This was likely caused by #11394 and so @mfojtik should have his chance to fix it before the pull is reverted (which I would be ok with).

But I don't see anyone referencing it in this issue, so I'm not sure what you'd back out.

stevekuznetsov commented Oct 26, 2016

> But I don't see anyone referencing it in this issue, so I'm not sure what you'd back out.

Even saying the magic words got the likely culprit PR mentioned and the right person assigned... not sure I even needed to think about what to back out :)

bparees commented Oct 26, 2016

@deads2k again, it needs to be discussed w/ the team, but i'm prepared to say:

  1. if we don't know what broke the test, the test gets disabled while the person who owns the test investigates
  2. if we know what broke it, the pull that broke it gets reverted immediately. it never should have gone in (it should have failed testing due to the flakes it introduced) so why would we leave it in?
  3. in a slightly less harsh world, i'd be amenable to an immediate fix being delivered in lieu of reverting, but if the fix is not known immediately, then reverting the PR seems like the right thing to do.

I think it's important to recognize the huge productivity cost flakes impose on teams trying to test/merge code. Every time someone hits a flake, they have to investigate why things failed, look up the associated issue, and then wait for another round of test/merge processing. That's pretty costly, not to mention how discouraging/frustrating it is for developers. I don't really feel like our current flake triaging/prioritizing process is addressing that.

deads2k commented Oct 26, 2016

> 1. if we don't know what broke the test, the test gets disabled while the person who owns the test investigates

This condition shouldn't be likely. Even if we don't run it all the time, a bisect to find the offender ought to be something the test infrastructure can answer in a couple of hours: spawn 10+ concurrent jobs for every likely commit and see which ones flake on the new problem. "I don't know" seems like an unreasonable thing to say. For things like this, if the test got disabled, I think it's unlikely anyone would have chased it down.
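
A rough Go sketch of that bisect idea (hypothetical throughout: the commit list, run count, and ./hack/test-cmd.sh invocation are placeholders, and real CI jobs would each need an isolated environment rather than sharing one worktree):

    // Hypothetical flake-bisect driver: for each suspect commit, run the
    // flaky suite several times concurrently and count failures. The
    // script path and commits are placeholders, not real origin tooling.
    package main

    import (
    	"fmt"
    	"os/exec"
    	"sync"
    )

    // flakeCount checks out a commit and reports how many of n
    // concurrent runs of the test script fail.
    func flakeCount(commit string, n int) int {
    	if err := exec.Command("git", "checkout", commit).Run(); err != nil {
    		panic(err)
    	}
    	var wg sync.WaitGroup
    	failures := make(chan struct{}, n)
    	for i := 0; i < n; i++ {
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			if exec.Command("./hack/test-cmd.sh").Run() != nil {
    				failures <- struct{}{}
    			}
    		}()
    	}
    	wg.Wait()
    	close(failures)
    	return len(failures)
    }

    func main() {
    	for _, c := range []string{"commitA", "commitB"} { // likely commits
    		fmt.Printf("%s: %d/10 runs flaked\n", c, flakeCount(c, 10))
    	}
    }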

As I noted in my comment, 2 and 3 seem pretty reasonable, but simply "let's disable this test" doesn't seem like a good path forward to me.

liggitt commented Oct 26, 2016

it's not an issue with the test, it's a bug. Note in the output below that the status total reports 24 secrets used, while the two namespaces sum to only 18 (9 + 9):

- apiVersion: v1
  kind: ClusterResourceQuota
  metadata:
    creationTimestamp: 2016-10-26T14:12:57Z
    name: for-deads-by-annotation
    resourceVersion: "1965"
    selfLink: /oapi/v1/clusterresourcequotas/for-deads-by-annotation
    uid: 4b6272f8-9b86-11e6-aa70-acbc32c1ca87
  spec:
    quota:
      hard:
        secrets: "50"
    selector:
      annotations:
        openshift.io/requester: deads
      labels: null
  status:
    namespaces:
    - namespace: foo
      status:
        hard:
          secrets: "50"
        used:
          secrets: "9"
    - namespace: bar
      status:
        hard:
          secrets: "50"
        used:
          secrets: "9"
    total:
      hard:
        secrets: "50"
      used:
        secrets: "24"

deads2k commented Oct 26, 2016

> it's not a test break, it's a bug:

That's neat. I still don't think the right answer is to comment out the test.

bparees commented Oct 26, 2016

> For things like this, if the test got disabled, I think it's unlikely anyone would have chased it down.

yeah, that's a concern. Obviously if we disable a test, we need a high-priority tracking issue (p1/blocker) to ensure the test does get re-enabled at some point in the future.

liggitt commented Oct 26, 2016

to reproduce, run this over and over until the last output shows the mismatch between the total and the sum of the individual namespaces:

    oc delete project foo bar
    oc delete clusterresourcequota --all
    oc create clusterquota for-deads-by-annotation --project-annotation-selector=openshift.io/requester=deads --hard=secrets=50
    oc new-project foo --as=deads
    oc new-project bar --as=deads
    oc get clusterresourcequota/for-deads-by-annotation -o yaml

liggitt commented Oct 26, 2016

one possibility:

checkAttributes()/checkQuotas() assume they are free to modify the quotas returned from the accessor

if checkQuotas() fails to update a quota, it refetches from the accessor, and recurses into checkQuotas(), reapplying the request attributes to the returned quotas

the clusterresourcequota accessor does not return copies, but original fields from the cache
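
To make that concrete, here is a minimal, self-contained Go sketch of the suspected pattern; the Quota, cacheAccessor, and admit names are illustrative stand-ins, not the actual origin code:

    // Sketch of the suspected aliasing bug: the accessor hands back a
    // pointer into its cache, the admission path mutates the quota in
    // place, and a conflict-retry refetches the already-mutated object,
    // applying the request's usage twice.
    package main

    import "fmt"

    type Quota struct {
    	Name string
    	Used map[string]int64
    }

    type cacheAccessor struct {
    	cache map[string]*Quota
    }

    // Buggy: returns the cached object itself, so callers share state.
    func (a *cacheAccessor) Get(name string) *Quota {
    	return a.cache[name]
    }

    // Safe variant: a deep copy the caller may mutate freely.
    func (a *cacheAccessor) GetCopy(name string) *Quota {
    	orig := a.cache[name]
    	used := make(map[string]int64, len(orig.Used))
    	for k, v := range orig.Used {
    		used[k] = v
    	}
    	return &Quota{Name: orig.Name, Used: used}
    }

    // admit mimics checkQuotas-style logic: apply the request's usage,
    // try to persist, and on failure refetch and retry.
    func admit(a *cacheAccessor, name string, delta int64, tryUpdate func(*Quota) error) error {
    	q := a.Get(name) // BUG: q aliases the cache
    	q.Used["secrets"] += delta
    	if err := tryUpdate(q); err != nil {
    		// The "fresh" object already carries the first attempt's
    		// mutation, so the delta lands a second time.
    		q = a.Get(name)
    		q.Used["secrets"] += delta
    		return tryUpdate(q)
    	}
    	return nil
    }

    func main() {
    	a := &cacheAccessor{cache: map[string]*Quota{
    		"for-deads-by-annotation": {Name: "for-deads-by-annotation", Used: map[string]int64{"secrets": 18}},
    	}}
    	calls := 0
    	failOnce := func(*Quota) error { // first update hits a conflict
    		calls++
    		if calls == 1 {
    			return fmt.Errorf("conflict")
    		}
    		return nil
    	}
    	_ = admit(a, "for-deads-by-annotation", 3, failOnce)
    	// Prints 24, not the expected 21: the same drift between the
    	// total and the per-namespace sums seen in the YAML above.
    	fmt.Println(a.cache["for-deads-by-annotation"].Used["secrets"])
    }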

liggitt added the kind/bug and priority/P1 labels and removed priority/P2 on Oct 26, 2016
liggitt added this to the 1.4.0 milestone on Oct 26, 2016

liggitt commented Oct 26, 2016

clusterresourcequota status totals are getting out of sync with the sum of the namespaces. still debugging why. must-fix for 1.4

deads2k commented Oct 27, 2016

found it in deep copies: #11621
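
In terms of the sketch in the earlier comment, a deep-copy fix amounts to never handing callers the cached object (illustrative only, not the actual diff in #11621):

    // In admit(), fetch a private copy instead of the cached object:
    q := a.GetCopy(name) // retries now start from unmutated state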
