Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node Selector assigns wrong architecture for builds #2505

Closed
5 tasks done
pat-s opened this issue Sep 27, 2023 · 12 comments · Fixed by #2531
Closed
5 tasks done

Node Selector assigns wrong architecture for builds #2505

pat-s opened this issue Sep 27, 2023 · 12 comments · Fixed by #2531
Labels
bug Something isn't working

Comments

@pat-s
Copy link
Contributor

pat-s commented Sep 27, 2023

Component

agent

Describe the bug

Recently, I am seeing builds being assigned to the wrong nodes on our cluster.
Inspecting the pods, I can see that the platform env, which comes from the matrix definition, is set to amd64. However, the nodeSelector is set to arm64 instead.

Environment:
  platform:                        linux/amd64 

[...]

Node-Selectors:              kubernetes.io/arch=arm64 

The WF definition is as follows

platform: ${platform}

matrix:
  include:
    - DISTRIBUTION: ubuntu
      RELEASE: focal
      OS_RELEASE: 20.04
      platform: linux/amd64
    - DISTRIBUTION: ubuntu
      RELEASE: jammy
      OS_RELEASE: 22.04
      platform: linux/amd64
    - DISTRIBUTION: redhat/ubi8-minimal
      RELEASE: latest
      platform: linux/amd64
      OS_RELEASE: rhel8
    - DISTRIBUTION: redhat/ubi9-minimal
      RELEASE: latest
      platform: linux/amd64
      OS_RELEASE: rhel9

Could it be that #2048 changed the logic in a way that nodeSelector is always set to CI_SYSTEM_PLATFORM and does not honor platform env anymore?

platform, exist := step.Environment["CI_SYSTEM_PLATFORM"]

Or maybe one of the changes within the recent months caused this behavior?
Tagging @6543 and @zc-devs as you both did modifications in this area lately.

I am not having enough experience in GO to make an educated judgement right now - all I can see is that CI_SYSTEM_PLATFORM is not honoring platform env and therefore the nodeSelector is wrong in the end :)

System Info

`next-f8e91f00aa`

Additional context

No response

Validations

  • Read the Contributing Guidelines.
  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
  • Check that this is a concrete bug. For Q&A join our Discord Chat Server or the Matrix room.
@pat-s pat-s added the bug Something isn't working label Sep 27, 2023
@pat-s
Copy link
Contributor Author

pat-s commented Sep 28, 2023

I can confirm that the issue is not present in v1.0.2.

@pat-s
Copy link
Contributor Author

pat-s commented Oct 4, 2023

@6543 @zc-devs This is somewhat important and should be fixed before the next release or reverted (if no one has time). A response would be appreciated :)

@pat-s
Copy link
Contributor Author

pat-s commented Oct 4, 2023

Given that 1.0.2 works and hence everything in main until July 29th should be fine, I suspect #2249 to be the issue.

After testing: jup, main until 3954d85 and the arch/nodeSelector mapping LIKELY broke with #2249.

@klinux Are you able to look into this and fix the code? I'd revert the PR until it's fixed.

@zc-devs
Copy link
Contributor

zc-devs commented Oct 4, 2023

Is your WP server and agent run on arm64? And you want to run step on amd64 by setting platform in the matrix , right? Does it work, if you set it in a step directly rather then in the matrix?

nodeSelector is always set to CI_SYSTEM_PLATFORM and does not honor platform env anymore?

nodeSelector was always set from environment variable. It used to be CI_SYSTEM_ARCH. Then it became CI_SYSTEM_PLATFORM.

@pat-s
Copy link
Contributor Author

pat-s commented Oct 4, 2023

Is your WP server and agent run on arm64?

Agent runs on one (random assignment), agent runs on both (one of each).

And you want to run step on amd64 by setting platform in the matrix , right?

Yes, that worked (at least) until #2249. See OP. It might be that another commit broke it but I haven't done further bisecting yet and just looked at the history of pod.go.

Does it work, if you set it in a step directly rather then in the matrix?

I didn't test this but it doesn't matter much for the issue at hand as the matrix assignments as shown in the OP should work. So essentially, the mapping of platform must be honored (if defined) and carried over to the NodeSelector.

@klinux
Copy link
Contributor

klinux commented Oct 4, 2023

@pat-s my PR do nothing with nodeselector, toleration has no impact with nodeSelector, do you have any evidence about that?

@pat-s
Copy link
Contributor Author

pat-s commented Oct 4, 2023

Nope, not really. You're right. My bad! I was too fast judging on the history...the issue must be somewhere else.

@qwerty287
Copy link
Contributor

qwerty287 commented Oct 4, 2023

@pat-s The platform key in YAMLs is not supported anymore and was removed in #2480. Use labels instead: https://woodpecker-ci.org/docs/usage/pipeline-syntax#labels

(I don't know if this is the underlying issue of this as I don't use kubernetes)

@zc-devs
Copy link
Contributor

zc-devs commented Oct 4, 2023

I've done some tests on 61b5672.

Pipeline:

skip_clone: true
platform: linux/mips
steps:
  test:
    image: alpine
    commands:
      - echo Hello from test

Agent on server:

update agents set platform = 'linux/mips';

Then agent doesn't pick up job.

Then I changed labels on agent:

WOODPECKER_FILTER_LABELS: platform=linux/mips

And agent runs job successfully. But I have no platform env var on the pod.

Also I have CI_SYSTEM_PLATFORM=linux/amd64. And it looks right, because it is set up from Agent's node platform.

I guess, that your agent wrongly picks up job that intended for other agent.

It's difficult to test as I have no different platforms. Maybe labels will help you somehow (@qwerty287 has mentioned above).

@pat-s
Copy link
Contributor Author

pat-s commented Oct 4, 2023

Ahhh! I missed this commit...
This likely seems to be the issue.

Previously I had

platform: ${platform}

matrix:
  include:
    - DISTRIBUTION: ubuntu
      platform: linux/amd64

which - before #2480 then - resulted in correct agent assignments as the platform from the matrix assignment was populated to the top level platform key and then to the NodeSelector of the respective run.
Now - presumably after #2480 - this results in random agent assignments now.

However just replacing this with

labels:
  platform: ${platform}

doesn't do it (faulty YAML).

Also using

matrix:
  include:
    - DISTRIBUTION: ubuntu
      CI_SYSTEM_PLATFORM: linux/amd64

results in the wrong assignment.

Also the documentation here also needs an update: https://woodpecker-ci.org/docs/next/usage/matrix-workflows#example-matrix-pipeline-using-multiple-platforms

Is it actually still possible to set the arch within a matrix statement in current HEAD? I've tried many things and looked into the code but AFAICS CI_SYSTEM_PLATFORM is always overwritten in the respective pod env and the value set in the matrix definition never honored 🤔 - i.e. it is always set to linux/amd64 even though I explicitly set it to arm64 via

matrix:
  include:
    - DISTRIBUTION: ubuntu
      platform: linux/arm64

@pat-s
Copy link
Contributor Author

pat-s commented Oct 5, 2023

I've figured it out. The matrix notation should look like this

matrix:
  include:
    - DISTRIBUTION: ubuntu
      ARCH: arm64

where ARCH is just an arbitrary key name.

Then, in the pipeline, kubernetes.io/arch must be set in the steps backend_options as follows:

[...]
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/arch: "${ARCH}"

The documentation also needs to be updated. Will do...

@zc-devs
Copy link
Contributor

zc-devs commented Oct 11, 2023

So, there is nothing wrong with nodeSelector and Kubernetes backend. Issue should be renamed.

  1. Platform labels work.
  2. Matrix works too:
matrix:
  PLATFORM:
    - linux/amd64
    - windows/arm64

skip_clone: true
steps:
  test:
    image: alpine
    commands:
      - echo Hello from ${PLATFORM}

Screenshot 2023-10-11 160514
3. There is some bug in their combination:

matrix:
  PLATFORM:
    - linux/amd64
    - windows/arm64

labels:
  platform: ${PLATFORM}

skip_clone: true
steps:
  test:
    image: alpine
    commands:
      - echo Hello from ${PLATFORM}

Server log:

{"level":"trace","error":"Cannot unmarshal '<nil>' of type <nil> into a string value","time":"2023-10-11T13:01:30Z","caller":"/woodpecker/server/pipeline/filter.go:69","message":"failed to parse config '.woodpecker.yml'"}

^ version from cbd0c26


Cannot reproduce error with the same pipeline on a95f1bc. @pat-s, test again please.
image
For windows/arm64 pipeline never runs. Looks like it works as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants