Node maintenance mode #2

Closed
mvgorbunov opened this issue Apr 1, 2024 · 6 comments

mvgorbunov commented Apr 1, 2024

Support a single-node maintenance mode: a command like ydbops node maintenance --host <node_fqdn> [--user USER --ttl REQUEST_TTL --prepare-node] ...
In general, maintenance mode means that the node will NOT be prepared (NO tablet or session drain, NO moving bs-groups out, etc.) unless --prepare-node is specified; the command only checks the current cluster state and locks the node in CMS.
The --prepare option is supported starting from 24-1.
We need to warn users that this mode can be destructive and that they should use it only for non-destructive operations (e.g. replacing a disk that has already failed, or rebooting the server).
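
The behaviour described above can be summarised as a small sketch. This is a toy model for illustration only: the function name `maintenance_steps`, the step strings, and the ordering of the steps are assumptions and do not correspond to real ydbops or CMS code.

```python
def maintenance_steps(prepare_node: bool) -> list:
    """Return the steps a maintenance request would perform (hypothetical model)."""
    steps = ["check current cluster state"]
    if prepare_node:
        # Assumption: only with --prepare-node is the node de-populated first.
        steps += ["drain tablets and sessions", "move bs-groups out"]
    steps.append("lock node in CMS")
    return steps


print(maintenance_steps(prepare_node=False))
print(maintenance_steps(prepare_node=True))
```

Without preparation the node is locked while it is still serving load, which is why the warning above matters.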

Contributor

Jorres commented May 7, 2024

Discussed with @pixcc and came to the following conclusions:

  1. node maintenance (as opposed to host maintenance) is not really required
  2. ydbops maintenance host WILL give the caller back a task ID (a string identifier), because:
    2.1) CMS has a priority mechanism. You can schedule a request to CMS with high priority, and even if the nodes from this request can be given away, they won't be given to anyone with lower priority.
    2.2) with the --prepare functionality, the user WILL need to wait until the host has been de-populated before taking it out. So the task WILL have to exist in CMS for some time in a non-completed state.
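
To make point 2.1 concrete, here is a minimal sketch of priority-based arbitration over pending tasks. It is a toy model, not the actual CMS algorithm: the `Task` fields, the `grant_host` function, the user names, and the convention that a larger number means higher priority are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    user: str
    host: str
    priority: int  # assumption: a larger number means higher priority


def grant_host(pending: list, host: str) -> Optional[Task]:
    """Pick which pending task receives `host` once it can be given away."""
    candidates = [t for t in pending if t.host == host]
    if not candidates:
        return None
    # The highest-priority request wins; lower-priority ones keep waiting.
    return max(candidates, key=lambda t: t.priority)


pending = [
    Task("task-a", "rolling-restart", "host-1", priority=1),
    Task("task-b", "oncall", "host-1", priority=10),
]
winner = grant_host(pending, "host-1")
print(winner.task_id)  # task-b
```

Because a task can sit in the pending list for a while (either outranked or waiting for the host to be de-populated), the caller needs a stable identifier to come back to it later.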

Assuming that the initial ydbops maintenance host is invoked with --host-fqdn, we tried to think of a way to NOT give the user the task id: maybe ydbops could determine the task from the same --host-fqdn later, when the user comes back with the next command, ydbops maintenance [refresh|drop|complete]. But a lot of problems quickly showed up:

  1. What if multiple tasks request the same host? To which of them should the operation be applied? To the current one?
  2. What if the current one was not created by you? Should you be allowed to modify other users' tasks? (example users: walle, rolling restart, infra on call)
  3. Finally, K8s. It is impossible to find out which tasks were created for your --host-fqdn, because even if you list all the tasks from CMS, they will contain pod-internal fqdns, not the external fqdn that the user gave you, and the mapping can be ambiguous.
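
The first and third problems can be shown with a short sketch of a host-based lookup. Everything here is hypothetical: the `Task` shape, the `find_task_by_host` helper, and the example fqdns are invented for illustration and are not how ydbops or CMS represent tasks.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    user: str
    hosts: list  # fqdns as the task list reports them (hypothetical field)


def find_task_by_host(tasks, host_fqdn):
    """Return the single task that requested host_fqdn, or raise LookupError."""
    matches = [t for t in tasks if host_fqdn in t.hosts]
    if not matches:
        raise LookupError(f"no task mentions {host_fqdn}")
    if len(matches) > 1:
        ids = ", ".join(t.task_id for t in matches)
        raise LookupError(f"{host_fqdn} is requested by several tasks: {ids}")
    return matches[0]


tasks = [
    Task("task-1", "walle", ["ydb-node-0.internal.example"]),
    Task("task-2", "rolling-restart", ["ydb-node-0.internal.example"]),
]

# First lookup: two tasks want the same host. Second lookup: the user passes
# an external fqdn, but the tasks only carry pod-internal names.
for fqdn in ("ydb-node-0.internal.example", "host-a.external.example"):
    try:
        print(find_task_by_host(tasks, fqdn).task_id)
    except LookupError as err:
        print("cannot resolve:", err)
```

Both lookups fail in this toy model, which is the reason an explicit task id is the more reliable handle.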

A quick schematic (basically just for me):
(image: schematic)

Contributor

Jorres commented May 10, 2024

Discussed with @mvgorbunov:

  1. rename ydbops maintenance host -> ydbops maintenance create
  2. create and complete operate on different entities (host-fqdn and task-id respectively), but it is impossible to implement otherwise (complete --host-fqdn does not give enough information)
  3. even if the user didn't note their task-id when ydbops maintenance create was called, it is still possible to run ydbops maintenance list tasks and try to pick out their own task, at least based on the username.
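
Point 3 amounts to filtering the listed tasks by their creator. A minimal sketch, assuming the task list has already been parsed into dictionaries with `task_id` and `user` keys (both key names, the helper `tasks_created_by`, and the sample data are hypothetical, not the real output format of the list command):

```python
def tasks_created_by(tasks, user):
    """Keep only the tasks whose (hypothetical) 'user' field matches."""
    return [t for t in tasks if t["user"] == user]


listed = [
    {"task_id": "task-1", "user": "walle"},
    {"task_id": "task-2", "user": "alice"},
    {"task_id": "task-3", "user": "alice"},
]
mine = tasks_created_by(listed, "alice")
print([t["task_id"] for t in mine])  # ['task-2', 'task-3']
```

If the same user has several tasks, as in this example, the username alone narrows the choice but does not settle it, so noting the task-id at creation time is still the safer habit.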

Contributor

Jorres commented Nov 7, 2024

The feature has been ready for some time:

❯ ydbops maintenance --help
ydbops maintenance [command]:
    Manage host maintenance operations: request and return hosts
    with performed maintenance back to the cluster.

Usage: ydbops [global options...] maintenance [options] <subcommand> 

Subcommands:
maintenance            Request hosts from the Cluster Management System
├─ complete            Declare the maintenance task completed
├─ create              Create a maintenance task to obtain a set of hosts
├─ drop                Drop an existing maintenance task
├─ list                List all existing maintenance tasks
└─ refresh             Try to obtain previously reserved hosts

Global options: 
  {-e|--endpoint}, --grpc-timeout-seconds, --grpc-skip-verify, --ca-file, --user, --password-file, --no-password, --token-file, --sa-key-file, --iam-endpoint, --use-metadata-credentials, --profile, --profile-file, {-v|--verbose}
  To get full description of these options run 'ydbops --help'.

Use "ydbops maintenance [command] --help" for more information about a command.

TODO: documentation :)

Member

pixcc commented Nov 20, 2024

I added some docs to the article about CMS.

ydb-platform/ydb#11793

Contributor

Jorres commented Dec 9, 2024

Thanks! I recently merged a lot of maintenance subcommand fixes (#44), so I consider this feature done.

(I'm currently preparing a PR to the ydbops docs, but since you added some docs to the article about CMS, it seems I only have to bring the ydbops reference up to date.)

Jorres closed this as completed Dec 9, 2024