Add functionality to forcefully kill an instance #2898

Merged (4 commits) on Sep 5, 2024

@GrantPSpencer GrantPSpencer commented Sep 3, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

N/A - New feature to forcefully kill an instance

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR adds a new feature: a HelixAdmin method and a Helix-rest API command to forcefully kill an instance. The kill is performed by marking the instance's operation as UNKNOWN and then deleting its LIVEINSTANCE znode. The feature is intended for scenarios where the participant is in an unrecoverable state but still holds an active connection to ZK. Marking the instance as UNKNOWN removes it from calculations, and subsequently deleting the LIVEINSTANCE znode causes the controller to treat it as OFFLINE. This bypasses the requirement that the instance process the downward state transition before topstate handoff can occur.

My current findings indicate that the LIVEINSTANCE znode will only be recreated on ZK session establishment, which occurs on initial connection and after session expiration.
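
To make the ordering of the two steps concrete, here is a minimal in-memory sketch. It is a toy model, not Helix code: `ForceKillModel`, `join`, `forceKill`, and `isAssignable` are hypothetical names, the `Operation` enum is a simplified stand-in for the instance operation, and the set of strings stands in for the LIVEINSTANCE znodes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the force-kill sequence. All names here are hypothetical;
// nothing in this class is part of the Helix API.
class ForceKillModel {
    // Simplified stand-in for the instance operation.
    enum Operation { ENABLE, UNKNOWN }

    // Stand-in for the per-instance operation stored in instance config.
    final Map<String, Operation> instanceOperation = new HashMap<>();
    // Stand-in for the set of LIVEINSTANCE znodes.
    final Set<String> liveInstances = new HashSet<>();

    // A participant connects: it gets a config entry and a live-instance entry.
    void join(String instance) {
        instanceOperation.put(instance, Operation.ENABLE);
        liveInstances.add(instance);
    }

    // Step 1: mark the instance UNKNOWN so it drops out of calculations.
    // Step 2: delete its live-instance entry so it is treated as offline.
    // Returns false if the instance is not known to the cluster.
    boolean forceKill(String instance) {
        if (!instanceOperation.containsKey(instance)) {
            return false;
        }
        instanceOperation.put(instance, Operation.UNKNOWN);
        liveInstances.remove(instance);
        return true;
    }

    // An instance is considered for assignment only if it is live and enabled.
    boolean isAssignable(String instance) {
        return liveInstances.contains(instance)
            && instanceOperation.get(instance) == Operation.ENABLE;
    }

    public static void main(String[] args) {
        ForceKillModel cluster = new ForceKillModel();
        cluster.join("host_1");
        cluster.join("host_2");
        cluster.forceKill("host_1");
        System.out.println("host_1 assignable: " + cluster.isAssignable("host_1"));
        System.out.println("host_2 assignable: " + cluster.isAssignable("host_2"));
    }
}
```

In this model the killed instance stays excluded even if its live entry were recreated later, because the UNKNOWN marking is checked independently of liveness.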

The following code changes were made:

  • helix-core/src/main/java/org/apache/helix/HelixAdmin.java: Added forceKillInstance method to the HelixAdmin interface.
  • helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java: Implemented the forceKillInstance method in the ZKHelixAdmin class.
  • helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/PerInstanceAccessor.java: Added the forceKillInstance command to the REST API updateInstance endpoint. Called via:
https://<helix-rest-url>/namespaces/<namespace>/clusters/<cluster>/instances/<instance>?command=forceKillInstance
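
As a rough illustration of calling the REST command, the sketch below builds the request with the JDK's `java.net.http` classes without sending it. The class and method names, the example host, namespace, cluster, and instance values are all hypothetical, and the use of POST with an empty body is an assumption about the updateInstance endpoint rather than something stated in this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; not part of Helix or helix-rest.
class ForceKillRequest {
    // Fills in the URL pattern from the PR description.
    // Path segments are used as-is, so they must already be URL-safe.
    static URI forceKillUri(String baseUrl, String namespace,
                            String cluster, String instance) {
        return URI.create(String.format(
            "%s/namespaces/%s/clusters/%s/instances/%s?command=forceKillInstance",
            baseUrl, namespace, cluster, instance));
    }

    // Assumption: the command is issued as a POST with no request body.
    static HttpRequest forceKillRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        URI uri = forceKillUri("https://helix-rest.example.com",
            "default", "MyCluster", "host_1");
        HttpRequest request = forceKillRequest(uri);
        // The request is only built and printed here, never sent.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need an `HttpClient` and whatever authentication the deployment requires, neither of which is covered here.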

Also includes miscellaneous changes:

  • helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestInstanceOperation.java: Corrected the logger class reference.
  • helix-rest/src/test/java/org/apache/helix/rest/server/TestPartitionAssignmentAPI.java: Corrected the logger class reference.
  • helix-rest/src/test/java/org/apache/helix/rest/server/AbstractTestClass.java: Refactored resource creation logic. Added addParticipant and dropParticipant methods. Also added another test cluster to isolate testPerInstanceAccessor and testInstancesAccessor from each other.
  • helix-rest/src/test/java/org/apache/helix/rest/server/TestInstancesAccessor.java: Now using isolated test cluster

Tests

  • The following tests are written for this issue:
  • helix-core/src/test/java/org/apache/helix/integration/TestForceKillInstance.java for HelixAdmin API
$ mvn test -o -Dtest=TestForceKillInstance -pl=helix-core

[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.027 s - in org.apache.helix.integration.TestForceKillInstance
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:16 min
[INFO] Finished at: 2024-09-03T11:37:35-07:00
[INFO] ------------------------------------------------------------------------

  • testForceKillInstance in helix-rest/src/test/java/org/apache/helix/rest/server/TestPerInstanceAccessor.java for Helix-Rest API
$ mvn test -o -Dtest=TestInstancesAccessor -pl=helix-rest  

[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.395 s - in org.apache.helix.rest.server.TestInstancesAccessor
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:40 min
[INFO] Finished at: 2024-09-03T11:35:04-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
    N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@GrantPSpencer
Contributor Author

Currently triggering manual runs on my fork's CI to confirm no flaky tests

@junkaixue
Contributor

@GrantPSpencer ready to checkin?

@GrantPSpencer
Contributor Author

Pull request approved by: @junkaixue
Commit message: Add functionality to forcefully kill an instance

@junkaixue junkaixue merged commit 719b722 into apache:master Sep 5, 2024
5 checks passed