Fix waged instance capacity npe on new resource #2969

Open
wants to merge 3 commits into
base: master

@GrantPSpencer GrantPSpencer commented Nov 25, 2024

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:

A full description of the NPE can be found in issue #2891. Here is a brief summary:

  1. _instanceCapacityMap is calculated with the WAGED resources in the cluster.
  2. If you remove all WAGED resources, the previous _instanceCapacityMap is still present. It is only recalculated under certain conditions.
  3. If you then add a new WAGED resource, the _instanceCapacityMap is not recalculated before it is used. This stale _instanceCapacityMap can lead to an NPE.

This PR addresses the above issue by ensuring the _instanceCapacityMap is null whenever there are no WAGED resources in the cluster. This leads to the map being recalculated before it is used when a new WAGED resource is added.
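
The sequence above can be illustrated with a minimal, self-contained sketch. It is a toy model, not Helix code: the class `CapacityCacheSketch`, its methods, and the placeholder capacity value are all hypothetical, and the "recalculate only when the map is null" rule is an assumed simplification of the "certain conditions" mentioned in the summary.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the controller cache; not the real Helix class.
class CapacityCacheSketch {
  // Null means "not computed yet", which forces a recalculation before use.
  private Map<String, Integer> capacityByResource;
  private final boolean clearWhenEmpty;

  CapacityCacheSketch(boolean clearWhenEmpty) {
    this.clearWhenEmpty = clearWhenEmpty;
  }

  // Models one pipeline run with the current set of WAGED resources.
  void runPipeline(Set<String> wagedResources) {
    if (wagedResources.isEmpty()) {
      if (clearWhenEmpty) {
        // The behavior this PR describes: drop the map so it cannot go stale.
        capacityByResource = null;
      }
      return; // capacity calculation is skipped
    }
    // Assumed simplification: recalculate only when no map exists.
    if (capacityByResource == null) {
      capacityByResource = new HashMap<>();
      for (String resource : wagedResources) {
        capacityByResource.put(resource, 100); // placeholder capacity value
      }
    }
  }

  boolean knows(String resource) {
    return capacityByResource != null && capacityByResource.containsKey(resource);
  }

  // Replays the scenario: add resource A, remove everything, add resource B.
  static boolean replay(boolean clearWhenEmpty) {
    CapacityCacheSketch cache = new CapacityCacheSketch(clearWhenEmpty);
    cache.runPipeline(Collections.singleton("A"));
    cache.runPipeline(Collections.<String>emptySet());
    cache.runPipeline(Collections.singleton("B"));
    return cache.knows("B");
  }

  public static void main(String[] args) {
    System.out.println(replay(false)); // prints false: the stale map only covers A
    System.out.println(replay(true));  // prints true: the map was rebuilt for B
  }
}
```

In this model, a lookup for the new resource against the stale map finds nothing, which is the kind of gap that could surface as an NPE downstream. The actual failure path is the one described in issue #2891.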

Tests

  • The following tests are written for this issue:
    New test class: TestWagedNPE

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestWagedNPE -pl=helix-core

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.836 s - in org.apache.helix.integration.TestWagedNPE
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  35.815 s
[INFO] Finished at: 2024-11-25T10:09:46-08:00
[INFO] ------------------------------------------------------------------------

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
    N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@@ -356,6 +356,11 @@ void handleResourceCapacityCalculation(ClusterEvent event, ResourceControllerDat
      CurrentStateOutput currentStateOutput) {
    Map<String, Resource> resourceMap = event.getAttribute(AttributeName.RESOURCES.name());
    if (skipCapacityCalculation(cache, resourceMap, event)) {
      // Ensure instance capacity is null if there are no resources. This prevents using a stale map when all resources
      // are removed and then a new resource is added.
      if (resourceMap == null || resourceMap.isEmpty()) {

nit: can we merge the 2 if?

@GrantPSpencer GrantPSpencer Dec 19, 2024

I originally tried to merge the two ifs, since if (resourceMap == null || resourceMap.isEmpty()) is essentially a redundant check: the same check is already made in skipCapacityCalculation.

However, I was concerned with putting the responsibility of clearing the cache within skipCapacityCalculation because I did not want to give the method a side effect that may not be obvious to the user.

Do you have any thoughts on how to approach this to simplify the logic?

@xyuanlu xyuanlu Jan 17, 2025

if (skipCapacityCalculation(cache, resourceMap, event) && (resourceMap == null || resourceMap.isEmpty())) {
  cache.clearWagedCapacityProviders();
}
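
As a rough check on the two shapes discussed in this thread, the sketch below compares the nested form from the diff with the merged form suggested above. It models only the decision to clear the cache. The predicate `skip` and its `otherReason` flag are hypothetical, and the sketch assumes the predicate is side-effect free and already includes the null/empty check, as stated earlier in the thread.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical model of the clearing decision only; not Helix code.
class ClearOnEmptySketch {
  // Assumed pure predicate: it answers "should we skip?" and mutates nothing.
  static boolean skip(Map<String, String> resources, boolean otherReason) {
    return otherReason || resources == null || resources.isEmpty();
  }

  // Nested form: the emptiness check sits inside the skip branch.
  static boolean clearsNested(Map<String, String> resources, boolean otherReason) {
    if (skip(resources, otherReason)) {
      if (resources == null || resources.isEmpty()) {
        return true; // the cache would be cleared here
      }
    }
    return false;
  }

  // Merged form: both conditions combined in a single if.
  static boolean clearsMerged(Map<String, String> resources, boolean otherReason) {
    return skip(resources, otherReason) && (resources == null || resources.isEmpty());
  }

  public static void main(String[] args) {
    Map<String, String> empty = Collections.emptyMap();
    Map<String, String> one = Collections.singletonMap("db", "WAGED");
    for (boolean otherReason : new boolean[] {false, true}) {
      System.out.println(clearsNested(null, otherReason) == clearsMerged(null, otherReason));
      System.out.println(clearsNested(empty, otherReason) == clearsMerged(empty, otherReason));
      System.out.println(clearsNested(one, otherReason) == clearsMerged(one, otherReason));
    }
  }
}
```

Under these assumptions both forms clear the cache in exactly the same cases, and the side effect stays at the call site instead of inside the predicate. Whatever else the outer if block does in the real method is not modeled here.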
