Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing minimum number of required connections to 0 for bulk,state,rec… #13111

Closed
wants to merge 1 commit into from

Conversation

gashutos
Copy link
Contributor

@gashutos gashutos commented Apr 8, 2024

Description

We need to allow particular channels/connections to be 0 in some cases.
Like,

  1. For dedicated coordinator node set up, coordinator nodes doesn;t require recovery channels
  2. Data/coordinator nodes doesn't have to initialize for state channels.
  3. Dedicated master nodes doesn't require bulk in case they are not put up for coordinator nodes.

But other channels like regular & ping requres at least a connection/channel so not modifying them.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…overy channels

Signed-off-by: Chaitanya Gohel <gashutos@amazon.com>
Copy link
Contributor

github-actions bot commented Apr 8, 2024

Compatibility status:

Checks if related components are compatible with change 0e5a177

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git]

Copy link
Contributor

github-actions bot commented Apr 8, 2024

✅ Gradle check result for 0e5a177: SUCCESS

Copy link

codecov bot commented Apr 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.40%. Comparing base (b15cb0c) to head (0e5a177).
Report is 139 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13111      +/-   ##
============================================
- Coverage     71.42%   71.40%   -0.02%     
- Complexity    59978    60430     +452     
============================================
  Files          4985     5026      +41     
  Lines        282275   284525    +2250     
  Branches      40946    41215     +269     
============================================
+ Hits         201603   203172    +1569     
- Misses        63999    64534     +535     
- Partials      16673    16819     +146     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why we need to make this change, what does this impact? (I don't see a linked issued)

Lets add some tests to verify the expected behavior - IMO this looks like integration tests / end to end tests are needed

@gashutos
Copy link
Contributor Author

I'm not clear on why we need to make this change, what does this impact? (I don't see a linked issued)

Lets add some tests to verify the expected behavior - IMO this looks like integration tests / end to end tests are needed

@peternied This is just for validation at settings level, any int setting is initialized with along with its default value and minimum expected value. So w.r.t the cluster formation criteria, few connections might have 0 number of connections.

@peternied
Copy link
Member

@gashutos I still don't understand why does the setting need updating?

@gashutos
Copy link
Contributor Author

@gashutos I still don't understand why does the setting need updating?

We are not updating setting value here.

If you see below snippet, the 2nd argument is the setting value, while 3rd argument is the minimum value for that setting in case someone sets the setting value to 0, this validation will fail.
We are just relaxing this validation value to 0 here, but keeping setting value to 2 only.

public static final Setting<Integer> CONNECTIONS_PER_NODE_RECOVERY = intSetting(
        "transport.connections_per_node.recovery",
        2,
        1,
        Setting.Property.NodeScope
    );

@peternied
Copy link
Member

@gashutos You are explaining what you did - this I understand, thank you. I am not familiar with these settings in this part of the project, imagine I just started working alongside you on OpenSearch - how would you explain the reason behind making this change?

@gashutos
Copy link
Contributor Author

gashutos commented Apr 16, 2024

@gashutos You are explaining what you did - this I understand, thank you. I am not familiar with these settings in this part of the project, imagine I just started working alongside you on OpenSearch - how would you explain the reason behind making this change?

Oh yes, sure :)
So to start with, currently there is limitation on number of maximum nodes OpenSearch can support, i.e 200 is what AWS OpenSearch managed service supports, definitely there are cases where we have 400, 600 number of nodes and things are working fine but as we grow with number of nodes let say 1000, there are things start behaving bottlenecks like master (single master for all 1000+ nodes), connections (we have mesh topology where all nodes connect with rest of other nodes in cluster with multiple channels/connections).

This change is towards the step where we are looking to reduce chattiness between nodes for current mesh topolofy by reducing number of connections.
These settings implies connection profile which gets used to initiate connection with other nodes in cluster.
like

  1. Recovery -> 2 connections
  2. Bulk -> 3 conections
  3. Regular -> 6 connection
  4. Ping -> 1 connection
  5. State -> 1 connection.

So we create total of 13 connections between each pair of nodes when the cluster bootstrap happens or OpenSearch process restarts. And for 2000 node cluster, we end up creating 26000 connections between 2 nodes and that will happen for all 2000 nodes with mesh.
And we are trying to reduce it.

Also these numbers were decided way back (last change was somewhere in 2014 !), so with OpenSearch/ElasticSearch product evolution, we have dedicated coordinator nodes to support scaling in Indexing/Search, remote store to reduce replication between nodes and other chattiness like recovery from remote, etc.

With this change, someone who want to support large number of nodes, and if they decide to use dedicated coordinator nodes (which is recommended to scale indexing/search), they can have luxury to set recovery connections to 0 for dedicated coordinator nodes since they dont hosts any shards and hence recoveries wont happen on those nodes. Same things go for master.

Here respective settings, the minimum values I am changing in here to 0 to give flexibility for users.

@peternied
Copy link
Member

@gashutos Thanks - that was exactly what I was looking for,

I think we should have a change log entry - as well as documentation update associated with this change (that might be tiny).

For the change log entry (and this pull request title) what do you think of the following?
[Cluster Scaling] Lower the minimum number of required connection per node

@rramachand21
Copy link
Member

@gashutos what version of opensearch are we targetting this for? can we fill in the assignees, labels, projects etc

@gashutos gashutos self-assigned this Apr 17, 2024
@gashutos gashutos closed this May 1, 2024
@gashutos
Copy link
Contributor Author

gashutos commented May 1, 2024

Not required, closing...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants