RFC: Orchestrator as a native Vitess component #6612

Closed
sougou opened this issue Aug 23, 2020 · 4 comments
Labels: Component: VTorc, Type: Announcement, Type: RFC

sougou commented Aug 23, 2020

Introduction

With #6582, we are starting on a Vitess-specific fork of the Orchestrator that was originally created by @shlomi-noach.

Previously, Vitess provided an integration that loosely cooperated with an Orchestrator installation managing automated failover for the MySQL instances underlying Vitess. However, users often encountered corner cases where the two sides could make conflicting changes, potentially leaving the system in an inconsistent state.

We intend to resolve these issues and provide a smooth and unified experience to the end-user.

Approach

Many approaches were considered:

  1. Build a Vitess-native failover mechanism.
  2. Copy portions of the Orchestrator code into Vitess.
  3. Extend Orchestrator to allow tighter integration and cooperation with Vitess.
  4. Modify Orchestrator to become a first-class Vitess component.

Of the above approaches, we have decided on option 4. We reached this conclusion after studying the Orchestrator code and realizing that it is written with the intent to allow customization.

Architecturally, Orchestrator is closest in function to vtctld, and the two have some overlapping functionality: for example, stop-replica vs. StopReplica, and graceful-master-takeover vs. PlannedReparentShard.

On the other hand, for massive clusters with more than 10K nodes, it may be better to divide Orchestrator's responsibility by keyspace or shard. In that situation the design starts to deviate from the vtctld model; this requires further brainstorming.

To fork or not to fork

We debated extending the existing Orchestrator code base to accommodate Vitess versus forking the code. The conclusion was to fork, for the following reasons:

  • Model: Orchestrator’s failover code depends on the discovered replication graph that connects replicas to their master. This model is substantially different from Vitess, where cluster membership is strictly dictated by the keyspace and shard an instance belongs to.
  • Responsibility: Orchestrator is primarily responsible only for performing failovers, leaving the rest of cluster management to Vitess. This split gave rise to many corner cases, especially when Vitess would wire up replication using a view of the world that differed from what Orchestrator saw. For this integration to work smoothly, Orchestrator needs to take responsibility for all functions of the cluster.
  • Opinions: Orchestrator has taken a hands-off approach, with the intent to accommodate however administrators want to configure their MySQL instances. In Vitess, the expectation is that a cluster should be mostly self-managed, with only certain critical state transitions performed manually. So, much of the flexibility that Orchestrator accommodates is not allowed in Vitess.
  • Velocity: If we fork, we can move faster because there is no fear of breaking use cases that Vitess does not support.

End Product

The end product could be a unified component that merges Orchestrator and vtctld. We could either preserve the old vtctld name or give the new binary a brand new name.

This will extend vtctld’s responsibility: it will not only execute user commands, but also initiate actions against the cluster to remedy any problems that can be solved automatically.

Also, vtctld is designed to cooperate with any other vtctlds present, which is in line with Orchestrator's philosophy. Although the two use different methods today, they will eventually be unified.

As for the web UI, Orchestrator has a pleasant and intuitive graphical interface. It will be used as the starting point, and the vtctld functionality will be adapted around it.

There should be links to vttablets from Orchestrator, similar to how there are links from the vtctld web UI to vttablets today.

Orchestrator also has a more structured CLI interface, which will likely be preferred over the one that vtctld organically grew.

If we decide not to merge with vtctld, we'll have to consider either keeping Orchestrator as a separate component or seeing what it would look like running inside vttablet.

Pluggable Durability

Users have varying durability requirements. Additionally, cloud environments define complex topologies; for example, AWS has availability zones and regions. To meet these requirements and to future-proof ourselves, we will implement a pluggable durability policy, based on an extension of FlexPaxos.

Essentially, this pluggability will allow us to address all existing known use cases as well as future ones we have not heard of. More details will follow.
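
As a rough illustration (not the final API), such a policy could be a small Go interface that answers the FlexPaxos-style questions: how many semi-sync acks make a commit durable, which replicas may provide those acks, and how eligible a tablet is for promotion. All names below are hypothetical:

```go
package durability

// PromotionRule captures how eligible a tablet is to become the new master.
type PromotionRule int

const (
	MustNot PromotionRule = iota
	Neutral
	Prefer
)

// TabletInfo is a minimal, hypothetical view of a tablet for policy decisions.
type TabletInfo struct {
	Cell     string // e.g. an AWS availability zone or region
	Keyspace string
	Shard    string
}

// Policy is a hypothetical pluggable durability policy in the FlexPaxos
// spirit: it decides how many semi-sync acks make a commit durable, which
// replicas may send those acks, and who may be promoted.
type Policy interface {
	SemiSyncAckers(master TabletInfo) int
	IsReplicaSemiSync(master, replica TabletInfo) bool
	PromotionRule(tablet TabletInfo) PromotionRule
}

// crossCellPolicy is one example: a commit is durable only once a replica in
// a different cell has acked it, so a single cell failure cannot lose
// acknowledged writes.
type crossCellPolicy struct{}

func (crossCellPolicy) SemiSyncAckers(master TabletInfo) int { return 1 }

func (crossCellPolicy) IsReplicaSemiSync(master, replica TabletInfo) bool {
	return master.Cell != replica.Cell
}

func (crossCellPolicy) PromotionRule(tablet TabletInfo) PromotionRule { return Neutral }
```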

Details

This is a preliminary drill-down of what we think is needed. So, it’s presented as somewhat of a laundry list. We will refine this as we move forward.

MVP

A Minimum Viable Product is being developed. The latest code is currently in https://github.com/planetscale/vitess/tree/ss-oc2-vt-mode. It’s likely to evolve. Those who wish to preview upcoming work can look for branches in the PlanetScale repo that are prefixed with ss-oc. At this point, the MVP has the following functionality:

  • Automatically discover all tablets and MySQL instances by walking the Vitess topology (see the discovery sketch after this list).
  • Assign default rules for discovered items:
      • Promotion rules
      • DataCenters
  • Automatically initialize a brand new cluster (replacing InitShardMaster).
  • When a new vttablet comes up with its MySQL, wire it up to replicate from the current master.
  • GracefulTakeOver performs the same function as PlannedReparentShard.
  • Dead-master recovery, with Orchestrator directly performing the TabletExternallyReparented work. This does not yet include everything that EmergencyReparentShard does today, but should be extended.
  • Heal the cluster if tablets and MySQL instances do not agree on who is the master.
  • Ensure masters and replicas have the correct read/write permissions and are connected correctly.
  • Discovery scans per cell and handles the case of a whole cell being down.
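
The discovery bullet above is essentially a walk over cells. A minimal sketch, using a hypothetical stand-in for the Vitess topo API (the real topo.Server methods differ), could look like this:

```go
package discovery

import (
	"context"
	"fmt"
)

// Tablet is a minimal, hypothetical view of a discovered tablet.
type Tablet struct {
	Alias     string
	Keyspace  string
	Shard     string
	MysqlHost string
	MysqlPort int
}

// Topo is a hypothetical slice of the topology API; it only models the walk.
type Topo interface {
	GetCellNames(ctx context.Context) ([]string, error)
	GetTabletsByCell(ctx context.Context, cell string) ([]Tablet, error)
}

// DiscoverAll walks every cell and returns all tablets (and hence the MySQL
// instances behind them), which is how the MVP seeds Orchestrator's view of
// the cluster instead of relying on replication-graph crawling.
func DiscoverAll(ctx context.Context, ts Topo) ([]Tablet, error) {
	cells, err := ts.GetCellNames(ctx)
	if err != nil {
		return nil, fmt.Errorf("list cells: %w", err)
	}
	var all []Tablet
	for _, cell := range cells {
		tablets, err := ts.GetTabletsByCell(ctx, cell)
		if err != nil {
			// A whole cell may be down; skip it rather than failing discovery.
			continue
		}
		all = append(all, tablets...)
	}
	return all, nil
}
```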

The MVP is only a starting point. There is more work that needs to be done:

Code culture

  • Port existing integration tests, make them pass, and make them part of CI/CD.
  • Golint, govet, etc.: these are currently failing and need to be fixed.
  • Write new tests to meet Vitess coding standards (coverage, etc.).
  • Context: Orchestrator does not use Go contexts, but Vitess calls need them, so we need to wire up all the TODO contexts (see the sketch after this list).
  • Change to use Vitess code infrastructure: servenv, logging, etc.
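
For the context item above, the intended pattern is simply to thread a caller-supplied context through to every RPC instead of calling with context.TODO(). A small hypothetical sketch (tmClient is a stand-in, not the real tablet-manager client):

```go
package vtorcctx

import (
	"context"
	"time"
)

// tmClient is a hypothetical stand-in for the tablet-manager client; the real
// Vitess RPCs all take a context as their first argument.
type tmClient interface {
	StopReplication(ctx context.Context, tabletAlias string) error
}

// stopReplica shows the intended pattern: rather than context.TODO(), the
// caller's context is passed down and a bounded child context is derived for
// the RPC.
func stopReplica(ctx context.Context, tmc tmClient, alias string) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return tmc.StopReplication(ctx, alias)
}
```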

Metadata

  • Most of the metadata state internally maintained by Orchestrator can be automatically reconstructed through Vitess’ discovery functionality. However, there are a few settings that need to be persisted, for example states like Maintenance and Downtime Windows. These should be moved into the Vitess topo (a sketch follows).
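
As a sketch of the direction (the record shape, topo path, and topoConn interface below are hypothetical, not the real topo.Conn API), a maintenance window could be serialized and stored under a per-tablet path in the global topo:

```go
package maintenance

import (
	"context"
	"encoding/json"
	"path"
	"time"
)

// Window is a hypothetical record for a maintenance/downtime window that
// would be persisted in the Vitess topo instead of Orchestrator's backend DB.
type Window struct {
	TabletAlias string    `json:"tablet_alias"`
	Owner       string    `json:"owner"`
	Reason      string    `json:"reason"`
	EndsAt      time.Time `json:"ends_at"`
}

// topoConn is a minimal stand-in for a topo connection; it only illustrates
// where the state would live.
type topoConn interface {
	Create(ctx context.Context, filePath string, contents []byte) error
}

// SaveWindow writes the window under a per-tablet path in the global topo.
func SaveWindow(ctx context.Context, conn topoConn, w Window) error {
	data, err := json.Marshal(w)
	if err != nil {
		return err
	}
	filePath := path.Join("maintenance", w.TabletAlias)
	return conn.Create(ctx, filePath, data)
}
```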

Improvements

  • Use Vitess’ GTID code locally to compare GTIDs instead of calling into MySQL (see the sketch after this list).
  • Some retries may need exponential back-off.
  • Use persistent connections to tabletmanager to avoid frequent reconnects.
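
For the GTID item above, a minimal sketch using Vitess's MySQL 5.6 GTID-set parser (the exact package path differs between Vitess versions) might be:

```go
package gtidcmp

import (
	"fmt"

	"vitess.io/vitess/go/mysql"
)

// MoreAdvanced reports whether GTID set a contains everything in b, i.e. a is
// at least as progressed as b, without asking MySQL (e.g. via GTID_SUBSET)
// over the wire.
func MoreAdvanced(a, b string) (bool, error) {
	setA, err := mysql.ParseMysql56GTIDSet(a)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", a, err)
	}
	setB, err := mysql.ParseMysql56GTIDSet(b)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", b, err)
	}
	return setA.Contains(setB), nil
}
```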

Changes in behavior

  • DeadMaster: currently uses file:pos to find the most progressed replica. This will not work if replicas can be detached; it should be changed to use GTID comparison.
  • DeadMaster: should use all instances of a cluster instead of just the ones in the replication graph.
  • Topology Recovery should feed into Vitess’s event log.
  • Enhanced security: Orchestrator should receive replication credentials from vttablets.

New functionality

  • Ability to mark a cell as down.
  • Subscribe to tablet streaming health to get real-time updates on changes in tablet state (see the sketch after this list).
  • Let vttablet’s streaming health provide all the MySQL info, instead of Orchestrator fetching from two sources?
  • Watch vttablets and reparent on their failure (even if MySQL is up).
  • Support multi-schema: multiple vttablets pointed at the same MySQL instance.
  • Rewind/backtrack: detect and fix extraneous GTIDs from recovered masters.
  • Tags: publish discoverable tags to the vttablets that allow vtgates to change their routing accordingly, instead of relying on fixed tablet types.
  • Pluggable durability.
  • Go through vttablet for all actions so everything is uniform.
  • Look at moving the recovery logic inside vttablet to better conform to the consensus doc.
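
For the streaming-health item above, here is a simplified model of what subscribing could look like; HealthUpdate and healthStreamer are hypothetical stand-ins for querypb.StreamHealthResponse and the real streaming RPC:

```go
package healthwatch

import (
	"context"
	"log"
)

// HealthUpdate is a simplified, hypothetical view of vttablet's streaming
// health response.
type HealthUpdate struct {
	TabletAlias           string
	Serving               bool
	ReplicationLagSeconds uint32
}

// healthStreamer models the streaming-health RPC: it invokes the callback for
// every update until the context is cancelled or the stream breaks.
type healthStreamer interface {
	StreamHealth(ctx context.Context, callback func(HealthUpdate) error) error
}

// watchTablet consumes real-time health updates instead of polling, feeding
// them into the recovery engine's view of tablet state.
func watchTablet(ctx context.Context, hs healthStreamer, onChange func(HealthUpdate)) error {
	return hs.StreamHealth(ctx, func(hu HealthUpdate) error {
		log.Printf("tablet %s serving=%v lag=%ds", hu.TabletAlias, hu.Serving, hu.ReplicationLagSeconds)
		onChange(hu)
		return nil // returning an error would terminate the stream
	})
}
```
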
@shlomi-noach

Adding:

  • Leadership: today, orchestrator itself is HA, with either a shared MySQL backend or the raft consensus protocol to support leadership. In vitess the backend will be the Global Topo, which will implicitly dictate orchestrator leadership in the face of network isolation.
  • Controlled leadership: assuming a stable and healthy topo, we want to be able to ask for a specific orchestrator node to become the leader. For example, it's common in many setups to have one cell (cloud vendor? AZ?) be "more active" than the others, e.g. a cell where more master servers are located; the reason is typically to get low latency with local apps/clients. In such a scenario it is also advisable, though not strictly required, to have the orchestrator leader in the same cell.
  • Scaling: orchestrator will use a SQLite backend. SQLite does not allow any concurrency, and the number of interactions from orchestrator to its backend DB is linear in the number of probed servers. Thus, at some point, the database becomes a bottleneck; somewhere in the thousands of servers this should become a problem (with a MySQL backend, orchestrator is known to sustain more than that). However, the good news is that we may not actually need persistence. The vitess-orc merger allows orchestrator to collect almost all information while probing the servers (e.g. promotion rules). Information like downtime can either persist to the Global Topo or be collected from other vtctld/orchestrator nodes upon startup. This means we can use SQLite with the file::memory:?cache=shared setting, which makes it fully in-memory and should provide more capacity. I haven't tried this at scale yet.
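
For reference, a minimal sketch of the shared in-memory SQLite setting mentioned above, assuming the common mattn/go-sqlite3 driver (the driver Orchestrator bundles may differ); note that the database vanishes when the last connection closes:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver registered under the name "sqlite3"
)

func main() {
	// With cache=shared, all connections in this process see the same
	// in-memory database; it disappears when the last connection closes.
	db, err := sql.Open("sqlite3", "file::memory:?cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	// Keep at least one idle connection around so the shared in-memory DB
	// survives between operations.
	db.SetMaxIdleConns(1)

	// Illustrative schema and query; table name is only an example.
	if _, err := db.Exec(`CREATE TABLE database_instance (hostname TEXT, port INT)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO database_instance VALUES ('tablet-0001', 3306)`); err != nil {
		log.Fatal(err)
	}
	var n int
	if err := db.QueryRow(`SELECT COUNT(*) FROM database_instance`).Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("rows: %d", n)
}
```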


GuptaManan100 commented Mar 31, 2022

Adding the TODO list for tracking -

  • Logging to use vt/log
  • servenv package use and populate debug/vars, etc
  • Disable/Enable Global recoveries
  • Go context usage
  • Port integration tests
  • Code coverage tests
  • Unused code cleanup
  • Support vttablet failures
  • Durability policy work tracked in RFC - Durability and consensus in Vtorc #8975
  • Subscribe to tablet health-checks
  • Parameterize timeouts & intervals
  • Route queries through vttablets
  • Federation
  • Find alternate for AttemptRecoveryRegistration and throttling
  • Remove replication manager
  • API and Config cleanup
  • Audit Topology Recovery
  • VTOrc with backups


rbranson commented Mar 31, 2022

  • Scaling: orchestrator will use a SQLite backend. SQLite does not allow any concurrency. the number of interactions from orchestrator to its backend DB is linear with the number of probed servers. Thus, at some point, the database becomes a bottleneck. Somewhere in the thousands this should become a problem (with MySQL backend, orchestrator is known to be able to sustain more than that). However, good news is that we may not actually need persistence. The vitess-orc merger allows orchestrator to collect almost all information while probing the servers (e.g. promotion-rules). Information like downtime can either persist to Global Topo or be collected from other vtctld/orchestrator nodes upon startup. This means we can use SQLite with file::memory:?cache=shared setting, which makes it fully in-mem, and should provide more capacity. I haven't tried this at scale as yet.

If we still want to keep persistence, SQLite scales quite well in WAL mode (https://sqlite.org/wal.html), which just needs to be enabled. It still allows only one writer at a time, but reads can occur concurrently with any other operation. It eliminates the "database is busy" errors that plague anyone trying to use SQLite from multiple threads or processes. The in-memory mode will produce fewer locked/busy errors, but it still has a coarse-grained locking model, so it can still produce them; WAL mode uses snapshot isolation to avoid them. That also means that if we're explicitly or implicitly dependent on SQLite's very strict default isolation, we may need more explicit concurrency management (i.e. SELECT FOR UPDATE).

Edit: looks like Orchestrator already does this! https://github.com/openark/orchestrator/blob/de1b1ecd3f65cac447b24067d99dc56a8109fd82/go/db/db.go#L354

WAL mode scales very well, it can scale to hundreds of thousands of processes and millions of reads/s, but maybe not at the same time? :D
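
For reference, enabling WAL mode from Go is just a PRAGMA on the connection; the sketch below assumes the mattn/go-sqlite3 driver and a hypothetical database file name:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver registered under the name "sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "orchestrator.sqlite3")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write-ahead logging: readers no longer block the single writer, which
	// removes most "database is locked/busy" errors under concurrency.
	if _, err := db.Exec(`PRAGMA journal_mode=WAL`); err != nil {
		log.Fatal(err)
	}
	// Optionally wait up to 5s for a lock instead of failing immediately.
	if _, err := db.Exec(`PRAGMA busy_timeout=5000`); err != nil {
		log.Fatal(err)
	}
	log.Println("WAL mode enabled")
}
```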

@shlomi-noach

Edit: looks like Orchestrator already does this!

It does!
