RFC: Orchestrator as a native Vitess component #6612

Closed
sougou opened this issue Aug 23, 2020 · 4 comments
Labels: Component: VTorc, Type: Announcement, Type: RFC

sougou commented Aug 23, 2020

Introduction

With #6582, we are starting on a Vitess-specific fork of the Orchestrator that was originally created by @shlomi-noach.

Previously, Vitess provided an integration that loosely cooperated with an Orchestrator installation managing automated failover for the MySQL instances underlying Vitess. However, users often encountered corner cases where the two sides could make conflicting changes, potentially leaving the system in an inconsistent state.

We intend to resolve these issues and provide a smooth and unified experience to the end-user.

Approach

Many approaches were considered:

  1. Build a Vitess-native failover mechanism.
  2. Copy portions of the Orchestrator code into Vitess.
  3. Extend Orchestrator to allow tighter integration and cooperation with Vitess.
  4. Modify Orchestrator to become a first-class Vitess component.

Of the above approaches, we have decided on option 4. We reached this conclusion after studying the Orchestrator code and realizing that it is written with the intent to allow customization.

Architecturally, Orchestrator is closest in function to vtctld, and the two have some overlapping functionality: for example, stop-replica vs. StopReplica, and graceful-master-takeover vs. PlannedReparentShard.

On the other hand, for massive clusters with more than 10K nodes, it may be better to divide Orchestrator's responsibility by keyspace or shard. In that situation the design starts to deviate from the vtctld model; this requires further brainstorming.

To fork or not to fork

We debated extending the existing Orchestrator code base to accommodate Vitess versus forking the code. The conclusion was to fork, for the following reasons:

  • Model: Orchestrator’s failover code depends on the discovered replication graph that connects replicas to their master. This model is substantially different from Vitess, where cluster membership is strictly dictated by the keyspace and shard an instance belongs to.
  • Responsibility: Orchestrator is primarily responsible only for performing failovers, leaving the rest of cluster management to Vitess. This split gave rise to many corner cases, especially when Vitess would wire up replication using a view of the world that differed from what Orchestrator saw. For this integration to work smoothly, Orchestrator needs to take responsibility for all functions of the cluster.
  • Opinions: Orchestrator has taken a hands-off approach, with the intent to accommodate however administrators want to configure their MySQL instances. In Vitess, the expectation is that a cluster should be mostly self-managed, with only certain critical state transitions performed manually. So, much of the flexibility that Orchestrator accommodates is not allowed in Vitess.
  • Velocity: If we fork, we can move faster because there is no fear of breaking use cases that Vitess does not support.

End Product

The end product could be a unified component that merges Orchestrator and vtctld. We could either preserve the old vtctld name or give the new binary a brand new name.

This will extend vtctld’s responsibility: it will not only execute user commands, but also initiate actions against the cluster to remedy any problems that can be solved automatically.

Also, vtctld is designed to cooperate with any other vtctlds present, which is in line with Orchestrator's philosophy. Although the two use different methods today, they will eventually be unified.

As for the web UI, Orchestrator has a pleasant and intuitive graphical interface. It will be used as the starting point, and the vtctld functionality will be adapted around it.

There should be links to vttablets from Orchestrator, similar to how there are links from the vtctld web UI to vttablets today.

Orchestrator also has a more structured CLI interface, which will likely be preferred over the one that vtctld organically grew.

If we decide not to merge with vtctld, we'll have to consider either keeping Orchestrator as a separate component or seeing what it would look like running inside vttablet.

Pluggable Durability

Users have varying durability requirements. Additionally, cloud environments define complex topologies; for example, AWS has availability zones and regions. To meet these requirements and to future-proof ourselves, we will implement a pluggable durability policy, based on an extension of FlexPaxos.

Essentially, this pluggability will allow us to address all existing known use cases as well as future ones we have not heard of. More details will follow.
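
As a rough illustration (not the final API), such a policy could be a small Go interface that answers the FlexPaxos-style questions: how many semi-sync acks make a commit durable, which replicas may provide those acks, and how eligible a tablet is for promotion. All names below are hypothetical:

```go
package durability

// PromotionRule captures how eligible a tablet is to become the new master.
type PromotionRule int

const (
	MustNot PromotionRule = iota
	Neutral
	Prefer
)

// TabletInfo is a minimal, hypothetical view of a tablet for policy decisions.
type TabletInfo struct {
	Cell     string // e.g. an AWS availability zone or region
	Keyspace string
	Shard    string
}

// Policy is a hypothetical pluggable durability policy in the FlexPaxos
// spirit: it decides how many semi-sync acks make a commit durable, which
// replicas may send those acks, and who may be promoted.
type Policy interface {
	SemiSyncAckers(master TabletInfo) int
	IsReplicaSemiSync(master, replica TabletInfo) bool
	PromotionRule(tablet TabletInfo) PromotionRule
}

// crossCellPolicy is one example: a commit is durable only once a replica in
// a different cell has acked it, so a single cell failure cannot lose
// acknowledged writes.
type crossCellPolicy struct{}

func (crossCellPolicy) SemiSyncAckers(master TabletInfo) int { return 1 }

func (crossCellPolicy) IsReplicaSemiSync(master, replica TabletInfo) bool {
	return master.Cell != replica.Cell
}

func (crossCellPolicy) PromotionRule(tablet TabletInfo) PromotionRule { return Neutral }
```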

Details

This is a preliminary drill-down of what we think is needed. So, it’s presented as somewhat of a laundry list. We will refine this as we move forward.

MVP

A Minimum Viable Product is being developed. The latest code is currently in https://github.com/planetscale/vitess/tree/ss-oc2-vt-mode. It’s likely to evolve. Those who wish to preview upcoming work can look for branches in the PlanetScale repo that are prefixed with ss-oc. At this point, the MVP has the following functionality:

  • Automatically discover all tablets and MySQL instances by walking the Vitess topology (see the discovery sketch after this list).
  • Assign default rules for discovered items:
      • Promotion rules
      • DataCenters
  • Automatically initialize a brand new cluster (replacing InitShardMaster).
  • When a new vttablet comes up with its MySQL, wire it up to replicate from the current master.
  • GracefulTakeOver performs the same function as PlannedReparentShard.
  • Dead-master recovery, with Orchestrator directly performing the TabletExternallyReparented work. This does not yet include everything that EmergencyReparentShard does today, but should be extended.
  • Heal the cluster if tablets and MySQL instances do not agree on who is the master.
  • Ensure masters and replicas have the correct read/write permissions and are connected correctly.
  • Discovery scans per cell and handles the case of a whole cell being down.
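
The discovery bullet above is essentially a walk over cells. A minimal sketch, using a hypothetical stand-in for the Vitess topo API (the real topo.Server methods differ), could look like this:

```go
package discovery

import (
	"context"
	"fmt"
)

// Tablet is a minimal, hypothetical view of a discovered tablet.
type Tablet struct {
	Alias     string
	Keyspace  string
	Shard     string
	MysqlHost string
	MysqlPort int
}

// Topo is a hypothetical slice of the topology API; it only models the walk.
type Topo interface {
	GetCellNames(ctx context.Context) ([]string, error)
	GetTabletsByCell(ctx context.Context, cell string) ([]Tablet, error)
}

// DiscoverAll walks every cell and returns all tablets (and hence the MySQL
// instances behind them), which is how the MVP seeds Orchestrator's view of
// the cluster instead of relying on replication-graph crawling.
func DiscoverAll(ctx context.Context, ts Topo) ([]Tablet, error) {
	cells, err := ts.GetCellNames(ctx)
	if err != nil {
		return nil, fmt.Errorf("list cells: %w", err)
	}
	var all []Tablet
	for _, cell := range cells {
		tablets, err := ts.GetTabletsByCell(ctx, cell)
		if err != nil {
			// A whole cell may be down; skip it rather than failing discovery.
			continue
		}
		all = append(all, tablets...)
	}
	return all, nil
}
```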

The MVP is only a starting point. There is more work that needs to be done:

Code culture

  • Port existing integration tests, make them pass, and make them part of CI/CD.
  • Golint, govet, etc.: these are currently failing and need to be fixed.
  • Write new tests to meet Vitess coding standards (coverage, etc.).
  • Context: Orchestrator does not use Go contexts, but Vitess calls need them, so we need to wire up all the TODO contexts (see the sketch after this list).
  • Change to use Vitess code infrastructure: servenv, logging, etc.
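
For the context item above, the intended pattern is simply to thread a caller-supplied context through to every RPC instead of calling with context.TODO(). A small hypothetical sketch (tmClient is a stand-in, not the real tablet-manager client):

```go
package vtorcctx

import (
	"context"
	"time"
)

// tmClient is a hypothetical stand-in for the tablet-manager client; the real
// Vitess RPCs all take a context as their first argument.
type tmClient interface {
	StopReplication(ctx context.Context, tabletAlias string) error
}

// stopReplica shows the intended pattern: rather than context.TODO(), the
// caller's context is passed down and a bounded child context is derived for
// the RPC.
func stopReplica(ctx context.Context, tmc tmClient, alias string) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return tmc.StopReplication(ctx, alias)
}
```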

Metadata

  • Most of the metadata state internally maintained by Orchestrator can be automatically reconstructed through Vitess’ discovery functionality. However, there are a few settings that need to be persisted, for example states like Maintenance and Downtime Windows. These should be moved into the Vitess topo (a sketch follows).
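
As a sketch of the direction (the record shape, topo path, and topoConn interface below are hypothetical, not the real topo.Conn API), a maintenance window could be serialized and stored under a per-tablet path in the global topo:

```go
package maintenance

import (
	"context"
	"encoding/json"
	"path"
	"time"
)

// Window is a hypothetical record for a maintenance/downtime window that
// would be persisted in the Vitess topo instead of Orchestrator's backend DB.
type Window struct {
	TabletAlias string    `json:"tablet_alias"`
	Owner       string    `json:"owner"`
	Reason      string    `json:"reason"`
	EndsAt      time.Time `json:"ends_at"`
}

// topoConn is a minimal stand-in for a topo connection; it only illustrates
// where the state would live.
type topoConn interface {
	Create(ctx context.Context, filePath string, contents []byte) error
}

// SaveWindow writes the window under a per-tablet path in the global topo.
func SaveWindow(ctx context.Context, conn topoConn, w Window) error {
	data, err := json.Marshal(w)
	if err != nil {
		return err
	}
	filePath := path.Join("maintenance", w.TabletAlias)
	return conn.Create(ctx, filePath, data)
}
```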

Improvements

  • Use Vitess’ GTID code locally to compare GTIDs instead of calling into MySQL (see the sketch after this list).
  • Some retries may need exponential back-off.
  • Use persistent connections to tabletmanager to avoid frequent reconnects.
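
For the GTID item above, a minimal sketch using Vitess's MySQL 5.6 GTID-set parser (the exact package path differs between Vitess versions) might be:

```go
package gtidcmp

import (
	"fmt"

	"vitess.io/vitess/go/mysql"
)

// MoreAdvanced reports whether GTID set a contains everything in b, i.e. a is
// at least as progressed as b, without asking MySQL (e.g. via GTID_SUBSET)
// over the wire.
func MoreAdvanced(a, b string) (bool, error) {
	setA, err := mysql.ParseMysql56GTIDSet(a)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", a, err)
	}
	setB, err := mysql.ParseMysql56GTIDSet(b)
	if err != nil {
		return false, fmt.Errorf("parse %q: %w", b, err)
	}
	return setA.Contains(setB), nil
}
```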

Changes in behavior

  • DeadMaster: currently uses file:pos to find the most progressed replica. This will not work if replicas can be detached; it should be changed to use GTID comparison.
  • DeadMaster: should use all instances of a cluster instead of just the ones in the replication graph.
  • Topology Recovery should feed into Vitess’s event log.
  • Enhanced security: Orchestrator should receive replication credentials from vttablets.

New functionality

  • Ability to mark a cell as down.
  • Subscribe to tablet streaming health to get real-time updates on changes in tablet state (see the sketch after this list).
  • Let vttablet’s streaming health provide all the MySQL info, instead of Orchestrator fetching from two sources?
  • Watch vttablets and reparent on their failure (even if MySQL is up).
  • Support multi-schema: multiple vttablets pointed at the same MySQL instance.
  • Rewind/backtrack: detect and fix extraneous GTIDs from recovered masters.
  • Tags: publish discoverable tags to the vttablets that allow vtgates to change their routing accordingly, instead of relying on fixed tablet types.
  • Pluggable durability.
  • Go through vttablet for all actions so everything is uniform.
  • Look at moving the recovery logic inside vttablet to better conform to the consensus doc.
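
For the streaming-health item above, here is a simplified model of what subscribing could look like; HealthUpdate and healthStreamer are hypothetical stand-ins for querypb.StreamHealthResponse and the real streaming RPC:

```go
package healthwatch

import (
	"context"
	"log"
)

// HealthUpdate is a simplified, hypothetical view of vttablet's streaming
// health response.
type HealthUpdate struct {
	TabletAlias           string
	Serving               bool
	ReplicationLagSeconds uint32
}

// healthStreamer models the streaming-health RPC: it invokes the callback for
// every update until the context is cancelled or the stream breaks.
type healthStreamer interface {
	StreamHealth(ctx context.Context, callback func(HealthUpdate) error) error
}

// watchTablet consumes real-time health updates instead of polling, feeding
// them into the recovery engine's view of tablet state.
func watchTablet(ctx context.Context, hs healthStreamer, onChange func(HealthUpdate)) error {
	return hs.StreamHealth(ctx, func(hu HealthUpdate) error {
		log.Printf("tablet %s serving=%v lag=%ds", hu.TabletAlias, hu.Serving, hu.ReplicationLagSeconds)
		onChange(hu)
		return nil // returning an error would terminate the stream
	})
}
```
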
@shlomi-noach

Adding:

  • Leadership: today, orchestrator itself is HA, with either a shared MySQL backend or the raft consensus protocol to support leadership. In vitess the backend will be the Global Topo, which will implicitly dictate orchestrator leadership in the face of network isolation.
  • Controlled leadership: assuming a stable and healthy topo, we want to be able to ask for a specific orchestrator node to become the leader. For example, it's common in many setups to have one cell (cloud vendor? AZ?) be "more active" than the others, e.g. a cell where more master servers are located; the reason is typically to get low latency with local apps/clients. In such a scenario it is also advisable, though not strictly required, to have the orchestrator leader in the same cell.
  • Scaling: orchestrator will use a SQLite backend. SQLite does not allow any concurrency, and the number of interactions from orchestrator to its backend DB is linear in the number of probed servers. Thus, at some point, the database becomes a bottleneck; somewhere in the thousands of servers this should become a problem (with a MySQL backend, orchestrator is known to sustain more than that). However, the good news is that we may not actually need persistence. The vitess-orc merger allows orchestrator to collect almost all information while probing the servers (e.g. promotion rules). Information like downtime can either persist to the Global Topo or be collected from other vtctld/orchestrator nodes upon startup. This means we can use SQLite with the file::memory:?cache=shared setting, which makes it fully in-memory and should provide more capacity. I haven't tried this at scale yet.
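
For reference, a minimal sketch of the shared in-memory SQLite setting mentioned above, assuming the common mattn/go-sqlite3 driver (the driver Orchestrator bundles may differ); note that the database vanishes when the last connection closes:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver registered under the name "sqlite3"
)

func main() {
	// With cache=shared, all connections in this process see the same
	// in-memory database; it disappears when the last connection closes.
	db, err := sql.Open("sqlite3", "file::memory:?cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	// Keep at least one idle connection around so the shared in-memory DB
	// survives between operations.
	db.SetMaxIdleConns(1)

	// Illustrative schema and query; table name is only an example.
	if _, err := db.Exec(`CREATE TABLE database_instance (hostname TEXT, port INT)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO database_instance VALUES ('tablet-0001', 3306)`); err != nil {
		log.Fatal(err)
	}
	var n int
	if err := db.QueryRow(`SELECT COUNT(*) FROM database_instance`).Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("rows: %d", n)
}
```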


GuptaManan100 commented Mar 31, 2022

Adding the TODO list for tracking -

  • Logging to use vt/log
  • servenv package use and populate debug/vars, etc
  • Disable/Enable Global recoveries
  • Go context usage
  • Port integration tests
  • Code coverage tests
  • Unused code cleanup
  • Support vttablet failures
  • Durability policy work tracked in RFC - Durability and consensus in Vtorc #8975
  • Subscribe to tablet health-checks
  • Parameterize timeouts & intervals
  • Route queries through vttablets
  • Federation
  • Find alternate for AttemptRecoveryRegistration and throttling
  • Remove replication manager
  • API and Config cleanup
  • Audit Topology Recovery
  • VTOrc with backups


rbranson commented Mar 31, 2022

  • Scaling: orchestrator will use a SQLite backend. SQLite does not allow any concurrency. the number of interactions from orchestrator to its backend DB is linear with the number of probed servers. Thus, at some point, the database becomes a bottleneck. Somewhere in the thousands this should become a problem (with MySQL backend, orchestrator is known to be able to sustain more than that). However, good news is that we may not actually need persistence. The vitess-orc merger allows orchestrator to collect almost all information while probing the servers (e.g. promotion-rules). Information like downtime can either persist to Global Topo or be collected from other vtctld/orchestrator nodes upon startup. This means we can use SQLite with file::memory:?cache=shared setting, which makes it fully in-mem, and should provide more capacity. I haven't tried this at scale as yet.

If we still want to keep persistence, SQLite scales quite well in WAL mode (https://sqlite.org/wal.html), which just needs to be enabled. It still allows only one writer at a time, but reads can occur concurrently with any other operation. It eliminates the "database is busy" errors that plague anyone trying to use SQLite from multiple threads or processes. The in-memory mode will produce fewer locked/busy errors, but it still has a coarse-grained locking model, so it can still produce them; WAL mode uses snapshot isolation to avoid them. That also means that if we're explicitly or implicitly dependent on SQLite's very strict default isolation, we may need more explicit concurrency management (i.e. SELECT FOR UPDATE).

Edit: looks like Orchestrator already does this! https://github.com/openark/orchestrator/blob/de1b1ecd3f65cac447b24067d99dc56a8109fd82/go/db/db.go#L354

WAL mode scales very well, it can scale to hundreds of thousands of processes and millions of reads/s, but maybe not at the same time? :D
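
For reference, enabling WAL mode from Go is just a PRAGMA on the connection; the sketch below assumes the mattn/go-sqlite3 driver and a hypothetical database file name:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver registered under the name "sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "orchestrator.sqlite3")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write-ahead logging: readers no longer block the single writer, which
	// removes most "database is locked/busy" errors under concurrency.
	if _, err := db.Exec(`PRAGMA journal_mode=WAL`); err != nil {
		log.Fatal(err)
	}
	// Optionally wait up to 5s for a lock instead of failing immediately.
	if _, err := db.Exec(`PRAGMA busy_timeout=5000`); err != nil {
		log.Fatal(err)
	}
	log.Println("WAL mode enabled")
}
```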

@shlomi-noach

Edit: looks like Orchestrator already does this!

It does!
