Migrate OSS to the new scheduler #10021

Closed
5 tasks
benmoriceau opened this issue Feb 2, 2022 · 1 comment · Fixed by #12757

Labels: area/platform, needs-triage, team/platform-move, type/enhancement

Comments

@benmoriceau
Contributor

benmoriceau commented Feb 2, 2022

Tell us about the problem you're trying to solve

A new scheduler has been implemented and added to the cloud project. The migration ended up requiring several manual interventions to unstick stuck jobs.

Describe the solution you’d like

### Migration to the new scheduler

Problem

  • Currently, whether the migration is applied is controlled by a feature flag. The feature flag is set using an environment variable, which leads to issues: the Airbyte applications do not all start at the same time, which can produce unexpected states, such as the server trying to schedule a new connection with the new scheduler while the worker is still configured to use the old one.

Solution

  • Create a new table that will store the feature flags
  • Add a default value of false to the table for OSS, and true in the cloud project
  • Change the feature flag implementation so that it is backed by the DB (see the sketch after this list)
  • Change the scheduler implementation to check whether the feature flag is activated during each periodic run instead of only at startup
  • Switch the feature flag value to true
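
A minimal sketch of what a DB-backed feature flag plus a periodic check could look like, assuming a simple `feature_flags(key, value)` table; the class, table, and flag names here (`DbFeatureFlags`, `use_new_scheduler`) are illustrative stand-ins, not the actual Airbyte implementation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import javax.sql.DataSource;

public class DbFeatureFlags {

  private final DataSource dataSource;

  public DbFeatureFlags(final DataSource dataSource) {
    this.dataSource = dataSource;
  }

  /** Reads the flag from the DB on every call so all Airbyte applications see the same, current value. */
  public boolean isEnabled(final String flagName) {
    final String sql = "SELECT value FROM feature_flags WHERE key = ?";
    try (Connection conn = dataSource.getConnection();
        PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setString(1, flagName);
      try (ResultSet rs = stmt.executeQuery()) {
        // Default to false (the proposed OSS default) when the row is missing.
        return rs.next() && rs.getBoolean("value");
      }
    } catch (final SQLException e) {
      throw new RuntimeException("Failed to read feature flag: " + flagName, e);
    }
  }

  /** Illustrative periodic loop: the flag is re-checked on every run, not only at startup. */
  public static void schedulerLoop(final DbFeatureFlags flags,
                                   final Runnable newScheduler,
                                   final Runnable legacyScheduler) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      if (flags.isEnabled("use_new_scheduler")) {
        newScheduler.run();
      } else {
        legacyScheduler.run();
      }
      Thread.sleep(Duration.ofSeconds(60).toMillis());
    }
  }
}
```

Re-reading the flag from the database on every scheduler tick means the server and worker converge on the same value without requiring a coordinated restart, which is the mismatch described in the problem above.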


@lmossman
Contributor

lmossman commented Apr 18, 2022

I talked with Benoit about this ticket; here is a summary of what we concluded:

  • If possible, we would like to avoid doing another "faux major version bump" where we require users to upgrade to an intermediate version before upgrading to a later one (the solution laid out above requires this).
  • The steps laid out above are trying to solve the problem where a new airbyte-server pod is spun up while an old airbyte-scheduler pod is still running. This can only happen if Airbyte operators do not turn off Airbyte before upgrading to a new version.
    • However, our Upgrading Airbyte documentation instructs users to first spin down their existing deployment before upgrading to a new one.
    • Therefore, if OSS users instead try to upgrade their Airbyte deployments in-flight, that behavior is already undefined and we should not try to account for that case. So, we should be able to rely on the fact that OSS deployments will be spun down before being upgraded.

Given the above points, this simplifies the migration plan here to just flipping the feature flag to true and adding back the migration logic that we had in the ServerApp at one point.
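
A rough sketch of what re-adding that startup migration logic could look like; the `ConfigRepository` and `WorkflowStarter` interfaces below are hypothetical stand-ins for the real Airbyte classes, and the actual logic lives in the linked PR.

```java
import java.util.List;
import java.util.UUID;

public class NewSchedulerMigration {

  // Hypothetical: lists the IDs of all active connections.
  interface ConfigRepository {
    List<UUID> listActiveConnectionIds();
  }

  // Hypothetical: starts a connection manager workflow on the new Temporal-based scheduler.
  interface WorkflowStarter {
    void startConnectionManagerWorkflow(UUID connectionId);
  }

  private final ConfigRepository configRepository;
  private final WorkflowStarter workflowStarter;

  public NewSchedulerMigration(final ConfigRepository configRepository, final WorkflowStarter workflowStarter) {
    this.configRepository = configRepository;
    this.workflowStarter = workflowStarter;
  }

  /**
   * Runs once at server start when the feature flag is enabled: every active connection
   * gets a workflow on the new scheduler, so nothing is left on the legacy scheduler.
   */
  public void migrateAllConnections() {
    for (final UUID connectionId : configRepository.listActiveConnectionIds()) {
      workflowStarter.startConnectionManagerWorkflow(connectionId);
    }
  }
}
```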

However, tearing down an existing Airbyte deployment and spinning it back up could leave some jobs in a strange state (e.g. a job's state is RUNNING but nothing is actually handling it). Therefore, the work to make our connection manager temporal workflows properly handle all unexpected job states (tickets coming soon) should be a prerequisite for this migration, to ensure that the new temporal scheduler can properly recover from any weird states that result from it.
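
A minimal sketch of the kind of recovery described above, assuming a hypothetical `JobPersistence` interface: any job still marked RUNNING after the teardown has no worker attached, so it is failed and left for the new scheduler to retry.

```java
import java.util.List;

public class OrphanedJobCleaner {

  enum JobStatus { PENDING, RUNNING, FAILED, SUCCEEDED }

  // Hypothetical persistence interface, for illustration only.
  interface JobPersistence {
    List<Long> listJobIdsWithStatus(JobStatus status);
    void setJobStatus(long jobId, JobStatus status);
  }

  private final JobPersistence jobPersistence;

  public OrphanedJobCleaner(final JobPersistence jobPersistence) {
    this.jobPersistence = jobPersistence;
  }

  /**
   * Run at server boot (or when a connection manager workflow starts): any job still
   * marked RUNNING has no worker attached after the upgrade, so fail it and let the
   * new scheduler create a fresh attempt.
   */
  public void failOrphanedRunningJobs() {
    for (final long jobId : jobPersistence.listJobIdsWithStatus(JobStatus.RUNNING)) {
      jobPersistence.setJobStatus(jobId, JobStatus.FAILED);
    }
  }
}
```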
