
Worker Deployment Version Drainage Status #7158

Merged: 18 commits into versioning-3.1 from cdf/drainage, Jan 29, 2025

Conversation

@carlydf (Contributor) commented Jan 24, 2025

What changed?

Add a DrainageStatus child workflow to the worker deployment system.

Why?

To periodically update the version workflow with results from visibility.

How did you test it?

Functional tests. Currently broken (see TODO comment in code).

Potential risks

Documentation

Is hotfix candidate?

@carlydf requested a review from a team as a code owner January 24, 2025 10:06
@carlydf marked this pull request as draft January 24, 2025 10:06
service/worker/workerdeployment/workflow.go (outdated; resolved)
service/worker/workerdeployment/workflow.go (outdated; resolved)
service/worker/workerdeployment/version_workflow.go (outdated; resolved)
Comment on lines 283 to 286
if args.GetCurrentSinceTime().AsTime().Equal(d.GetVersionState().GetCurrentSinceTime().AsTime()) &&
args.GetRampingSinceTime().AsTime().Equal(d.GetVersionState().GetRampingSinceTime().AsTime()) {
res := &deploymentspb.SyncVersionStateResponse{VersionState: d.VersionState}
return temporal.NewApplicationError("no change", errNoChangeType, res)
Member:

thinking out loud: I believe the idea is to use the function handleSyncState when we want to notify changes related to both ramping and current, right?

I am having a hard time convincing myself that this validator is doing the right thing now that we are trying to add ramping - in my head, the two operations are, to a certain extent, mutually exclusive. If a version x is already current and the user calls SetCurrent on it again, shouldn't we be evaluating only whether GetCurrentSinceTime is equal to the local state? Why should we check the ramping time?

I know we are going to set the value of ramping_since_time when SetCurrent is called too, but how are we going to pass in the same value stored inside d.GetVersionState().GetRampingSinceTime() from SetCurrent?

Contributor Author:

I think when SetCurrent is called, ramping_since_time will be set to nil, and we want to sync both values. Similarly, if SetRamp is called, current_since_time will be set to nil, which we want to sync across the workflows/task queues that need to know that.

Contributor Author:

I changed this to make more sense

Member:

I like the change more - no need to evaluate the timing since current was set; just check whether it was current/ramping
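
For illustration only, a minimal sketch of what "just check if it was current/ramping" could look like; the receiver type and helper name are assumptions, not code from this PR:

	// Hypothetical sketch of the revised check: compare whether the version's
	// current/ramping status would change, rather than comparing the raw
	// since-timestamps.
	func (d *VersionWorkflowRunner) syncChangesNothing(args *deploymentspb.SyncVersionStateUpdateArgs) bool {
		state := d.GetVersionState()
		wasCurrent := state.GetCurrentSinceTime() != nil
		wasRamping := state.GetRampingSinceTime() != nil
		isCurrent := args.GetCurrentSinceTime() != nil
		isRamping := args.GetRampingSinceTime() != nil
		return wasCurrent == isCurrent && wasRamping == isRamping
	}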

// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
// Nil if not ramping.
google.protobuf.Timestamp ramping_since_time = 3;
Member:

thinking out loud: if we are planning on using version_workflow.handleSyncState to also set a version to ramping, this should have a ramp percentage set, right?

Contributor Author:

yes, I think so! I was thinking of waiting to add that until the ramp PR, but we could pre-emptively add the field now. I'll at least add a TODO.

WorkerDeploymentWorkflowType = "temporal-sys-worker-deployment-workflow"
WorkerDeploymentVersionWorkflowType = "temporal-sys-worker-deployment-version-workflow"
WorkerDeploymentWorkflowType = "temporal-sys-worker-deployment-workflow"
WorkerDeploymentCheckDrainageWorkflowType = "temporal-sys-worker-deployment-check-drainage-workflow"
Member:

temporal-sys-worker-deployment-check-drainage-workflow -> temporal-sys-worker-deployment-drainage-workflow

no need to keep it super verbose; wdyt?

Contributor Author:

nice, I like that

service/worker/workerdeployment/client.go (resolved)
Comment on lines 372 to 375
if wasAcceptingNewWorkflows && !isAcceptingNewWorkflows {
state.DrainageInfo = &deploymentpb.VersionDrainageInfo{}
workflow.ExecuteChildWorkflow(ctx, WorkerDeploymentCheckDrainageWorkflowType, d.a.namespace)
}
Member:

say we have a version workflow that has become draining and kickstarts this workflow - now, say this version workflow continues-as-new (CaN) - I might have missed this, but I don't see a PARENT_CLOSE_POLICY for this child workflow

IMHO, we should:

  1. set the policy to terminate so that the child workflow terminates when we CaN
  2. in the version_workflow.run function, if the drainage state is set to draining, we simply kickstart this child workflow again, which shall continue execution

This will also call for the TODO statement you have inside of drainage_workflow.go to be implemented: if this version workflow were to start accepting executions again, we would have to send a signal to the child workflow to stop carrying out execution

wdyt?

Contributor Author:

yes, I need to add the code handling CaN of the parent and also handling termination of the child workflow if the parent version starts accepting traffic again
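
A rough sketch of that direction using the Go SDK's child workflow options; the drainage-info field access on restart is an assumption for illustration, not the code as merged:

	// Sketch: start the drainage child with a terminate-on-parent-close policy
	// so it is terminated when the version workflow closes or continues-as-new.
	childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_TERMINATE,
	})
	workflow.ExecuteChildWorkflow(childCtx, WorkerDeploymentCheckDrainageWorkflowType, d.a.namespace)

	// Sketch: after continue-as-new, the run function can restart the child
	// if the version is still draining (field access assumed for illustration).
	if d.VersionState.GetDrainageInfo().GetStatus() == enumspb.VERSION_DRAINAGE_STATUS_DRAINING {
		workflow.ExecuteChildWorkflow(childCtx, WorkerDeploymentCheckDrainageWorkflowType, d.a.namespace)
	}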

Comment on lines 404 to 409
DeploymentName: d.VersionState.DeploymentName,
Version: d.VersionState.Version,
CurrentSinceTime: d.VersionState.CurrentSinceTime,
RampingSinceTime: d.VersionState.RampingSinceTime,
CreateTime: d.VersionState.CreateTime,
DrainageInfo: d.VersionState.DrainageInfo,
@Shivs11 (Member) commented Jan 27, 2025:

Is there a point in altering the workflow memo of the version workflow to store this? IMO, the 3.1 APIs don't require querying visibility to receive information from version workflows directly. We only require that these version workflows return their local state during DescribeVersion calls.

I think we should alter the worker-deployment workflow memo to have these additional fields you have added, since the deployment workflow will now be the source of truth for things like "current-version", "ramping-version", etc. (RoutingInfo, basically)

wdyt?

Contributor Author:

After our discussion on what the memo is used for (to give info via DescribeWorkflowExecution and ListWorkflowExecutions), I removed everything here, because DescribeVersion uses a query, not the memo. I have a hunch that we might find it useful to put drainage info in here at some point, but we should only add that when we have a use case for it.
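
For context, a minimal sketch of the query-based path described here; the query name, return type, and handler shape are assumptions, not necessarily what this PR registers:

	// Sketch: serve DescribeVersion from the workflow's local state via a query
	// handler instead of the visibility memo (query name and state type assumed).
	err := workflow.SetQueryHandler(ctx, "describe-version", func() (*deploymentspb.VersionLocalState, error) {
		return d.VersionState, nil
	})
	if err != nil {
		return err
	}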

@carlydf marked this pull request as ready for review January 28, 2025 07:08
google.protobuf.Timestamp routing_update_time = 4;
// If this version is the current version of its deployment.
bool is_current = 5;
// Range: [0, 100]. Must be zero if is_current is true. Must be non-zero if `version.version`
Member:

may have to change this comment since we don't have a version.version (or we may have in the near future based on Shahab's latest API refinements)

Comment on lines 281 to 291
go s.pollFromDeployment(ctx, tv2)
s.EventuallyWithT(func(t *assert.CollectT) {
a := assert.New(t)
resp, err := s.FrontendClient().DescribeWorkerDeploymentVersion(ctx, &workflowservice.DescribeWorkerDeploymentVersionRequest{
Namespace: s.Namespace().String(),
Version: tv2.BuildID(),
})
a.NoError(err)
a.Equal(tv2.DeploymentSeries(), resp.GetWorkerDeploymentVersionInfo().GetDeploymentName())
a.Equal(tv2.BuildID(), resp.GetWorkerDeploymentVersionInfo().GetVersion())
}, time.Second*5, time.Millisecond*200)
Member:

this may not be what you are intending to do - I assume you wanted to start a new version workflow under the same deployment name, right? However, both tv1 and tv2 are using tv.BuildID(), which generates the same BuildID, so only one version workflow gets registered

IMHO, maybe try using .WithBuildIDNumber(number) for both these separate tvs?

Contributor Author:

Since I now initialize them with this, the build IDs are different whenever I call tv.BuildID(). Thanks for catching this!

	tv1 := testvars.New(s).WithBuildIDNumber(1)
	tv2 := testvars.New(s).WithBuildIDNumber(2)

@@ -192,3 +207,173 @@ func (s *DeploymentVersionSuite) Name() string {
farm.Fingerprint32([]byte(fullName)),
)
}

func (s *DeploymentVersionSuite) TestDrainageStatus_SetCurrentVersion_NoOpenWFs() {
Contributor Author:

This tests the new parts of the feature (version stops accepting new executions --> child workflow starts --> child sets the status to DRAINING and sleeps for visibilityGracePeriod --> child queries visibility, counts 0 open workflows, and sets the status to DRAINED).

It does not test the visibility query, since there are no open workflows, but the visibility query code has not changed from the pre-release, so that should all still work. There are TODOs to test it in the comments below.
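
A condensed sketch of that flow, with hypothetical helper names (reportDrainage, countOpenWorkflowsFromVisibility) standing in for the real logic in drainage_workflow.go:

	// Sketch of the drainage child workflow: mark DRAINING, wait out the
	// visibility grace period, then poll visibility until no open workflows
	// remain and mark DRAINED.
	func CheckDrainage(ctx workflow.Context, namespace string) error {
		reportDrainage(ctx, enumspb.VERSION_DRAINAGE_STATUS_DRAINING) // hypothetical: tell the parent we are draining
		if err := workflow.Sleep(ctx, visibilityGracePeriod); err != nil {
			return err
		}
		for {
			open := countOpenWorkflowsFromVisibility(ctx, namespace) // hypothetical visibility activity wrapper
			if open == 0 {
				reportDrainage(ctx, enumspb.VERSION_DRAINAGE_STATUS_DRAINED)
				return nil
			}
			if err := workflow.Sleep(ctx, refreshInterval); err != nil {
				return err
			}
		}
	}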

@Shivs11 (Member) left a comment:

haven't yet peeked at the functional tests, but the code looks nice - thanks for the annoying refactorings required by the latest API refinements too - I shall take a peek at the tests sometime later today, but merging to iterate faster

Comment on lines +156 to +172
message SyncVersionStateUpdateArgs {
// Last time `current_since_time`, `ramping_since_time`, or `ramp_percentage` of this version changed.
google.protobuf.Timestamp routing_update_time = 4;

message SetCurrent {
// If last_became_current_time is present, then set the deployment's
// last_became_current_time to it and set is_current true. If it's missing,
// set is_current false.
google.protobuf.Timestamp last_became_current_time = 1;
}
// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
// Nil if not current.
google.protobuf.Timestamp current_since_time = 5;

// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: 'Since' captures the field semantics despite being a preposition. --)
// Nil if not ramping. Updated when the version first starts ramping, not on each ramp change.
google.protobuf.Timestamp ramping_since_time = 6;

// Range: [0, 100]. Must be zero if the version is not ramping (i.e. `ramping_since_time` is nil).
// Can be in the range [0, 100] if the version is ramping.
float ramp_percentage = 7;
Member:

nit: these new fields are nice, but they can be numbered from 1; since we haven't released this feature to customers yet, the change will still be backwards compatible

if info.Status == enumspb.VERSION_DRAINAGE_STATUS_DRAINED {
return nil
}
workflow.Sleep(ctx, refreshInterval)
Member:

this whole new file is really nice and the workflow makes sense - I wonder, in the future (certainly not right now), if we want to implement a backoff strategy when the workflow sleeps

just some food for thought
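
If a backoff were added later, it could be as simple as growing the sleep between refreshes up to a cap; a sketch under assumed names (refreshDrainageInfo, maxRefreshInterval), not part of this PR:

	// Sketch: exponential backoff for the refresh sleep, capped at maxRefreshInterval.
	interval := refreshInterval
	for {
		info := refreshDrainageInfo(ctx) // hypothetical: query visibility and update the parent
		if info.Status == enumspb.VERSION_DRAINAGE_STATUS_DRAINED {
			return nil
		}
		if err := workflow.Sleep(ctx, interval); err != nil {
			return err
		}
		interval *= 2
		if interval > maxRefreshInterval {
			interval = maxRefreshInterval
		}
	}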

},
WorkflowVersioningMode: 0, // todo
CreateTime: now,
RoutingUpdateTime: now,
Member:

hmm, I think I might have spotted a lag in our implementation. Assume the following flow:

RegisterTaskQueueWorker -> updateWithStartWorkerDeploymentVersion -> handleRegisterWorker

In the highlighted code, we set routingUpdateTime to now, which is right. This makes the local state of the version workflow also have routingUpdateTime set to now.

Now, when the update takes place in handleRegisterWorker, we set the routingUpdateTime for the task queue being registered by calling time.Now() again. This may cause a lag between the local state's update time and the time being sent to the task queue.

In handleRegisterWorker:236

	// initial data
	data := &deploymentspb.DeploymentVersionData{
		Version:           d.VersionState.Version,
		RoutingUpdateTime: timestamppb.Now(),
		CurrentSinceTime:  nil, // not current
		RampingSinceTime:  nil, // not ramping
		RampPercentage:    0,   // not ramping
		FirstPollerTime:   args.FirstPollerTime,
	}

again, not a big deal given that routing will still work - but some food for thought on whether we should take the extra step and avoid the lag

Contributor Author:

hm yeah I think it would be better to pass the original routing time along
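
One way to remove the lag, sketched against the field names quoted above; reusing the state's RoutingUpdateTime here is an assumption about how the original time would be passed along:

	// Sketch: reuse the routing update time already recorded in the version's
	// local state (or carried in the update args) instead of calling
	// timestamppb.Now() a second time inside handleRegisterWorker.
	data := &deploymentspb.DeploymentVersionData{
		Version:           d.VersionState.Version,
		RoutingUpdateTime: d.VersionState.RoutingUpdateTime, // same instant the caller stamped
		CurrentSinceTime:  nil, // not current
		RampingSinceTime:  nil, // not ramping
		RampPercentage:    0,   // not ramping
		FirstPollerTime:   args.FirstPollerTime,
	}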

@Shivs11 (Member) commented Jan 29, 2025

merging this to speed up development - more tests to follow
cc - @carlydf

@Shivs11 merged commit 3d40535 into versioning-3.1 on Jan 29, 2025
10 checks passed
@Shivs11 deleted the cdf/drainage branch January 29, 2025 14:55