Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix context propagation with vertx blocking #90

Merged
merged 2 commits into from
Jun 1, 2023
Merged

Conversation

slinkydeveloper
Copy link
Contributor

Should fix #89

@github-actions
Copy link
Contributor

github-actions bot commented May 31, 2023

Unit Test Results

110 tests  ±0   110 ✔️ ±0   8s ⏱️ ±0s
  13 suites ±0       0 💤 ±0 
  13 files   ±0       0 ±0 

Results for commit 2a11b3c. ± Comparison against base commit ca4617d.

♻️ This comment has been updated with latest results.

@slinkydeveloper slinkydeveloper added this to the 1A milestone May 31, 2023
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the problem @slinkydeveloper. I think the fix makes sense. What I would like to better understand is why the problem hasn't surfaced earlier. All syscalls should be called from a blocking thread when using the blocking SDK, right? Shouldn't then any syscall fail because the syscall interface is not set on the context? I guess my question boils down to understanding the threading model of the SDK.

@@ -113,7 +113,7 @@ public void handle(HttpServerRequest request) {
method,
otelContext,
isBlockingService ? currentContextExecutor(vertxCurrentContext) : null,
isBlockingService ? blockingExecutor() : null);
isBlockingService ? blockingExecutor(vertxCurrentContext) : null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add to the commit message an explanation why this is necessary.

@@ -27,6 +28,8 @@ public void greet(GreetingRequest request, StreamObserver<GreetingResponse> resp
var count = ctx.get(COUNTER).orElse(0L) + 1;
ctx.set(COUNTER, count);

ctx.sleep(Duration.ofSeconds(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why hasn't this problem been surfaced earlier? We do use the blocking SDK in our e2e tests, right? Shouldn't all syscalls go through the blocking executor and thereby need access to the syscalls interface?

Why is the sleep here needed? Shouldn't ctx.get already trigger a syscall?

Copy link
Contributor Author

@slinkydeveloper slinkydeveloper Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason we found this bug only now is the fact that in this test we're recreating the restateContext() everytime and that we invoke this service endpoint multiple times concurrently: https://github.com/restatedev/e2e/pull/99/files#diff-69049c4f12bc8afee2cc966d87f192481ebf34604ef755ba743beed70f83a68aR40. Deep in the code, restateContext() ends up accessing the Vert.x context to retrieve the syscalls object.

Copy link
Contributor

@tillrohrmann tillrohrmann Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess I don't fully understand how Vertx sets the context. I would have assumed that it should fail on the first syscall if run from the blocking thread w/o a specified context and not only if doing this multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem is, with this test i cannot really reproduce this bug, because i need some sufficient parallelism. of invocations to the service endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about our e2e tests that use the blocking Java SDK?

None of them are recreating the restateContext() and don't have a sufficient amount of parallel invocations to the service endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hunch here is that vertx.runBlocking always reschedules on the current context the first time, but after the first hop it might reschedule on another context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the problem only manifests with many parallel invocations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't have an answer :( I tried to reproduce this locally without much success. I guess it could be that vert.x is scheduling the executeBlocking in a different context under load, but i cannot prove it.

I would suggest to merge this PR, as the fix is probably correct anyway, and keep the issue open to investigate this further in future. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I think it is ok to close the issue when merging this PR. If the problem re-emerges, then we can always re-open the issue.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging.

@slinkydeveloper slinkydeveloper merged commit 9be2216 into main Jun 1, 2023
@slinkydeveloper slinkydeveloper deleted the issues/89 branch June 1, 2023 14:17
slinkydeveloper added a commit that referenced this pull request May 15, 2024
….2f3e461f

2f3e461f Workflow api changes (#91)
1fa71a5b Add versioning info (#90)

git-subtree-dir: sdk-core/src/main/service-protocol
git-subtree-split: 2f3e461f0cc9fcd4e078d4ddf63f177248829949
tillrohrmann added a commit to tillrohrmann/sdk-java that referenced this pull request May 16, 2024
…1898426

1898426 Update min/max of protocol version fields in endpoint_manifest_schema
08871e0 Introduce service.Version and discovery.Version enums
2f3e461 Workflow api changes (restatedev#91)
1fa71a5 Add versioning info (restatedev#90)

git-subtree-dir: sdk-core/src/main/service-protocol
git-subtree-split: 1898426fc98c16d704068594cd54394912845ff7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants