
Add max retry while creating streams in subscription #594

Closed
ajaaym wants to merge 10 commits into master from subscripton_rety

Conversation


@ajaaym commented May 6, 2019

Fixes #550 #318

  • Tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 6, 2019
@ajaaym ajaaym force-pushed the subscripton_rety branch from b1303ca to b82589a Compare May 6, 2019 16:02
@ajaaym ajaaym requested a review from callmehiphop May 6, 2019 16:03
codecov bot commented May 6, 2019

Codecov Report

Merging #594 into master will decrease coverage by 0.09%.
The diff coverage is 95.65%.


@@           Coverage Diff           @@
##           master   #594     +/-   ##
=======================================
- Coverage   98.09%    98%   -0.1%     
=======================================
  Files          13     13             
  Lines         841    853     +12     
  Branches      175    180      +5     
=======================================
+ Hits          825    836     +11     
  Misses          1      1             
- Partials       15     16      +1
Impacted Files           Coverage Δ
src/message-stream.ts    99% <95.65%> (-1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

/*!
 * Time to reset retries counter.
 */
const RESET_RETRIES_COUNT_TIME_MS = 30000;
Contributor

Where does 30 seconds come from? We might want to get some input from @anguillanneuf here, but for some reason I seem to remember the PubSub team requesting that we retry for up to 5 minutes.

Author

@callmehiphop The idea is that if streams start failing, almost all active and newly created streams will fail within that timeout (30s by default), and we retry up to the maximum number of retries specified (3 by default). 10–30s is enough to determine whether there is a connection or authentication issue at startup, or a token-refresh / deadline-exceeded problem after the initial setup. Retrying for 5 minutes is too long for this.
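
A minimal sketch of the retry-window idea described above (the class and its fields are hypothetical; the constant and the defaults follow the values mentioned in this PR):

    // Sketch: retry stream creation while failures fall inside a short window,
    // and reset the retry budget once the window elapses without new errors.
    const RESET_RETRIES_COUNT_TIME_MS = 30000; // 30s window from this PR
    const MAX_RETRIES = 3;                     // default assumed in the discussion

    class RetryWindow {
      private retries = 0;
      private firstErrorTime = 0;

      /** Call on each stream failure; returns true if we should retry. */
      shouldRetry(now = Date.now()): boolean {
        if (
          this.firstErrorTime === 0 ||
          now - this.firstErrorTime > RESET_RETRIES_COUNT_TIME_MS
        ) {
          // No failures within the window: start a fresh window and budget.
          this.firstErrorTime = now;
          this.retries = 0;
        }
        return this.retries++ < MAX_RETRIES;
      }
    }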

@callmehiphop (Contributor) commented May 9, 2019

I can understand the need to modify our retry mechanisms to support unauthenticated requests, but this change makes me a little nervous. My concern is that we'd start seeing an increase in errors that kill the subscriber but would otherwise be retried within that 5-minute duration. I'd be interested to know how other clients handle retrying this specific RPC.

@sduskis any thoughts here?

Contributor

Adding @jadekler @jskeet for comments. I don't know if the 90 seconds found in GAPIC applies here.


We definitely need to have a more holistic look at this topic. A couple of notes:

  • I think an ideal state to be in is that the library creates no deadlines, instead opting to allow the user to specify a deadline. In some languages you can do this with contexts, but not all languages have contexts. We (client libs team, AC tools team) are investigating this topic later today.
  • The list of codes to retry below seems too broad. I've noticed this in several pubsub clients, and I've got an action item to chat with @sduskis about this. That said, my immediate feedback is that I'm a thumbs down on this PR adding UNAUTHENTICATED to the retry codes. It's a signal to the user that their config is borked and should not be retried.

@@ -301,8 +298,14 @@ export class MessageStream extends PassThrough {
*/
private _onEnd(stream: PullStream, status: StatusObject): void {
this._removeStream(stream);

if (RETRY_CODES.includes(status.code)) {
if (Date.now() - this._timeSinceFirstError > this._options.timeout!) {
Contributor

So to clarify, this will wait to see if the stream was open for more than 30 seconds (or whatever it was set to) and if so we consider it a success?

Author

It actually checks whether there are any errors on the stream; it keeps trying to create new streams for 30 seconds (or up to the max retries) and fails afterwards.

}
if (
RETRY_CODES.includes(status.code) &&
this._retries++ < this._options.maxRetries!
Contributor

It isn't really uncommon for all the streams to end around the same time. If that were to happen, I think we would likely blow past the default maxRetries pretty quickly and end up with too few streams open at any given time.

Author

_fillStreamPool actually fills up the stream pool, so it will create all the streams it needs to refill the pool even on the last retry. The scenario you mentioned can only happen if there is a single connected stream while every other old and newly created stream keeps failing, which seems unlikely. Even in that case, it will recreate all the remaining streams if the last non-failing stream fails.
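
A rough sketch of the refill behavior described here (the pool shape, maxStreams, and createStream below are illustrative placeholders, not the actual message-stream internals):

    // Sketch: refill the pool back to its configured size, however many
    // streams are currently missing, regardless of which retry we are on.
    interface StreamPool {
      maxStreams: number;
      streams: Set<unknown>;
      createStream(): Promise<unknown>;
    }

    async function fillStreamPool(pool: StreamPool): Promise<void> {
      const missing = pool.maxStreams - pool.streams.size;
      for (let i = 0; i < missing; i++) {
        pool.streams.add(await pool.createStream());
      }
    }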

Contributor

I don't think that is always true. I've seen scenarios where the ending of streams can be staggered ever so slightly and in some cases _fillStreamPool will need to be called multiple times.

Author

The chance of that scenario could be reduced by decreasing the timeout and increasing maxRetries. WDYT of adding a call to _fillStreamPool in the keepAliveHandler when there have been no stream failures within the specified timeout?

Contributor

I like the idea of improving our refill strategy, but I think putting it in the keep-alive handler is a little heavy-handed. Even just wrapping fillStreamPool in something like setImmediate might do the trick.
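
Presumably something along these lines, where the refill is deferred by one event-loop turn so that a burst of near-simultaneous closures is handled by a single refill (the callback shape is illustrative, not the library's API):

    // Sketch: coalesce a burst of stream closures into one deferred refill
    // instead of calling the refill routine once per closed stream.
    let refillScheduled = false;

    function scheduleRefill(fill: () => Promise<void>): void {
      if (refillScheduled) return;   // a refill is already queued for this turn
      refillScheduled = true;
      setImmediate(() => {
        refillScheduled = false;
        fill().catch(err => console.error('stream pool refill failed', err));
      });
    }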

Author

fillStreamPool is actually an async function, and it already gets executed in a separate event-loop turn from the one in which the stream failed.

Contributor

Sorry, the thought I was trying to convey is that we might want to slightly stagger the call to fill the pool so that more streams have a chance to close.

Author

@callmehiphop Added a fix to fill up the pool if a stream errors out after retries are exhausted and there are fewer active streams in the pool than expected. Also changed the retry-count logic so that multiple concurrent stream failures count as a single retry.
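
A hedged sketch of the "multiple failures count as a single retry" change (the burst window and variable names are illustrative, not the PR's exact code):

    // Sketch: several streams failing within the same short burst only
    // consume one retry from the budget.
    const BURST_WINDOW_MS = 1000; // illustrative burst window
    let lastCountedFailure = 0;
    let retries = 0;

    function onStreamFailure(now = Date.now()): void {
      if (now - lastCountedFailure > BURST_WINDOW_MS) {
        retries++;               // first failure of this burst: count it once
        lastCountedFailure = now;
      }
      // later failures in the same burst do not increment the counter
    }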

* @param {object} client The gRPC client to wait for.
* @returns {Promise}
*/
private async _waitForClientReady(client: ClientStub): Promise<void> {
Contributor

I know we talked about deleting this, so I was expecting it. Do we also need to enable channel pooling?

Author

That option is available by default:

    const pubsub = new PubSub({
      'grpc_gcp.apiConfig': {
        channelPool: {
          maxSize: 5,
        },
      },
    });

Contributor

Oh, this might be problematic. My understanding is that we're making the switch to the pure-JS version of gRPC soon, and it doesn't support grpc-gcp. @alexander-fenster can you confirm that?

@alexander-fenster (Contributor) commented May 7, 2019

That's right. If you need to use this grpc_gcp functionality you'll need to do two things:

  1. pass the C-core gRPC implementation as a parameter to the client constructor;
  2. set the grpc_gcp options in the Pub/Sub code, since they will be removed from gax.

I'm going to do the same for Spanner. If Pub/Sub is also affected, we can postpone its upgrade to gax v1.0.0 until this is fixed in Pub/Sub code (together with Spanner).

Contributor

Thanks @alexander-fenster! Is the plan to keep Spanner on the C version until gcp support finds its way to grpc-js?

Contributor

Yes. Also, we will remove all grpc-gcp code from gax (actually, already removed, just not yet released as v1.0.0) since it's not a good place for it (and an extra dependency on grpc), so libraries that want to use grpc-gcp should depend on it directly.

Contributor

Good to know. I think we're going to want to postpone the gax upgrade here unless @ajaaym can think of another fix that doesn't involve channel pooling.

Author

@callmehiphop WDYT of an option in PubSub's getClient to provide a new client vs. the cached one? Then, depending on maxStreams, we can create maxStreams/100 + 1 clients.
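
To illustrate the arithmetic (assuming the 100-concurrent-streams-per-channel limit mentioned later in this thread; the helper is hypothetical):

    // Sketch: how many gRPC clients are needed so that no single channel
    // carries more than ~100 concurrent streams.
    const STREAMS_PER_CLIENT = 100;

    function clientsNeeded(maxStreams: number): number {
      return Math.floor(maxStreams / STREAMS_PER_CLIENT) + 1;
    }

    // e.g. maxStreams = 250  ->  Math.floor(250 / 100) + 1 = 3 clients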

Contributor

I think that sounds good. Out of curiosity, have we tested against grpc-js to see if it is affected by this?

Author

I was playing around with that yesterday, and it looks like it's not that stable yet. Yes, there is a limit of 100 concurrent streams. Also, I don't see an option to set the flow-control limit, which would be needed for load-balancing ultra-slow consumers. When testing with 100 streams I was randomly getting the error below along with DEADLINE_EXCEEDED. I think it's better we hold off on the upgrade until we have this option and it has been well tested under different conditions.

Error [ERR_STREAM_WRITE_AFTER_END]: write after end
    at writeAfterEnd (_stream_writable.js:243:12)
    at ClientHttp2Stream.Writable.write (_stream_writable.js:291:5)
    at filterStack.sendMessage.then (/Users/ajaymovalia/work/node/testwithexpress2/node_modules/@grpc/grpc-js/build/src/call-stream.js:396:34)
Emitted 'error' event at:
    at onwriteError (_stream_writable.js:431:12)
    at onwrite (_stream_writable.js:456:5)
    at process._tickCallback (internal/process/next_tick.js:63:19)

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label May 13, 2019
@ajaaym ajaaym closed this May 14, 2019
feywind pushed a commit to feywind/nodejs-pubsub that referenced this pull request Nov 12, 2024
Labels
cla: yes This human has signed the Contributor License Agreement. 🚨 This issue needs some love.

Successfully merging this pull request may close these issues.

Many parallel subscriptions in one process => DEADLINE_EXCEEDED upon ack
9 participants