Add max retry while creating streams in subscription #594
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #594    +/-   ##
========================================
- Coverage   98.09%      98%   -0.1%
========================================
  Files          13       13
  Lines         841      853     +12
  Branches      175      180      +5
========================================
+ Hits          825      836     +11
  Misses          1        1
- Partials       15       16      +1
Continue to review full report at Codecov.
/*!
 * Time to reset retries counter.
 */
const RESET_RETRIES_COUNT_TIME_MS = 30000;
Where does 30 seconds come from? We might want to get some input from @anguillanneuf here, but for some reason I seem to remember the PubSub team requesting that we retry for up to 5 minutes.
@callmehiphop The idea is that if a stream starts failing, almost all active and newly created streams will fail within that timeout (30s by default), and we retry up to the maximum number of retries specified (3 by default). 10-30s is enough to tell whether there is a connection/authentication issue at startup, or whether the problem is a token refresh or deadline-exceeded after the initial setup. Retrying for 5 minutes is too long for this.
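For readers following the thread, here is a minimal, self-contained sketch of the policy being described, assuming the 30s window and 3 retries mentioned above; the class and field names are illustrative, not the actual MessageStream internals.

```ts
// Illustrative sketch only, not the library's implementation.
const DEFAULT_TIMEOUT_MS = 30000; // retry window discussed above
const DEFAULT_MAX_RETRIES = 3;    // default retry count discussed above

class RetryWindow {
  private retries = 0;
  private firstErrorAt = 0;

  constructor(
    private readonly timeoutMs = DEFAULT_TIMEOUT_MS,
    private readonly maxRetries = DEFAULT_MAX_RETRIES
  ) {}

  /** Called on every stream failure; returns true if a reconnect should be attempted. */
  shouldRetry(now = Date.now()): boolean {
    if (now - this.firstErrorAt > this.timeoutMs) {
      // No recent failures: start a fresh window and reset the counter.
      this.retries = 0;
      this.firstErrorAt = now;
    }
    return this.retries++ < this.maxRetries;
  }
}
```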
I can understand the need to modify our retry mechanisms to support unauthenticated requests, but this change makes me a little nervous. My concern is that we start seeing an increase in errors that kill the subscriber when they would otherwise be retried within that 5-minute window. I would be interested to know how other clients handle retrying this specific RPC.
@sduskis any thoughts here?
Adding @jadekler @jskeet for comments. I don't know if the 90 seconds found in GAPIC applies here.
We definitely need to have a more holistic look at this topic. A couple of notes:
- I think an ideal state to be in is that the library creates no deadlines, instead opting to allow the user to specify a deadline. In some languages you can do this with contexts, but not all languages have contexts. We (client libs team, AC tools team) are investigating this topic later today.
- The list of codes to retry below seems too broad; I've noticed this in several pubsub clients and have an action item to chat with @sduskis about it. That said, my immediate feedback is a thumbs-down on this PR adding UNAUTHENTICATED to the retry codes: it's a signal to the user that their config is borked and should not be retried. (A sketch of a narrower set follows below.)
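To make that feedback concrete, here is a hedged sketch of a narrower retry-code set using the status enum from @grpc/grpc-js; the exact set is still the open question above, so treat the list itself as an assumption.

```ts
import {status} from '@grpc/grpc-js';

// A deliberately narrow, illustrative list of retryable codes. UNAUTHENTICATED is
// left out on purpose: it signals a configuration problem the user needs to fix.
const RETRY_CODES: status[] = [
  status.DEADLINE_EXCEEDED,
  status.RESOURCE_EXHAUSTED,
  status.ABORTED,
  status.INTERNAL,
  status.UNAVAILABLE,
];
```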
@@ -301,8 +298,14 @@ export class MessageStream extends PassThrough {
    */
   private _onEnd(stream: PullStream, status: StatusObject): void {
     this._removeStream(stream);

-    if (RETRY_CODES.includes(status.code)) {
+    if (Date.now() - this._timeSinceFirstError > this._options.timeout!) {
So to clarify, this will wait to see if the stream was open for more than 30 seconds (or whatever it was set to) and if so we consider it a success?
It actually checks whether there are any errors on the stream; it tries creating new streams for 30 seconds (or up to the max retries) and fails afterwards.
+    }
+    if (
+      RETRY_CODES.includes(status.code) &&
+      this._retries++ < this._options.maxRetries!
It isn't really uncommon for all the streams to end around the same time. I think if that were to happen we would likely blow past the default maxRetries pretty quickly and end up having too few streams open at any given time.
_fillStreamPool actually fills up the stream pool, so it will create all the streams it needs to fill the pool back up, even on the last retry. The scenario you mentioned can only happen if there is a single connected stream and every other old and newly created stream keeps failing, which sounds unlikely. In that case it will recreate all remaining streams if the last non-failing stream fails.
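As a generic illustration of the "fill back up to the target size" behaviour described here (not the MessageStream code itself):

```ts
// Generic sketch: top a pool back up to its target size in one pass.
interface StreamPool<T> {
  readonly target: number;
  readonly active: Set<T>;
  create(): T;
}

function fillPool<T>(pool: StreamPool<T>): void {
  while (pool.active.size < pool.target) {
    pool.active.add(pool.create());
  }
}
```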
I don't think that is always true. I've seen scenarios where the ending of streams can be staggered ever so slightly, and in some cases _fillStreamPool will need to be called multiple times.
The chance of that scenario could be reduced by decreasing the timeout and increasing maxRetries. WDYT of adding a call to _fillStreamPool in the keepAliveHandler if there have been no stream failures within the specified timeout?
I like the idea of improving our refill strategy, but I think putting it in the keep-alive handler is a little heavy handed. I think even just wrapping fillStreamPool in something like setImmediate might do the trick.
fillStreamPool is actually an async function, and it gets executed on a later event-loop turn than the one in which the stream failed.
Sorry, the thought I was trying to convey is that we might want to slightly stagger the call to fill the pool so that more streams have a chance to close.
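Roughly what that staggering could look like; `fillStreamPool` here is a stand-in parameter, not the method's real signature.

```ts
// Defer the refill by one macrotask so that streams ending in the same tick are
// all removed from the pool before it gets topped up again.
function scheduleRefill(fillStreamPool: () => Promise<void>): void {
  setImmediate(() => {
    fillStreamPool().catch(err => {
      // Error handling is up to the caller; logging keeps the sketch self-contained.
      console.error('stream pool refill failed', err);
    });
  });
}
```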
@callmehiphop Added a fix to fill up the pool if a stream errors out after retries are exhausted and there are fewer active streams in the pool than expected. Also changed the retry-count logic to count multiple stream failures as a single retry.
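A hedged sketch of the two behaviour changes just described; the names and return values are illustrative, not the actual MessageStream members.

```ts
const RESET_RETRIES_COUNT_TIME_MS = 30000; // reset window from the diff above

class RefillPolicy {
  private retries = 0;
  private firstErrorAt = 0;

  constructor(private readonly maxRetries = 3) {}

  /** Decide what to do when a stream errors out. */
  onStreamError(
    activeStreams: number,
    maxStreams: number,
    now = Date.now()
  ): 'retry' | 'refill' | 'give-up' {
    if (now - this.firstErrorAt > RESET_RETRIES_COUNT_TIME_MS) {
      // A new burst of failures counts as a single retry.
      this.retries += 1;
      this.firstErrorAt = now;
    }
    if (this.retries <= this.maxRetries) {
      return 'retry';
    }
    // Retries exhausted, but keep the pool at capacity if streams are missing.
    return activeStreams < maxStreams ? 'refill' : 'give-up';
  }
}
```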
   * @param {object} client The gRPC client to wait for.
   * @returns {Promise}
   */
  private async _waitForClientReady(client: ClientStub): Promise<void> {
I know we talked about deleting this, so I was expecting it. Do we also need to enable channel pooling?
That option is available by default:

import {PubSub} from '@google-cloud/pubsub';

const pubsub = new PubSub({
  'grpc_gcp.apiConfig': {
    channelPool: {
      maxSize: 5,
    },
  },
});
Oh, this might be problematic. My understanding is that we're making the switch to the pure JS version of gRPC soon and it doesn't support gcp. @alexander-fenster can you confirm that?
That's right. If you need to use this grpc_gcp functionality you'll need to do two things (a sketch follows below):
- pass the C-core gRPC as a parameter to the client constructor;
- set the grpc_gcp options in Pub/Sub code, since they will be removed from gax.
I'm going to do the same for Spanner. If Pub/Sub is also affected, we can postpone its upgrade to gax v1.0.0 until this is fixed in Pub/Sub code (together with Spanner).
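A hedged sketch of those two steps; whether the Pub/Sub constructor accepts a grpc override in a given release is an assumption you would want to verify against its ClientConfig.

```ts
import {PubSub} from '@google-cloud/pubsub';
import * as grpc from 'grpc'; // the C-core implementation

const pubsub = new PubSub({
  // 1) hand the C-core gRPC module to the client explicitly (assumed option)
  grpc,
  // 2) keep the grpc_gcp channel-pool settings in Pub/Sub code once gax no longer handles them
  'grpc_gcp.apiConfig': {
    channelPool: {maxSize: 5},
  },
});
```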
Thanks @alexander-fenster! Is the plan to keep Spanner on the C version until gcp support finds its way to grpc-js?
Yes. Also, we will remove all grpc-gcp code from gax (actually, already removed, just not yet released as v1.0.0) since it's not a good place for it (and an extra dependency on grpc), so libraries that want to use grpc-gcp should depend on it directly.
Good to know. I think we're going to want to postpone the gax upgrade here unless @ajaaym can think of another fix that doesn't involve channel pooling.
@callmehiphop WDYT of an option in pubsub's getClient to provide a new client vs. the cached one? Then, depending on maxStreams, we could create maxStreams / 100 + 1 clients.
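The arithmetic from that suggestion as a helper; STREAMS_PER_CLIENT reflects the 100-concurrent-stream limit mentioned below, and the `+ 1` mirrors the formula in the comment rather than a strict ceiling.

```ts
const STREAMS_PER_CLIENT = 100; // per-channel concurrent-stream limit discussed below

function clientsNeeded(maxStreams: number): number {
  return Math.floor(maxStreams / STREAMS_PER_CLIENT) + 1;
}

// e.g. clientsNeeded(250) === 3
```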
I think that sounds good. Out of curiosity have we tested against grpc-js to see if it is affected by this?
I was playing around with that yesterday and it looks like it's not that stable yet. Yes, there is a limitation of 100 concurrent streams. Also, I don't see an option to set the flow-control limit, which would be needed for load balancing ultra-slow consumers. When testing with 100 streams I was randomly getting the error below, along with DEADLINE_EXCEEDED. I think it's better we hold off on the upgrade until we have this option and it's well tested under different conditions.
Error [ERR_STREAM_WRITE_AFTER_END]: write after end
at writeAfterEnd (_stream_writable.js:243:12)
at ClientHttp2Stream.Writable.write (_stream_writable.js:291:5)
at filterStack.sendMessage.then (/Users/ajaymovalia/work/node/testwithexpress2/node_modules/@grpc/grpc-js/build/src/call-stream.js:396:34)
Emitted 'error' event at:
at onwriteError (_stream_writable.js:431:12)
at onwrite (_stream_writable.js:456:5)
at process._tickCallback (internal/process/next_tick.js:63:19)
Fixes #550 #318