Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only emit AMQP channels downstream when they are active. #24582

Closed
conniey opened this issue Oct 6, 2021 · 1 comment · Fixed by #27400
Closed

Only emit AMQP channels downstream when they are active. #24582

conniey opened this issue Oct 6, 2021 · 1 comment · Fixed by #27400
Assignees
Labels
amqp Label for tracking issues related to AMQP Azure.Core azure-core Client This issue points to a problem in the data-plane of the library. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) Service Bus
Milestone

Comments

@conniey
Copy link
Member

conniey commented Oct 6, 2021

  • Currently in AMQPChannelProcessor, when the CBS node is closed, it'll make a request upstream for a new instance. This instance is immediately emitted even if it's not active.

    • In the case that the underlying connection is closed, the link will never be active.
    • Consequently, there is a lot of back and forth when reconnection happens.
  • This should alleviate the mass of logs accumulated about retries and fix retry policy-related issues.

/cc @anuchandy

@conniey conniey added Client This issue points to a problem in the data-plane of the library. Azure.Core azure-core pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) amqp Label for tracking issues related to AMQP labels Oct 6, 2021
@conniey conniey added this to the [2021] November milestone Oct 6, 2021
@conniey conniey removed this from the [2021] November milestone Jan 14, 2022
@ramya-rao-a ramya-rao-a added this to the Backlog milestone Jan 14, 2022
@liukun-msft liukun-msft self-assigned this Feb 21, 2022
@liukun-msft
Copy link
Contributor

liukun-msft commented Mar 2, 2022

Log Analysis

azure-amqp-core version: 2.4.0

We can define 5 phases for the closing logs logs.md:

  1. Connection start to close
  2. CBS node start to close
  3. CBS node close timeout exception
  4. CBS node constantly retry logs
  5. ReactorExecutor closed

Call graph

Graph 1 - CBS node start to close and timeout exception

cbs-close-issue-1

Graph 2 - CBS node constantly retry and ReactorExecutor closed

cbs-close-issue-2

Issue 1: Two threads are created to close CBS node

Reason

We manually try to close an closed CBS node twice. and both got timeout.

First time:

...
//RequestResponseChannel
this.subscriptions = Disposables.composite(
    ...
    amqpConnection.getShutdownSignals().next().flatMap(signal -> {
        logger.verbose("Shutdown signal received.");
        return closeAsync();
    }).subscribe()
);

Second time:

//ReactorConnection
final Mono<Void> cbsCloseOperation;
if (cbsChannelProcessor != null) {
    cbsCloseOperation = cbsChannelProcessor.flatMap(channel -> channel.closeAsync());
} else {
    cbsCloseOperation = Mono.empty();
}

These two manually calls both encounter the timeout exceptions and create a new thread to resume.

Issue 2: CBS node close timeout exception

RequestResponseChannel sender and receiver links haven't sent the Link:Detach frame. And the onRemoveLinkClosed in link handlers haven't been triggered to emit complete closing signal to end the cbsClosingWithTimeout.

Why Link:Detach frame haven't been sent?

Because ReactorSession is closed ealier than RequestResponseChannel and Session:END frame will override the sending of Link:Detach frame. So when channel is ready to send Session:END and Link:Detach are not be scheduled to send, we will met this timeout issue.

When we use the debug mode, sometimes there is no timeout issue. This is because channel is ready until both Session:END and Link:Detach have been scheduled to send. In this case, channel will send Link:Detach first and then Session:END frame.

Issue 3: CBS node retry infinite loop which cause massive logs

Reason

When CBS node (RequestResponseChannel) is closed, it will emit a complete signal, and that signal is catch by connectionSubscription which will request a new upstream RequestResponseChannel.

But when new instance of RequestResponseChannel is created, it will be closed immediately since connection is shutdown. And that close action will emit a complete signal again and request upstream again.

This back and forth process cause massive log in console.

Code details
When CBS sender and receiver links are closed. It will emit a complete signal:

//RequestResponseChannel
private void onTerminalState() {
    ...
    endpointStates.emitComplete(((signalType, emitResult) -> onEmitSinkFailure(...)));
    ...
}

The complete signal is catched by connectionSubscription. Because channel status is not disposed currently and then goes to requestUpstream().

//AmqpChannelProcessor
connectionSubscription = endpointStatesFunction.apply(amqpChannel).subscribe(
        ...
        () -> {
                ...
                requestUpstream();
        });

Because CBS channel is requested from a repeat Flux inside createRequestResponseChannel(), so a RequestResponseChannel instance is created.

//ReactorConnection
protected AmqpChannelProcessor<RequestResponseChannel> createRequestResponseChannel(
    final Flux<RequestResponseChannel> createChannel = ...
         .map(reactorSession -> new RequestResponseChannel(...))
        .doOnNext(e -> {...})
        .repeat();

However, inside RequestResponseChannel constructor, it call closeAysc() and schedule link close tasks, which close current channel after it created.

...
//RequestResponseChannel
this.subscriptions = Disposables.composite(
    ...
    amqpConnection.getShutdownSignals().next().flatMap(signal -> {
        logger.verbose("Shutdown signal received.");
        return closeAsync();
    }).subscribe()
);

As new channel is closed, it will emit a complete signal to request new upstream, which goes to the beginning step.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
amqp Label for tracking issues related to AMQP Azure.Core azure-core Client This issue points to a problem in the data-plane of the library. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) Service Bus
Projects
None yet
4 participants