Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addressing IllegalStateException due to double free of Connection reference by the Transport #34122

Merged
merged 2 commits into from
Mar 21, 2023

Conversation

anuchandy
Copy link
Member

@anuchandy anuchandy commented Mar 20, 2023

The problem - IllegalStateException

Some users reported that they see IllegalStateException (ISE) randomly.

[reactor-executor-1] [WARN] c.a.c.a.i.ReactorExecutor - {
  "az.sdk.message":"Unhandled exception while processing events in reactor, report this error.",
  "exception":"java.lang.IllegalStateException","connectionId":"MF_497bd3_1679011231485"
}

The reason for IllegalStateException

While there is no consistent repro, we were able to experience ISE randomly with network disconnect. This section documents the reason from stack-trace and code analysis.

The Qpid Connection object is a reference counted endpoint. As each referring Qpid resource (Reactor, Transport, Session) using it gets freed, the reference count gets decremented.

The association (binding) of Qpid Transport with the Qpid Connection increases the Connection's reference count by one.

At the end of the Transport's life, two things (of many) need to happen -

  1. unbind the association between Transport and Connection; otherwise, it's a memory leak.
  2. Decrement the Connection's reference count.

The Transport::unbind API takes care of both, i.e., in addition to unbinding, the API reduces the Connection's reference count.

The code flow leads to IllegalStateException

The 'ConnectionHandler' has a transport_error event callback performing Transport unbind.

@Override
public void onTransportError(Event event) {
    final Connection connection = event.getConnection();
    final Transport transport = event.getTransport();
    ...
    transport.unbind();
}

The Qpid 'Default Global IO Handler' also performs Transport unbind on the transport_closed event.

case TRANSPORT_CLOSED:
   event.getTransport().unbind();
   break;

While Transport is only ONE logical reference to the Connection, two terminal (transport_error, transport_closed) events of the same Transport may reduce the reference count by TWO. When the stack runs into this undesired reference count reduction, it MAY lead to IllegalStateException (ISE).

void decref() {
    refcount--;
    if (refcount == 0) {
        ...
    } else if (refcount < 0) {
        throw new IllegalStateException();
    }
}

The reason for saying, on transport_closed event, IllegalStateException "MAY" happen, making ISE random, is - as the circumstance also requires all other referring resources to reduce the reference count so that the value is zero by the time the 'Default Global IO Handler' get to handle the transport_closed event.

The expected code behavior for correctness

One may ask - Why is it required to unbind on transport_error when transport_closed already does unbind? Digging the history indicates that Qpid IO may only trigger transport_error but not transport_closed.

For correctness, the unbind needs to happen at least once (else memory leak) and at most once (else undesired reference count reduction) in the following 3 mutually exclusive cases -

  1. transport_closed triggered but transport_error never trigerred.
  2. transport_error triggered but transport_closed never trigerred.
  3. transport_error then transport_closed trigerred.

The transport_closed handling in TransportHandler

The Azure Amqp-Core has a type TransportHandler. This Handler has a callback for transport_closed event, with a logic that checks if it's safe to unbind.

The safety check is via 'connection.getTransport'; if it returns null, it means unbind is already done, so don't do another unbind potentially causing undesired reference count reduction.

final Transport transport = event.getTransport();
final Connection connection = event.getConnection();
...
if (transport != null && connection != null && connection.getTransport() != null) {
    transport.unbind();
}

But why the TransportHandler is not preventing ISE today? The TransportHandler is "added" as the Global handler after the Qpid 'Default Global IO Handler',

final TransportHandler transportHandler = new TransportHandler(connectionId);
....
final Reactor reactor = Proton.reactor(reactorOptions, ..);
reactor.getGlobalHandler.add(transportHandler);
//       |                            |
//       |-- returns IOHandler.       |-- adds TransportHandler after IOHandler.

since TransportHandler is the second Global Handler, the first handler i.e. Qpid 'Default Global IOHandler' doing unbind without safetly check and when reference count is already zero can cause IllegalStateException.

Addressing the problem

To address the problem, change TransportHandler to inherit from current 'Default Qpid Global IOHandler' and make it the "only" Global Handler. Such that,

  1. TransportHandler becomes the one handling the transport_closed event that does a safety check before unbind.
  2. The transport_closed event is no longer handled the way it used to be (by the old 'Default Qpid Global IOHandler'). Preventing unsafe unbind that could cause undesired reference count decrement when combined with ConnectionHandler's transport_error handling.
  3. The other events handling continues to stay the same as new Global TransportHandler inherits from old Global IOHandler.

For better readability, rename the 'TransportHandler' to 'GlobalIOHandler'.

Additional notes

  1. The Connection's AmqpChannelProcessor retries on IllegalStateException; this analysis probably uncovers the potential reason for that magic retry case, but let's only remove the case with further analysis.

@anuchandy anuchandy added Azure.Core azure-core amqp Label for tracking issues related to AMQP labels Mar 20, 2023
@anuchandy anuchandy self-assigned this Mar 20, 2023
@azure-sdk
Copy link
Collaborator

API change check

API changes are not detected in this pull request.

@anuchandy anuchandy enabled auto-merge (squash) March 21, 2023 22:46
@anuchandy anuchandy merged commit 5338e6c into Azure:main Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
amqp Label for tracking issues related to AMQP Azure.Core azure-core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants