
[BUG] EventProcessorClient stops reading from partitions after transient (timeout?) errors and never recovers. #14652

Closed
shubhambhattar opened this issue Aug 31, 2020 · 6 comments
Labels: Client, customer-reported, Event Hubs, pillar-reliability, question

Comments

@shubhambhattar
Contributor

shubhambhattar commented Aug 31, 2020

Describe the bug
EventProcessorClient stops reading from a few partitions. Some transient errors occur, and the client never recovers from them.

Exception or Stack Trace

2020-08-26 08:54:38,502 [single-1] WARN  c.a.m.e.i.EventHubConnectionProcessor - Retry #2. Transient error occurred. Retrying after 14575 ms.
org.apache.qpid.proton.engine.TransportException: connection aborted, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-26 08:54:38,523 [single-1] WARN  c.a.c.a.i.h.ConnectionHandler - onTransportError hostname[some-namespace.servicebus.windows.net:5671], connectionId[MF_2d5f6d_1598431778229], error[org.apache.qpid.proton.engine.TransportException: connection aborted]
2020-08-26 08:54:38,523 [single-1] ERROR c.a.c.a.i.ReactorConnection - connectionId[MF_2d5f6d_1598431778229] Error occurred in connection handler.
org.apache.qpid.proton.engine.TransportException: connection aborted, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-26 08:54:49,246 [parallel-1] WARN  c.a.c.a.i.RetryUtil - Retry attempts are exhausted. Current: 4. Max: 3.
2020-08-26 08:54:49,246 [boundedElastic-11] WARN  c.a.m.e.PartitionPumpManager - Error receiving events from partition 7
Did not observe any item or terminal signal within 60000ms in 'takeUntil' (and no fallback has been configured)
2020-08-26 08:54:49,247 [boundedElastic-11] ERROR c.e.e.i.ProcessError - Error while consuming from EventHub partition 7
java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'takeUntil' (and no fallback has been configured)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:288)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:273)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:395)
	at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:117)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
2020-08-26 08:55:37,056 [parallel-1] WARN  c.a.m.e.PartitionBasedLoadBalancer - Load balancing for event processor failed - Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
2020-08-26 08:55:37,056 [parallel-1] ERROR c.e.e.i.ProcessError - Error while consuming from EventHub partition NONE
java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:288)
	at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:273)
	at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:395)
	at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
	at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:117)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
2020-08-26 08:55:38,523 [single-1] WARN  c.a.m.e.i.EventHubConnectionProcessor - Retry #2. Transient error occurred. Retrying after 14575 ms.
org.apache.qpid.proton.engine.TransportException: connection aborted, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-26 08:55:38,544 [single-1] WARN  c.a.c.a.i.h.ConnectionHandler - onTransportError hostname[some-namespace.servicebus.windows.net:5671], connectionId[MF_96d405_1598431877052], error[org.apache.qpid.proton.engine.TransportException: connection aborted]
2020-08-26 08:55:38,544 [single-1] ERROR c.a.c.a.i.ReactorConnection - connectionId[MF_96d405_1598431877052] Error occurred in connection handler.
org.apache.qpid.proton.engine.TransportException: connection aborted, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-26 08:56:38,545 [single-1] WARN  c.a.m.e.i.EventHubConnectionProcessor - Retry #2. Transient error occurred. Retrying after 14575 ms.
org.apache.qpid.proton.engine.TransportException: connection aborted, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]

To Reproduce
Just instantiate an EventProcessorClient and let it run. In my case, I have an Event Hubs namespace with 20 TUs and one event hub (with 32 partitions); a minimal setup sketch is shown below.
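
For context, a minimal sketch of a setup like the one described, assuming a Blob Storage checkpoint store and placeholder connection strings and names (illustrative only; the reporter's actual configuration is linked later in the thread):

import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class Repro {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder values -- replace with a real namespace, event hub, and storage account.
        String eventHubsConnectionString = "<event-hubs-connection-string>";
        String storageConnectionString = "<storage-connection-string>";

        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
            .connectionString(storageConnectionString)
            .containerName("checkpoints")
            .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString(eventHubsConnectionString, "<event-hub-name>")
            .consumerGroup("$Default")
            .checkpointStore(new BlobCheckpointStore(checkpointContainer))
            .processEvent(eventContext -> {
                // The event would normally be processed here, then checkpointed.
                eventContext.updateCheckpoint();
            })
            .processError(errorContext -> {
                // Errors such as the TimeoutException in the logs surface here.
                System.err.printf("Error on partition %s: %s%n",
                    errorContext.getPartitionContext().getPartitionId(),
                    errorContext.getThrowable());
            })
            .buildEventProcessorClient();

        processor.start();
        Thread.currentThread().join(); // keep the process alive until the issue reproduces
    }
}

With 32 partitions, running a single processor instance like this for long enough is what surfaces the behavior described above.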

Expected behavior
EventProcessorClient should recover from the error and resume reading from those partitions rather than stop reading from them permanently.

Screenshots
https://imgur.com/a/67DbJtD
Please check these screenshots. I have the history of the last 10 days, and it can be seen that once consumption from a partition stops, it never recovers.

Setup (please complete the following information):

  • Version of the Library used: 5.1.2

Additional context
This problem seems to have been in the SDK for quite some time; this is not the first time I have reported it. Check

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@ghost ghost added the needs-triage, customer-reported, and question labels Aug 31, 2020
@joshfree joshfree added the Client and Event Hubs labels Aug 31, 2020
@ghost ghost removed the needs-triage label Aug 31, 2020
@joshfree joshfree added the pillar-reliability label Aug 31, 2020
@joshfree
Member

Thanks for filing this issue @shubhambhattar. @srnagar @conniey @YijunXieMS can you please follow up?

@srnagar
Member

srnagar commented Sep 14, 2020

@shubhambhattar A new version of Event Hubs has been released that contains a fix for this issue. Could you please try the new version and verify whether the issue still exists?
azure-messaging-eventhubs - 5.2.0
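
For reference, picking up this version corresponds to bumping the Maven dependency to the coordinates above, e.g.:

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-messaging-eventhubs</artifactId>
        <version>5.2.0</version>
    </dependency>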

This issue is related to #13785

@srnagar srnagar closed this as completed Sep 14, 2020
@shubhambhattar
Contributor Author

@srnagar Still doesn't work. SDK stops reading from certain partitions and never recovers.
(Screenshot taken 2020-09-29 at 01:22:40, showing per-partition consumption rates:)

{partitionId="13"} 490.5
{partitionId="31"} 492.91999999999996
{partitionId="4"} 493.14
{partitionId="5"} 493.96
{partitionId="21"} 494.15999999999997
{partitionId="16"} 493.06
{partitionId="1"} 493.88
{partitionId="18"} 494.8
{partitionId="27"} 492.97999999999996
{partitionId="19"} 495.31999999999994
{partitionId="12"} 0
{partitionId="11"} 492.58
{partitionId="29"} 494.06
{partitionId="6"} 492.6
{partitionId="28"} 493.91999999999996
{partitionId="7"} 492.93999999999994
{partitionId="30"} 0
{partitionId="8"} 494.35999999999996
{partitionId="0"} 0
{partitionId="15"} 491.56
{partitionId="10"} 494.28
{partitionId="17"} 494.5
{partitionId="14"} 491.5
{partitionId="3"} 492.9
{partitionId="26"} 492.23999999999995
{partitionId="25"} 492.35999999999996
{partitionId="20"} 495.5
{partitionId="24"} 0
{partitionId="22"} 0
{partitionId="9"} 494.84
{partitionId="23"} 492.23999999999995
{partitionId="2"} 0

A snapshot at a given point in time also shows that messages from some partitions (0, 2, 12, 22, 24, and 30 above) are not being consumed.

@srnagar
Member

srnagar commented Sep 28, 2020

@shubhambhattar Could you please provide the complete logs for the partitions that stopped receiving (covering at least a couple of hours after receiving stopped)? Also, can you share the code snippet showing how you are setting up your client?

@shubhambhattar
Contributor Author

shubhambhattar commented Oct 2, 2020

@srnagar The code snippet showing how my client is set up is here: https://github.com/shubhambhattar/azure-eventhub-replication/blob/main/src/main/java/com/example/ehreplication/config/EHConsumerConfig.java#L60

I've added a sample project here: https://github.com/shubhambhattar/azure-eventhub-replication, which I was using to try out the SDK (in case that's also needed).

@djangofan

djangofan commented Feb 11, 2021

I get this error using the following dependency (and also with version 1.2.8):

    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>spring-cloud-azure-eventhubs-stream-binder</artifactId>
        <version>2.0.0-beta.1</version>
    </dependency>

NOTE: I am not explicitly using the azure-messaging-eventhubs library. The error doesn't seem to break my consumer binding; it is just thrown every 60 seconds while the service is idle.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023