[BUG] Azure Event hub | One partition suddenly stops receiving messages #15164

Closed
yuhaii opened this issue Sep 14, 2020 · 15 comments
Assignees
Labels
customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@yuhaii

yuhaii commented Sep 14, 2020

Describe the bug
We use the SDK versions below to receive messages from Event Hubs.

com.azure azure-messaging-eventhubs 5.1.1
com.azure azure-messaging-eventhubs-checkpointstore-blob 1.1.1

But partition #3 suddenly stopped receiving messages at 9/11 1:22 UTC. We can see that its checkpoint stopped updating.

image

The outgoing message count dropped accordingly.

image

It recovered at 9/11 5:02 UTC. We can see that the partition #3 checkpoint resumed updating at this time.

image

We checked the sender side and confirmed that messages continued to be sent to Event Hubs partition #3 from 9/11 1:22 to 5:02 UTC. However, the logs in the customer code confirm that the receive callback processContext was not called for partition #3 during this time range.

public EventProcessorClient eventProcessorClientBuilder(
    @Autowired CheckpointStore checkpointStore,
    @Autowired EventHubRecordProcessor eventHubRecordProcessor) {

  return new EventProcessorClientBuilder()
      .connectionString(connectionstring, eventHubName)
      .consumerGroup(consumerGroupName)
      .processEventBatch(
          eventHubRecordProcessor::processContext, batchSize, Duration.ofSeconds(maxWaitTime))
      .processPartitionClose(eventHubRecordProcessor::closeContext)
      .processPartitionInitialization(eventHubRecordProcessor::initContext)
      .processError(eventHubRecordProcessor::errorContext)
      .checkpointStore(checkpointStore)
      .buildEventProcessorClient();

}

public void processContext(EventBatchContext eventContext) {

List<EventData> eventDataList = eventContext.getEvents();
int msgListSize = eventDataList.size();
if (msgListSize == 0) {
  return;
}
LOGGER.info("Batch received of size: {}", msgListSize);
String partitionId = eventContext.getPartitionContext().getPartitionId();
EventData lastEvent = eventDataList.get(msgListSize - 1);
long sequenceNo = lastEvent.getSequenceNumber();

// Logs request
LOGGER.trace(
    "EventHubRecordProcessor-onEvents msg-list-size, {}, partition-id, {}",
    msgListSize,
    partitionId);

try {

  LOGGER.info("[EVENT-HUB] before batch pid {} sequence {}", partitionId, sequenceNo); // <-- logs show this line was never reached for partition #3

  // If all the processing is done without error then only we do checkpointing
  eventDataBatchProcessor.processBatch(eventDataList);

  LOGGER.info("[EVENT-HUB] after batch pid {} sequence {}", partitionId, sequenceNo);

  // Checkpoint after checkpointIntervalMillis
  long now = System.currentTimeMillis();
  if (now > checkpointTimeMap.getOrDefault(partitionId, 0L)) {
    doCheckpointing(eventContext, partitionId, sequenceNo);
    checkpointTimeMap.put(partitionId, now + checkpointIntervalMillis);
  }

} catch (Exception e) {
  LoggerUtil.logErrorMessage(
      LOGGER,
      "[DATA-DROPPED-CTS-WRITER] iot-cts-writer-eph-failed attempting retry with Exception : [{}]",
      e);
}

}
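
The time-based checkpoint throttling in processContext above can be isolated into a small helper, which makes it easier to test on its own. This is only a sketch: CheckpointThrottle and shouldCheckpoint are hypothetical names, not part of the SDK or the attached code. Unlike the original, it reserves the next checkpoint slot before the checkpoint call is made rather than after it succeeds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: decides, per partition, whether it is time to checkpoint again. */
final class CheckpointThrottle {
    private final long intervalMillis;
    // partitionId -> earliest time (epoch millis) after which the next checkpoint is allowed
    private final Map<String, Long> nextAllowed = new ConcurrentHashMap<>();

    CheckpointThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true if the caller should checkpoint now. When it returns true it also
     * schedules the next allowed checkpoint at nowMillis + intervalMillis.
     */
    boolean shouldCheckpoint(String partitionId, long nowMillis) {
        boolean[] due = {false};
        // compute() is atomic per key on ConcurrentHashMap
        nextAllowed.compute(partitionId, (id, next) -> {
            if (next == null || nowMillis > next) {
                due[0] = true;
                return nowMillis + intervalMillis;
            }
            return next;
        });
        return due[0];
    }
}
```

In the callback this would replace the checkpointTimeMap lookup, for example `if (throttle.shouldCheckpoint(partitionId, System.currentTimeMillis())) { doCheckpointing(...); }`.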

image

We tried updating the SDK to the following latest beta versions, but the issue still exists.

https://mvnrepository.com/artifact/com.azure/azure-messaging-eventhubs/5.2.0-beta.2
https://mvnrepository.com/artifact/com.azure/azure-messaging-eventhubs-checkpointstore-blob/1.2.0-beta.2

Exception or Stack Trace
No exception. While messages were being sent to the hub, the receiver callback processContext was not called for the affected partition. The affected partition number is random; in the latest reproduction it was partition #3.
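
Since no exception is raised, one way to notice the stall earlier is to track when the callback last ran for each partition. The following is a sketch under assumptions: PartitionStallDetector and its methods are hypothetical names, not SDK types. It assumes the batch callback keeps firing (possibly with an empty batch) at least every maxWaitTime on a healthy partition, which the empty-batch early return in processContext suggests but this thread does not confirm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: reports partitions whose callback has not run recently. */
final class PartitionStallDetector {
    private final long stallThresholdMillis;
    // partitionId -> time (epoch millis) of the most recent callback invocation
    private final Map<String, Long> lastCallbackMillis = new ConcurrentHashMap<>();

    PartitionStallDetector(long stallThresholdMillis) {
        this.stallThresholdMillis = stallThresholdMillis;
    }

    /** Call at the very top of the batch callback, before any empty-batch early return. */
    void recordCallback(String partitionId, long nowMillis) {
        lastCallbackMillis.put(partitionId, nowMillis);
    }

    /** Call from the partition-close callback so a released partition is not reported. */
    void forget(String partitionId) {
        lastCallbackMillis.remove(partitionId);
    }

    /** Returns the sorted ids of partitions whose last callback is older than the threshold. */
    List<String> stalledPartitions(long nowMillis) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastCallbackMillis.entrySet()) {
            if (nowMillis - e.getValue() > stallThresholdMillis) {
                stalled.add(e.getKey());
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

A scheduled task could call stalledPartitions(System.currentTimeMillis()) once a minute and log or alert on any non-empty result. The threshold should be several times larger than maxWaitTime to avoid false alarms.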

To Reproduce
Steps to reproduce the behavior: run the attached code for 2-3 days and the issue will reproduce.

Code Snippet
I attached the code snippet for reference.

Expected behavior
The callback should be invoked normally for all partitions.

Screenshots
See the screenshots in the description.

Setup (please complete the following information):

  • OS: Windows
  • IDE: Java/Maven
  • Version of the library used:
com.azure azure-messaging-eventhubs 5.1.1
com.azure azure-messaging-eventhubs-checkpointstore-blob 1.1.1
@ghost added the needs-triage, customer-reported, and question labels Sep 14, 2020
@yuhaii
Author

yuhaii commented Sep 14, 2020

code.zip
I attached the code for reference

@srnagar self-assigned this Sep 14, 2020
@ghost removed the needs-triage label Sep 14, 2020
@srnagar
Member

srnagar commented Sep 14, 2020

We recently released newer versions of the Event Hubs libraries that contain a fix for this issue. Could you please try updating the version and see if you still have this issue?

azure-messaging-eventhubs - 5.2.0
azure-messaging-eventhubs-checkpointstore-blob - 1.2.0

This issue is related to #13785

@yuhaii
Author

yuhaii commented Sep 15, 2020

Thanks, @srnagar! We will try this SDK version.

@yuhaii
Author

yuhaii commented Sep 16, 2020

Hello @srnagar, before we migrate to the new version, can you confirm whether the issue is fixed? Switching versions consumes a lot of resource bandwidth and carries a significant additional cost for testing the new version for quality and load performance.

I would like a confirmation on this from the relevant product team, so that we can move ahead with confidence and avoid any unnecessary migration.

@vinceve

vinceve commented Sep 26, 2020

@yuhaii we had similar issues on our setup and for now it seems resolved. So the update worked for us.

@yuhaii
Author

yuhaii commented Sep 28, 2020

Got it. Thanks for the confirmation, @vinceve.

@srnagar
Member

srnagar commented Sep 28, 2020

Thanks for the confirmation @vinceve! Closing this issue.

@srnagar closed this as completed Sep 28, 2020
@vinceve

vinceve commented Oct 3, 2020

@srnagar it stopped working for us last night. I guess the bug still persists.

Screenshot 2020-10-03 at 08 43 21

The blue line shows incoming messages, and the orange line shows outgoing messages after a reboot.

I will send you the logs.

@pbsf

pbsf commented Nov 9, 2020

@vinceve, any updates on that? The same issue just occurred in one of my consumers. We are using version 5.2.0.

@yuhaii
Author

yuhaii commented Jan 5, 2021

Happy new year, @srnagar. This issue reproduced again on 5.2.0.

We observed that checkpointing for the partition was stuck for a couple of days until we reset it manually. Please find attached a screenshot of the metric.

image

With the old SDK, we could break the lease on that checkpoint file to mitigate the issue. With the new SDK, the checkpoint file is already un-released, so we have to restart the application. This is our production application. Is there a good workaround if you can't fix this issue immediately? We don't want to restart the production application each time this issue happens.

Could you please double-check this issue? Thanks in advance.
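
One option short of restarting the whole application might be to restart only the processor. EventProcessorClient exposes stop() and start(), but nobody in this thread has confirmed that stopping and starting it clears this particular stall, so treat the following as an untested sketch. ProcessorRestarter is a hypothetical helper. It takes the stop and start actions as plain Runnables and enforces a minimum interval so that a persistent fault does not cause a restart loop.

```java
/** Hypothetical helper: runs stop then start, at most once per minimum interval. */
final class ProcessorRestarter {
    private final Runnable stop;
    private final Runnable start;
    private final long minIntervalMillis;
    private Long lastRestartMillis; // null until the first restart

    ProcessorRestarter(Runnable stop, Runnable start, long minIntervalMillis) {
        this.stop = stop;
        this.start = start;
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Restarts the processor unless a restart already happened within the minimum
     * interval. Returns true if a restart was performed.
     */
    synchronized boolean restartIfAllowed(long nowMillis) {
        if (lastRestartMillis != null && nowMillis - lastRestartMillis < minIntervalMillis) {
            return false; // too soon since the previous restart
        }
        stop.run();
        start.run();
        lastRestartMillis = nowMillis;
        return true;
    }
}
```

With the SDK this could be wired as `new ProcessorRestarter(client::stop, client::start, 300_000L)`, where client is the EventProcessorClient built earlier and the five-minute interval is an arbitrary example value.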

@srnagar
Member

srnagar commented Jan 5, 2021

@yuhaii as discussed offline, please use version 5.3.1 as it contains a fix for this issue.

@yuhaii
Author

yuhaii commented Jan 6, 2021

Understood, we will try v5.3.1. Thanks for the confirmation, @srnagar!

@yuhaii
Author

yuhaii commented Feb 1, 2021

Hello @srnagar, good day. Our customer reported that they ran load tests with the latest Event Hubs SDK version and are still facing a checkpoint-related issue.

compile group: 'com.azure', name: 'azure-messaging-eventhubs', version: '5.4.0'
compile group: 'com.azure', name: 'azure-messaging-eventhubs-checkpointstore-blob', version: '1.4.0'

The checkpoint and ownership blobs are not getting updated. Please find the details below for reference:
image

Could you please help us double-check this issue? Thank you.

@srnagar
Member

srnagar commented Feb 6, 2021

@yuhaii could you please share the logs from when this issue happened? This is a different case from partitions no longer receiving events: here the ownership is not being updated, and we need logs to investigate further.

@lovababu

lovababu commented Jun 30, 2021

We started seeing the same issue with the Java SDK (azure-eventhubs-eph v2.1.0). Has this been addressed in the eph library too? @srnagar

@github-actions bot locked and limited conversation to collaborators Apr 12, 2023