Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Events passed to processEvents are not always ordered by sequenceNumber #23993

Closed
03byron opened this issue Nov 24, 2022 · 10 comments
Closed

Events passed to processEvents are not always ordered by sequenceNumber #23993

03byron opened this issue Nov 24, 2022 · 10 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that
Milestone

Comments

@03byron
Copy link

03byron commented Nov 24, 2022

  • Package Name: @azure/event-hubs
  • Package Version: 8.19.2
  • Operating system: Docker: node:18.12.1-alpine
  • nodejs
    • version: 18.12.1

Describe the bug
The list of events passed into the processEvents callback is not always ordered by sequenceNumber. This leads to processing the events in a wrong order. If that is intended it would be good to mention it in the documentation.

It does not happen very often - we see 6 occurrences in the last 4 hours. We processed ~117k events in that timeframe.

To Reproduce
Steps to reproduce the behavior:

  1. Setup an event consumer:
async function processEvents(events: ReceivedEventData[], context: PartitionContext) {
  if (events.length === 0) {
    return;
  }
  if (
    events.some((event, index) => {
      return (
        !!events[index + 1] && // its not the last event in the list
        event.sequenceNumber > events[index + 1].sequenceNumber // Current sequence number is larger than the next one
      ); 
    })
  ) {
    console.error(`Events received in wrong order: ${events.map(({ sequenceNumber }) => sequenceNumber).join(' ,')}`);
  }
}
  1. Observe the log for the above message.

Expected behavior
No error should appear in the log as events should be sorted by sequenceNumber.

Additional context
The samples we have contain all consecutive events in one call to processEvents. They are just not ordered appropriately.

We see log messages like:

Example 1

Events received in wrong order: 78716749 ,78716747 ,78716748 ,78716750 ,78716751 ,78716752 ,78716753 ,78716754 ,78716755 ,78716756 ,78716757 ,78716758 ,78716759 ,78716760 ,78716761 ,78716762 ,78716763 ,78716764 ,78716765 ,78716766 ,78716767 ,78716768 ,78716769 ,78716770 ,78716771 ,78716772 ,78716773 ,78716774 ,78716775 ,78716776 ,78716777 ,78716778 ,78716779

Example 2

Events received in wrong order: 402880 ,402881 ,402882 ,402877 ,402878 ,402879 ,402883 ,402884 ,402885 ,402886 ,402887 ,402888 ,402889 ,402890 ,402891 ,402892 ,402893 ,402894 ,402895 ,402896 ,402897 ,402898 ,402899 ,402900

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Nov 24, 2022
@azure-sdk azure-sdk added Client This issue points to a problem in the data-plane of the library. Event Hubs needs-team-triage Workflow: This issue needs the team to triage. labels Nov 24, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Nov 24, 2022
@xirzec xirzec removed the needs-team-triage Workflow: This issue needs the team to triage. label Nov 28, 2022
@deyaaeldeen
Copy link
Member

Thanks for the report! Could you share the subscription id, EH namespace, and the Event Hub name, in addition to an approximate time window when the behavior was observed?

@03byron
Copy link
Author

03byron commented Dec 1, 2022

Hi @deyaaeldeen - thank you for the reply. We provided the requested details via a service ticket (ID 2212010050000732). Some findings worth sharing in the meantime:

  1. We see the issue on at least two different event streams
  2. We see the issue on at least two nodejs implementations made by different teams. However, they do not receive the same wrongly ordered sequences.
  3. We have one other nodejs implementation that differs a bit: Its subscribes once a day to a single partition(instead of all), processes all events collected since yesterday and sleeps again until the next day (whilst the others always listen). We could not find the issue in this process.

Because of the above observations our assumption is a race condition in the js sdk.

@deyaaeldeen
Copy link
Member

Thanks a lot for sharing the details and the additional observations!

We have one other nodejs implementation that differs a bit: Its subscribes once a day to a single partition(instead of all), processes all events collected since yesterday and sleeps again until the next day (whilst the others always listen). We could not find the issue in this process.

From the above, do I understand it correctly that you observe the issue when the consumer subscribed to multiple partitions at the same time but not when subscribed to a single one?

Please note that events are ordered in each partition individually but not across all partitions, i.e., sequence numbers are meaningful only within the context of a single partition. For a single processEvents call, the input event list comes from a single partition so I am puzzled that the issue happens when connecting to multiple partitions but not to one.

I will ask the service team to take a look. In the meantime, please keep us posted on any further observations.

@03byron
Copy link
Author

03byron commented Dec 5, 2022

From the above, do I understand it correctly that you observe the issue when the consumer subscribed to multiple partitions at the same time but not when subscribed to a single one?

Correct.

Please note that events are ordered in each partition individually but not across all partitions, i.e., sequence numbers are meaningful only within the context of a single partition. For a single processEvents call, the input event list comes from a single partition so I am puzzled that the issue happens when connecting to multiple partitions but not to one.

Yes, we are aware of this. One explanation might be the load. A subscription for a single partition has to process a lot less events than one that subscribes to all partitions. Over the last three days I can see the error only on those regions with the highest load.

Another observation which is not yet fully validated: We stumbled upon the issue because we saw that events that tried to update an entity failed because the create event for that entity was processed after the update. We see this pattern in all our affected nodejs implementation. However, we don't see it in java based implementations of the same stream. We will add explicit detection of the order and validate this observation.

@thomasstoermer
Copy link

We have been able to reproduce the issue with a simple test client:

  1. Create event hub with 32 partitions
  2. Produce 20,000 events on each partition
  3. Start a consumer processing all 32 partitions using:
    • SubscribeOptions described below
    • processEvents handler described below using a random short processing delay simulation
  4. After several thousand events the error should occur
export async function processEvents(events: ReceivedEventData[], context: PartitionContext) {
    if (events.length === 0) {
        return;
    }
    if (
        events.some((event, index) => {
            return (
                !!events[index + 1] && // its not the last event in the list
                event.sequenceNumber > events[index + 1].sequenceNumber // Current sequence number is larger than the next one
            );
        })
    ) {
        console.error(`Events received in wrong order: ${events.map(({ sequenceNumber }) => sequenceNumber).join(' ,')}`);
    }
    const randomDelay = randomNumber(1,10) * 10;
    // simulate processing delay
    return new Promise<void>((resolve) => {
        setTimeout(() => {
            resolve()
        }, randomDelay);
    });
}

function randomNumber(min: number, max: number) {
    return Math.floor(Math.random() * (max - min + 1) + min)
}

SubscribeOptions:

const subscribeOptions: SubscribeOptions = {
    maxWaitTimeInSeconds: 2,
    maxBatchSize: 100,
    startPosition: earliestEventPosition
}

@03byron
Copy link
Author

03byron commented Feb 7, 2023

Update from the internal support ticket: There is a race condition in the js sdk, only. Recommended workaround until there is a fix is to set maxWaitTimeInSeconds to 60. We can confirm that this reduces the frequency of the error but it does not solve the issue completely.

@deyaaeldeen deyaaeldeen added this to the 2023-03 milestone Feb 22, 2023
@deyaaeldeen deyaaeldeen moved this to In Progress in Azure SDK for Event Hubs Feb 22, 2023
deyaaeldeen added a commit that referenced this issue Feb 24, 2023
# Re-implementing the Event Receiver
This PR re-implements the event receiver using promises and a single
queue to fix an ordering issue and to correct waiting behavior.
## Problem Statement [Issue #23993]
A customer reported that the list of events passed into the
`processEvents` callback is not always ordered by `sequenceNumber`. This
leads to processing the events in a wrong order. The customer provided a
sample that prints an out of order message when the `sequenceNumber` of
received messages is not in order and I confirm that I see the message
printed sometimes.
## Analysis
The customer-provided callback, `processEvents`, gets called every time
a batch of events is received from the service. This batch is coming
from a single partition. Events are ordered within a partition by their
`sequenceNumber`, and events received by `processEvents` should be in
the same order. However currently, the list of events the
`processEvents` callback gets called on is not always in-order. Upon
further investigation, it was found that the library implements a
complex logic to read events from the service. It maintains two queues
for reading events, one for building a batch of events that will be sent
to the next call of the `processEvents` callback, and another for when
errors occur or there are no active listeners. The coordination to read
events from the two queues is subtle and is the source of the ordering
bug.
## Re-design
The most straightforward way to simplify this design and to ensure
ordering is to use a single queue and add incoming events to it in the
order they're received. Reading from this queue is as simple as the
following:
- If the queue contains any events, check if their count is already the
`maxMessageCount` or more:
  - If yes, remove `maxMessageCount` events and return them immediately
- If no, wait for a few milliseconds and then remove up to
`maxMessageCount` and return them
- If the queue doesn't contain any events, wait until the
`maxWaitTimeInSeconds` and then return an empty list, or until one or
more event arrive and then return those

### Abstraction

The idea is concisely captured by `waitForEvents`, a newly introduced
function that races a list of promises, one for each of the scenarios
listed above:
https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L733-L739

The first promise resolves right away and is returned if the queue
already has `maxMessageCount` events or more. It corresponds to the
first scenario listed above.

The second promise is created by the `checkOnInterval` function. The
promise is resolved only if the queue has any events in it. Otherwise,
it keeps checking every number of milliseconds. Note that chained to it
is a timer promise that waits another number of milliseconds to give the
service a chance to send more events. This corresponds to the second
scenario listed above.

The third promise is a simple timer promise that is resolved after the
`maxWaitTime` has elapsed. This promise corresponds to the third
scenario.

### Rewrite

In addition to some other minor improvements, the `receiveBatch` method
is concisely rewritten using that abstraction as follows:
https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L578-L628

Notice that the chain of promises makes the algorithm simple to read: a
link is established first, credits are added to it as needed, and then
the waiting starts.

Also, notice that at this point, no actual events were read from the
queue yet, all what this does is waiting until one of the promises
resolve. The actual reading from the queue is thened to that chain so
that it happens only after everything else is said and done. For
example, if an error occurred, it should be handled and we don't want to
prematurely mutate the queue. The reading from the queue is as simple as
the following:
https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L630

## Other changes

### Exporting `core-util`'s `createAbortablePromise`

This function was added in
#24821 and proved to be
useful in this re-write so I am exporting it. I am planning on using it
in core-lro too.

### Updating tests

There are two tests updated, one for authentication and one for
returning events in the presence of retryable and non-retryable errors.

In the former, the receiver is expected to receive events after the auth
token has been invalidated but not yet refreshed. However, I am
observing that a disconnected event has been received at that moment and
the receiver has been deleted. The old receiver's behavior is to
continue receiving despite the deletion but the new one's behavior
correctly cleans up the receiver. I deleted this expectation for now.

In the latter, the test forces an error on the receiver after 50
milliseconds but the receiver already finishes around 40 milliseconds,
so I updated the forced error to happen sooner, at 10 milliseconds:
https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/test/internal/receiveBatch.spec.ts#L107

Finally, a couple test suites were added for `waitForEvents` and
`checkOnInterval` functions.

## Updates in action

Live tests succeed
[[here](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=2201768&view=results)].
Please ignore the timeout in the deployed resources script in canary, it
is an unrelated service issue, see
[[here](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=2198994&view=results)].

A log for how the updated receiver behaves when used by the customer
sample can be found in
[log2.txt](https://github.com/Azure/azure-sdk-for-js/files/10775378/log2.txt).
Notice that the out of order message was never printed.

## Reviewing tips

The changes in `eventHubReceiver.ts` are too many and the diff is not
easily readable. I highly suggest to review
1082692
instead because it is on top of a deleting commit so there is no diff to
wrestle with. The main changes are in `receiveBatch` but please feel
free to review the rest of the module too.
@deyaaeldeen
Copy link
Member

Hi @03byron, thanks for your patience! @azure/event-hubs@5.9.0 has been released today and it contains a fix to this issue. Could you please upgrade to it and confirm whether the issue is indeed fixed?

@deyaaeldeen deyaaeldeen added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Mar 7, 2023
@deyaaeldeen deyaaeldeen moved this from In Progress to Done in Azure SDK for Event Hubs Mar 7, 2023
@03byron
Copy link
Author

03byron commented Mar 10, 2023

Hi @deyaaeldeen, the issue is indeed solved with version 5.9.0. I don't see any occurrences since we updated two days ago. Thank you.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Mar 10, 2023
@deyaaeldeen deyaaeldeen added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Mar 10, 2023
@ghost
Copy link

ghost commented Mar 10, 2023

Hi @03byron. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

@ghost ghost removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 10, 2023
@ghost
Copy link

ghost commented Mar 17, 2023

Hi @03byron, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

@ghost ghost closed this as completed Mar 17, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2023
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that
Projects
Status: Done
Development

No branches or pull requests

5 participants