Events passed to processEvents are not always ordered by sequenceNumber #23993
Thanks for the report! Could you share the subscription ID, the Event Hubs namespace, and the Event Hub name, in addition to an approximate time window when the behavior was observed?
Hi @deyaaeldeen - thank you for the reply. We provided the requested details via a service ticket (ID 2212010050000732). Some findings worth sharing in the meantime:
Because of the above observations, our assumption is that there is a race condition in the JS SDK.
Thanks a lot for sharing the details and the additional observations!
From the above, do I understand correctly that you observe the issue when the consumer is subscribed to multiple partitions at the same time, but not when subscribed to a single one? Please note that events are ordered within each partition individually but not across partitions, i.e., sequence numbers are meaningful only within the context of a single partition. I will ask the service team to take a look. In the meantime, please keep us posted on any further observations.
Correct.
Yes, we are aware of this. One explanation might be the load: a subscription for a single partition has to process far fewer events than one that subscribes to all partitions. Over the last three days, I can see the error only in those regions with the highest load. Another observation, which is not yet fully validated: we stumbled upon the issue because we saw that events that tried to update an entity failed, because the create event for that entity was processed after the update. We see this pattern in all our affected Node.js implementations. However, we don't see it in Java-based implementations of the same stream. We will add explicit detection of the order and validate this observation.
We have been able to reproduce the issue with a simple test client:

```ts
import {
  earliestEventPosition,
  PartitionContext,
  ReceivedEventData,
  SubscribeOptions,
} from "@azure/event-hubs";

export async function processEvents(events: ReceivedEventData[], context: PartitionContext) {
  if (events.length === 0) {
    return;
  }
  if (
    events.some((event, index) => {
      return (
        !!events[index + 1] && // it's not the last event in the list
        event.sequenceNumber > events[index + 1].sequenceNumber // current sequence number is larger than the next one
      );
    })
  ) {
    console.error(`Events received in wrong order: ${events.map(({ sequenceNumber }) => sequenceNumber).join(' ,')}`);
  }
  // simulate processing delay
  const randomDelay = randomNumber(1, 10) * 10;
  return new Promise<void>((resolve) => {
    setTimeout(() => {
      resolve();
    }, randomDelay);
  });
}

function randomNumber(min: number, max: number) {
  return Math.floor(Math.random() * (max - min + 1) + min);
}
```

SubscribeOptions:

```ts
const subscribeOptions: SubscribeOptions = {
  maxWaitTimeInSeconds: 2,
  maxBatchSize: 100,
  startPosition: earliestEventPosition,
};
```
Update from the internal support ticket: there is a race condition in the JS SDK only. The recommended workaround until there is a fix is to set
# Re-implementing the Event Receiver

This PR re-implements the event receiver using promises and a single queue to fix an ordering issue and to correct waiting behavior.

## Problem Statement [Issue #23993]

A customer reported that the list of events passed into the `processEvents` callback is not always ordered by `sequenceNumber`, which leads to processing the events in the wrong order. The customer provided a sample that prints an out-of-order message when the `sequenceNumber` of received messages is not in order, and I can confirm that I see the message printed sometimes.

## Analysis

The customer-provided callback, `processEvents`, gets called every time a batch of events is received from the service. This batch comes from a single partition. Events are ordered within a partition by their `sequenceNumber`, and events received by `processEvents` should be in the same order. However, currently, the list of events the `processEvents` callback is called with is not always in order.

Upon further investigation, it was found that the library implements complex logic to read events from the service. It maintains two queues for reading events: one for building a batch of events that will be sent to the next call of the `processEvents` callback, and another for when errors occur or there are no active listeners. The coordination to read events from the two queues is subtle and is the source of the ordering bug.

## Re-design

The most straightforward way to simplify this design and to ensure ordering is to use a single queue and add incoming events to it in the order they're received.
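The single-queue idea can be sketched as follows. This is a minimal illustration with hypothetical names, not the SDK's actual internals: events are appended in arrival order and only ever removed from the front, so per-partition order is preserved by construction.

```typescript
// Minimal sketch of the single-queue design (hypothetical, not the SDK's actual code).
interface QueuedEvent {
  sequenceNumber: number;
}

class EventQueue<T> {
  private items: T[] = [];

  // Called from the message handler: append in the order received.
  push(item: T): void {
    this.items.push(item);
  }

  get length(): number {
    return this.items.length;
  }

  // Remove up to `count` events from the front of the queue.
  takeUpTo(count: number): T[] {
    return this.items.splice(0, count);
  }
}

const queue = new EventQueue<QueuedEvent>();
[1, 2, 3, 4, 5].forEach((n) => queue.push({ sequenceNumber: n }));
const batch = queue.takeUpTo(3); // sequence numbers 1, 2, 3 in order
```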
Reading from this queue is as simple as the following:

- If the queue contains any events, check whether their count is already `maxMessageCount` or more:
  - If yes, remove `maxMessageCount` events and return them immediately
  - If no, wait for a few milliseconds and then remove up to `maxMessageCount` events and return them
- If the queue doesn't contain any events, wait until `maxWaitTimeInSeconds` has elapsed and then return an empty list, or until one or more events arrive and then return those

### Abstraction

The idea is concisely captured by `waitForEvents`, a newly introduced function that races a list of promises, one for each of the scenarios listed above:

https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L733-L739

The first promise resolves right away and is returned if the queue already has `maxMessageCount` events or more. It corresponds to the first scenario listed above. The second promise is created by the `checkOnInterval` function. The promise is resolved only if the queue has any events in it; otherwise, it keeps checking every few milliseconds. Note that chained to it is a timer promise that waits a few more milliseconds to give the service a chance to send more events. This corresponds to the second scenario listed above. The third promise is a simple timer promise that is resolved after the `maxWaitTime` has elapsed. This promise corresponds to the third scenario.

### Rewrite

In addition to some other minor improvements, the `receiveBatch` method is concisely rewritten using that abstraction as follows:

https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L578-L628

Notice that the chain of promises makes the algorithm simple to read: a link is established first, credits are added to it as needed, and then the waiting starts.
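The racing idea behind the three scenarios can be sketched like this. All names, signatures, and default intervals here are illustrative assumptions, not the SDK's actual API:

```typescript
// Hypothetical sketch of racing the three waiting scenarios (not the SDK's actual code).
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForEvents(
  queueLength: () => number,
  maxMessageCount: number,
  maxWaitTimeInMs: number,
  checkIntervalInMs = 10,
  graceAfterFirstEventInMs = 10
): Promise<void> {
  // Scenario 1: enough events are already buffered; resolve immediately.
  if (queueLength() >= maxMessageCount) {
    return;
  }
  let settled = false;
  // Scenario 2: poll until at least one event arrives, then wait a little
  // longer to give the service a chance to send more events.
  const checkOnInterval = (async () => {
    while (!settled && queueLength() === 0) {
      await delay(checkIntervalInMs);
    }
    if (!settled) {
      await delay(graceAfterFirstEventInMs);
    }
  })();
  // Scenario 3: give up after the maximum wait time (the caller then returns an empty batch).
  const maxWait = delay(maxWaitTimeInMs);
  await Promise.race([checkOnInterval, maxWait]);
  settled = true; // stop the polling loop once the race settles
}
```

Whichever promise wins the race, the caller then drains up to `maxMessageCount` events from the queue, which may legitimately be zero in the third scenario.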
Also notice that, at this point, no actual events have been read from the queue yet; all this does is wait until one of the promises resolves. The actual reading from the queue is chained onto that promise with `.then` so that it happens only after everything else is said and done. For example, if an error occurred, it should be handled first; we don't want to prematurely mutate the queue. The reading from the queue is as simple as the following:

https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/src/eventHubReceiver.ts#L630

## Other changes

### Exporting `core-util`'s `createAbortablePromise`

This function was added in #24821 and proved to be useful in this rewrite, so I am exporting it. I am planning on using it in core-lro too.

### Updating tests

Two tests were updated: one for authentication and one for returning events in the presence of retryable and non-retryable errors. In the former, the receiver is expected to receive events after the auth token has been invalidated but not yet refreshed. However, I am observing that a disconnected event is received at that moment and the receiver is deleted. The old receiver's behavior was to continue receiving despite the deletion, but the new one correctly cleans up the receiver. I deleted this expectation for now. In the latter, the test forces an error on the receiver after 50 milliseconds, but the receiver already finishes around 40 milliseconds, so I updated the forced error to happen sooner, at 10 milliseconds:

https://github.com/Azure/azure-sdk-for-js/blob/10826927554e7254dce0a4849f1e0c8219373522/sdk/eventhub/event-hubs/test/internal/receiveBatch.spec.ts#L107

Finally, a couple of test suites were added for the `waitForEvents` and `checkOnInterval` functions.

## Updates in action

Live tests succeed [here](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=2201768&view=results).
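The "wait first, dequeue only afterwards" ordering can be sketched as follows. This is an illustrative stand-in (hypothetical names, a stubbed wait), not the SDK's actual implementation:

```typescript
// Illustrative sketch: the dequeue is chained onto the waiting promise, so the
// queue is only mutated after the wait has settled cleanly.
const queue: number[] = [];

function waitForEvents(): Promise<void> {
  // Stand-in for the racing logic described above; resolves once events are
  // buffered or the wait time has elapsed. A rejection here (e.g. a link error)
  // would propagate before any event is removed from the queue.
  return new Promise((resolve) => setTimeout(resolve, 10));
}

async function receiveBatch(maxMessageCount: number): Promise<number[]> {
  await waitForEvents(); // nothing has been dequeued yet at this point
  return queue.splice(0, maxMessageCount); // reading happens only after the wait settles
}
```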
Please ignore the timeout in the deployed-resources script in canary; it is an unrelated service issue, see [here](https://dev.azure.com/azure-sdk/internal/_build/results?buildId=2198994&view=results). A log of how the updated receiver behaves when used with the customer sample can be found in [log2.txt](https://github.com/Azure/azure-sdk-for-js/files/10775378/log2.txt). Notice that the out-of-order message was never printed.

## Reviewing tips

The changes in `eventHubReceiver.ts` are numerous and the diff is not easily readable. I highly suggest reviewing 1082692 instead, because it is on top of a deleting commit, so there is no diff to wrestle with. The main changes are in `receiveBatch`, but please feel free to review the rest of the module too.
Hi @03byron, thanks for your patience! @azure/event-hubs@5.9.0 has been released today and it contains a fix for this issue. Could you please upgrade to it and confirm whether the issue is indeed fixed?
Hi @deyaaeldeen, the issue is indeed solved with version 5.9.0. I don't see any occurrences since we updated two days ago. Thank you.
Hi @03byron. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “
Hi @03byron, since you haven’t asked that we “
Describe the bug
The list of events passed into the `processEvents` callback is not always ordered by `sequenceNumber`. This leads to processing the events in the wrong order. If that is intended, it would be good to mention it in the documentation.

It does not happen very often: we see 6 occurrences in the last 4 hours, during which we processed ~117k events.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No error should appear in the log, as events should be sorted by `sequenceNumber`.
Additional context
The samples we have contain all consecutive events in one call to `processEvents`. They are just not ordered appropriately. We see log messages like:
Example 1:

```
Events received in wrong order: 78716749 ,78716747 ,78716748 ,78716750 ,78716751 ,78716752 ,78716753 ,78716754 ,78716755 ,78716756 ,78716757 ,78716758 ,78716759 ,78716760 ,78716761 ,78716762 ,78716763 ,78716764 ,78716765 ,78716766 ,78716767 ,78716768 ,78716769 ,78716770 ,78716771 ,78716772 ,78716773 ,78716774 ,78716775 ,78716776 ,78716777 ,78716778 ,78716779
```

Example 2:

```
Events received in wrong order: 402880 ,402881 ,402882 ,402877 ,402878 ,402879 ,402883 ,402884 ,402885 ,402886 ,402887 ,402888 ,402889 ,402890 ,402891 ,402892 ,402893 ,402894 ,402895 ,402896 ,402897 ,402898 ,402899 ,402900
```