
Event Hub Client Timeouts when exceeding Event Hub limits #272

Closed
UMCPGrad opened this issue Jun 1, 2023 · 5 comments
UMCPGrad commented Jun 1, 2023

When testing our durable function app against production-level load, we get the following timeout error (with the Netherite Event Hub initially set to 1 Throughput Unit, with auto-inflate enabled up to 40 Throughput Units).

Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: EventHubStart
 ---> System.TimeoutException: Client request timed out.
   at DurableTask.Netherite.Client.CreateTaskOrchestrationAsync(UInt32 partitionId, TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 469
   at DurableTask.Netherite.NetheriteOrchestrationService.DurableTask.Core.IOrchestrationServiceClient.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:line 575
   at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary`2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable`1 startAt) in /_/src/DurableTask.Core/TaskHubClient.cs:line 614
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210
   at RedScope.Collectors.EventCollector.Orchestration.Trigger.EventHubStart(EventData[] events, IDurableOrchestrationClient starter) in C:\__w\1\s\src\RedScope.Collectors.EventCollector\Orchestration\Trigger.cs:line 63
   at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 523
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   --- End of inner exception stack trace ---
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108
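For context on why 1 TU is too little here: each standard-tier Event Hubs Throughput Unit allows up to 1 MB/s or 1000 events/s of ingress (whichever limit is hit first) and 2 MB/s of egress. A rough sizing sketch (illustrated in Python; the per-TU numbers are the published standard-tier limits, the example load figures are made up):

```python
import math

# Published Event Hubs standard-tier limits per Throughput Unit (TU):
# ingress: 1 MB/s or 1000 events/s (whichever is hit first); egress: 2 MB/s.
INGRESS_MB_PER_TU = 1.0
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MB_PER_TU = 2.0

def required_tus(ingress_mb_s, ingress_events_s, egress_mb_s):
    """Estimate the TUs needed to sustain a given load without throttling."""
    return max(
        math.ceil(ingress_mb_s / INGRESS_MB_PER_TU),
        math.ceil(ingress_events_s / INGRESS_EVENTS_PER_TU),
        math.ceil(egress_mb_s / EGRESS_MB_PER_TU),
    )

# Hypothetical load: 25 MB/s ingress, 30k events/s, 50 MB/s egress.
print(required_tus(25, 30_000, 50))  # -> 30 (the events/s limit dominates)
```

With auto-inflate starting at 1 TU, a sudden production-level burst can throttle senders before inflation catches up, which is consistent with the timeouts disappearing when the namespace starts at 40 TUs.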

However, when initially setting the Throughput Units to the max (40 Throughput Units), the Durable Function was able to run without any exceptions.

We also noticed that once the client timeout exceptions started occurring, they kept occurring, even after manually scaling up to 40 Throughput Units.

@sebastianburckhardt (Member) commented:

Most likely this is a problem with the throughput of the EH senders. We have had several reports of this (e.g. #268), and I am working on a change that may help improve this throughput.

However, it would be good to double check whether this is true for your specific case. Do you have an app name and timestamp that I can look at?


UMCPGrad commented Jun 7, 2023

I've actually been testing this in the AirGap Cloud (USNat) where we're able to run this against actual production load.

Just to give you a bit of background, our production service has been running for a little more than a year. It's a relatively simple Durable Function App with an Event Hub Trigger: it grabs events off of the event hub, does a little bit of processing on them, and then sends them to both a Log Analytics workspace and to storage accounts.

We've been using the default Azure Storage backend provider since we developed the service, and it's been pretty stable for about a year.

The main finding from these past few weeks of testing in USNat is that the Netherite Event Hub needs to have sufficient Throughput Units assigned to it. However, I also noticed that when running the same number of instances that we currently run in production (8 instances), the Netherite version performs well for about 4-5 hours, then starts to "slow down", and eventually timeouts happen.

For example, I compared how long requests take in our Event Hub Trigger function under the two backends (Azure Storage vs. Netherite). For some reason, the Trigger function requests take longer and longer under Netherite, until it slows to the point of producing timeouts.

I think this may be due to the load of the Event Hub that we are subscribing to, and I'm working to make sure we can produce the same load in our Test Environment so our Test Function App ("rs-ec-eastus-eh00-neth") shows the same behavior.

@sebastianburckhardt sebastianburckhardt added bug Something isn't working and removed needs author response labels Jun 13, 2023
@sebastianburckhardt (Member) commented:

Traces showed that EH is delivering a corrupted packet that does not deserialize correctly. This sends everything into a tailspin (forever trying to recover but not really having a chance since it just receives the same corrupt packet).

There is a good chance this is a bug in the EH sender code. Since I am just about to replace this code anyway (with the solution to #268) I am prioritizing the fix for #268 which should then also fix this case (and hopefully improve performance as well).
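The "tailspin" failure mode described above is a classic poison-message problem: the receiver retries the same corrupt packet forever and never makes progress. A common general mitigation (sketched here in Python for illustration; this is not Netherite's actual code, and all names are hypothetical) is to cap deserialization attempts per packet and set persistent failures aside:

```python
def drain(packets, deserialize, dead_letter, max_attempts=3):
    """Process packets, moving any that repeatedly fail to deserialize
    into a dead-letter list instead of retrying them forever."""
    results = []
    for packet in packets:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(deserialize(packet))
                break  # success: move on to the next packet
            except ValueError:
                if attempt == max_attempts:
                    dead_letter.append(packet)  # set aside the poison packet
    return results

# Example: the second packet is corrupt and always fails to parse.
dead = []
ok = drain([b"1", b"corrupt", b"3"], lambda p: int(p), dead)
print(ok, dead)  # -> [1, 3] [b'corrupt']
```

The trade-off is that dead-lettering skips data, which a system with strong ordering and durability guarantees cannot always do; in this case the real fix was to replace the sender code so the corrupt packet is never produced.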

@sebastianburckhardt (Member) commented:

The latest release, 1.4.0, now addresses this issue. It contains a blob-batching optimization (#275) that replaces most of the EH sender code (including the snippet you included). The idea is to detect cases where the data volume or the number of messages is not small, store those messages in Azure blobs, and pass only the blob reference through EH, instead of sending all the data through EH itself, which has been the bottleneck.
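The blob-batching idea can be sketched as follows (a simplified illustration of the concept, not the actual #275 code; the cutoff values and all names are hypothetical):

```python
MAX_INLINE_BYTES = 64 * 1024  # hypothetical cutoff: small batches go through EH directly
MAX_INLINE_COUNT = 100        # hypothetical cutoff on the number of messages

def send_batch(messages, send_via_eventhub, upload_blob):
    """Send small batches inline through Event Hubs; offload large ones
    to blob storage and send only the blob reference through EH."""
    total_bytes = sum(len(m) for m in messages)
    if total_bytes <= MAX_INLINE_BYTES and len(messages) <= MAX_INLINE_COUNT:
        send_via_eventhub({"inline": messages})
    else:
        blob_ref = upload_blob(b"".join(messages))  # bulk data goes to a blob
        send_via_eventhub({"blob": blob_ref})       # only the reference transits EH

# Example with stub transports: a 100 KB batch is offloaded to a blob.
sent = []
send_batch([b"x" * 100_000], sent.append, lambda data: "blob-1")
print(sent[0])  # -> {'blob': 'blob-1'}
```

This keeps the per-message payload crossing Event Hubs small and bounded, so EH throughput stops being the limiting factor for large batches.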

I think there is a good chance this will fix the issue you observed here. If not, let me know and I will take another look.

@sebastianburckhardt sebastianburckhardt added needs author response and removed bug Something isn't working labels Jul 8, 2023
@lilyjma (Member) commented Sep 13, 2023

Hi @UMCPGrad, did you get a chance to try the latest release Sebastian mentioned? We'd love to know whether that resolved the problem you saw. Thanks!
