-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Event Hub Client Timeouts when exceeding Event Hub limits #272
Comments
Most likely this is a problem with the throughput of the EH senders. We have had several reports of this (e.g. #268 ) and I am working on a change that may help improve this throughput. However, it would be good to double check whether this is true for your specific case. Do you have an app name and timestamp that I can look at? |
I've actually been testing this in the AirGap Cloud (USNat) where we're able to run this against actual production load. Just to give you a bit of a background, our production service has been running for a little more than a year. It's a relatively simple Durable Function App that has an Event Hub Trigger. It basically grabs events off of the event hub, does a little bit of processing on the events, and then sends the events to both a Log Analytics workspace, and also to storage accounts. We've been using the default Azure Storage backend provider since we developed the service, and it's been pretty stable for about a year. The main finding that I found these past few weeks of testing in USNat is that the Netherite Event Hub needs to have sufficent Throughput Units assigned to it. However, I also noticed that when running the same number of instances that we currently run in production (8 Instances), the Netherite version performs well for about 4 - 5 hours, then it starts to "slow" down, and eventually timeouts happen. For example, I compared how long requests in our Event Hub Trigger function (storage backend vs Netherite). For some reason, the Trigger function requests start to take longer and longer in Netherite, until it slows to the point of creating timeouts. I think this may be due to the load of the Event Hub that we are subscribing to, and I'm working to make sure we can produce the same load in our Test Environment so our Test Function App ("rs-ec-eastus-eh00-neth") shows the same behavior. |
Traces showed that EH is delivering a corrupted packet that does not deserialize correctly. This sends everything into a tailspin (forever trying to recover but not really having a chance since it just receives the same corrupt packet). There is a good chance this is a bug in the EH sender code. Since I am just about to replace this code anyway (with the solution to #268) I am prioritizing the fix for #268 which should then also fix this case (and hopefully improve performance as well). |
The latest release 1.4.0 now addresses this issue. It contains a blob-batching optimization #275 that replaces most of the EH sender code (including the snippet you included). The idea is to detect cases where the data volume or the number of messages is not small, and then use Azure blobs to store those messages, just passing the blob reference through EH, instead of sending all the data through the EH which has been a bottleneck. I think there is a good chance this will fix the issue you observed here. If not, let me know and I will take another look. |
Hi @UMCPGrad, did you get a chance to try the latest release Sebastian mentioned? We'd love to know whether that resolved the problem you saw. Thanks! |
When testing out our durable function app against production level load, we get the following timeout error (when initially setting the Netherite Event Hub to have 1 Throughput Unit, which can auto-inflate up to 40 Throughput Units).
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: EventHubStart
---> System.TimeoutException: Client request timed out.
at DurableTask.Netherite.Client.CreateTaskOrchestrationAsync(UInt32 partitionId, TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in //src/DurableTask.Netherite/OrchestrationService/Client.cs:line 469
at DurableTask.Netherite.NetheriteOrchestrationService.DurableTask.Core.IOrchestrationServiceClient.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in //src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:line 575
at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary
2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable
1 startAt) in /_/src/DurableTask.Core/TaskHubClient.cs:line 614at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210
at RedScope.Collectors.EventCollector.Orchestration.Trigger.EventHubStart(EventData[] events, IDurableOrchestrationClient starter) in C:__w\1\s\src\RedScope.Collectors.EventCollector\Orchestration\Trigger.cs:line 63
at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker
2.InvokeAsync(TReflected instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20 at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker
2.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 523
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
--- End of inner exception stack trace ---
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108
However, when initially setting the Throughput Units to the max (40 Throughput Units), the Durable Function was able to run without any exceptions.
We also noticed that once the client timeout exceptions occurred, they always seemed to occur, even after manually scaling up to 40 Throughput Units
The text was updated successfully, but these errors were encountered: