[QUERY] EventHub Throttle and TU relationship / Latency details / Partition and TU relationship #11034
Event Hubs has support for OpenTelemetry tracing. For more details, you can take a look at the Azure Core Tracing library. Here's a sample of enabling tracing for publishing an event. @serkantkaraca and @JamesBirdsall, could you please take a look at the other two questions @shubhambhattar has posted above?
Regarding the batch throttling question: the service doesn't throttle the first message, so you can send 2000 messages in a batch with 1 TU just fine. The next send attempt, however, will be throttled as expected. Regarding the per-partition throttling question: the service doesn't enforce throttling per partition. 1 MB/sec per partition is just a design recommendation. Depending on various factors, such as network latency and speed and service- and client-side resource states, clients can send more than 1 MB/sec of traffic to each partition just fine.
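As an illustration of the batch scenario discussed above, here is a minimal sketch, assuming the `azure-messaging-eventhubs` Java client and placeholder connection details, of filling one batch (roughly 1 MB, close to 2000 events of ~500 bytes each) and sending it in a single call:

```java
import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

public class SingleBatchSendSketch {
    public static void main(String[] args) {
        // Placeholder connection details -- substitute your own namespace and event hub.
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<EVENT_HUBS_NAMESPACE_CONNECTION_STRING>", "<EVENT_HUB_NAME>")
                .buildProducerClient();

        // A ~500-byte payload; roughly 2000 of these fit into a single ~1 MB batch
        // (slightly fewer in practice because of per-event framing overhead).
        byte[] payload = new byte[500];

        EventDataBatch batch = producer.createBatch();
        int added = 0;
        // tryAdd returns false once the next event would push the batch past its size limit.
        while (batch.tryAdd(new EventData(payload))) {
            added++;
        }

        // One send call carrying the whole batch.
        producer.send(batch);
        System.out.println("Sent " + added + " events in one batch");

        producer.close();
    }
}
```

Per the explanation above, the first such send is accepted even with 1 TU; whether subsequent sends are throttled depends on the namespace's TU setting.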
@serkantkaraca For the second question: so it means that the behavior with #TU > #partitions (across the namespace) will vary, and I just got lucky that in my case performance improved?
@srnagar Thanks for letting me know about the Tracing library. I'll check that.
@serkantkaraca @JamesBirdsall Continuing my above comment (here), I also found this in the FAQ section of Event Hubs: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-faq#what-are-event-hubs-throughput-units where it describes what a single throughput unit entitles you to (up to 1 MB per second or 1000 events per second of ingress, whichever comes first).
Does this mean that the 1000-message limit is for messages which are 1 KB in size? Also, I don't know if this is the desired behavior, but any more clarification on this would be appreciated.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.
@serkantkaraca Also noticed this interesting trend: if I leave the application running for a long time, the number of messages being pushed decreases and throttling drops to 0. The new numbers still don't fit the constraints (I am pushing data into an EventHub with 16 partitions and 20 TU across the namespace, and the namespace has < 20 partitions), but I am still not able to push more than 13K messages/sec.
Sorry, it seems I didn't get notifications for new answers in this thread and just saw them now. With 20 TU, you should be able to push at minimum 20K messages per second. This could be a client-side issue that needs investigation. Can you try a couple of things?
Can you measure with a single client first and see if its performance degrades over time? It seems the publisher traffic drops instantly rather than trending down over time. I also wonder if publishers get stuck and stop sending completely. Are you able to chart traffic per client? See if any of the clients stopped sending completely.
@serkantkaraca This comment was actually for a single client, and the performance did degrade over time. Yes, the traffic drops suddenly, but if you notice in the graph, throttling also stops at exactly the same point. And everything kind of stabilizes at around 13K (which should be at least 16K, as I have 16 partitions and 20 TU). I didn't find any trace of the publisher getting stuck in general; the whole application just keeps sending less and less traffic as time passes.
These days, I haven't found an instance where the application stops sending.
@serkantkaraca I can still see the trend. I restarted my client (only a single client this time) and let it run for 2 days, and the graph keeps going down as the days pass. But I believe this is the cause of this issue, as I'm consuming from one EventHub and pushing to another. I'll hold off for some time until the linked issue is resolved, and start my experiment again after that.
Sorry for the late response. I seem to have missed the notifications in my inbox regarding your replies. Since this is still being investigated, can you try a couple more things which can help pinpoint where the slowdown is happening?
@shubhambhattar Have you tried out the above suggestions from @serkantkaraca?
@srnagar @serkantkaraca Sorry for the delay in responding. This testing has been on and off lately. I can give you the current updates. I have test code running in the EastUS region, and the TU set on the namespace is 20. There's only one EH with 32 partitions, and some test data is being pushed continuously. Regarding point (1), details are available in the above image. The WARN and ERROR logs are below:
In the above case, messages are being pushed in batches, with each batch capped at a maximum size. Points (2) and (3) couldn't be done.
@shubhambhattar, thanks for providing new test data. Can you send me your test namespace so I can check service-side metrics and failures? You can reach me at serkar@microsoft.com.
Service-side metrics are also showing 15K events/sec ingress. The failures should be intermittent, so you can ignore them for now. Better if we focus on your performance concerns. During which part of the testing time frame did you observe degraded performance?
@serkantkaraca In the newer SDK, I didn't observe any significant degradation in performance (the producer is almost constantly sending at 15K events/sec, each event of size …).
@shubhambhattar, so we are good to close this issue and track the new issue only?
Query/Question
I have multiple queries:

1. `1 TU` is `1 MB / sec` or `1000 msgs / sec` (I read it somewhere written as `1000 API calls / sec`), whichever happens first. Imagine I have messages of size `500` bytes; then I would be able to create an `EventDataBatch` containing `~2000` msgs (less than 2000 but close to it) and send it to EventHub with one send call, `eventProducerClient.send(eventDataBatch)`. In this case, the size of `eventDataBatch` will be ~1 MB (less than 1 MB but close to it) and I am making 1 API call (but sending ~2000 msgs in that call). Will my request be throttled?

   Or, put another way: if I know that my per-message size is < 1 KB, should I still limit the `eventDataBatch` to only 1000 messages (and utilize only half of the 1 MB / sec)?

   And if the requests are being throttled, how is the application supposed to know about this? There is only a WARNING log. I raised the relevant bug here: [BUG] EventHubProducerClient is being throttled but not informing the calling application. #11003

2. Is there a way to know the time the EventHub SDK takes to push my `Event` (or `EventDataBatch`) to EventHub? I currently have no latency information from the SDK. I am calculating it in my own code right now (see the sketch after this list for the general idea). Is this how it is supposed to be done? Also, what is the expected latency while pushing data (one `Event` / `EventDataBatch`) to EH?

3. I am trying out the EventHub SDK Consumer and Producer in a sample application where I consume from an EventHub A (having 32 partitions, loads of data available, reading from `EventPosition.earliest()` and not storing checkpoints) and push the messages unmodified to another EventHub B having 5 partitions. Since each partition can only be maxed out with 1 TU, it should be pointless to have more than 5 TU on EventHub B. However, if I enable Auto-Inflate (with max TU allowed at 20) and keep my Consumer and Producer running, it inflates my EventHub to 20 TU and I see a significant gain in performance (more than double compared to keeping TU at 5).

   I am not able to understand this, because no partition can utilize the 15 extra TU that are being allocated by the Auto-Inflate feature. Just to point it out, EventHub B is the only EventHub in that namespace, so the producer EventHub namespace overall only has 5 partitions.
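Regarding the timing question in item 2 above, a minimal sketch of the kind of client-side measurement being described, assuming the synchronous `EventHubProducerClient` and a hypothetical `timedSendMillis` helper (this is an illustration, not necessarily the recommended approach):

```java
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubProducerClient;

public final class SendLatencySketch {
    /**
     * Sends the batch and returns how long the blocking send() call took, in milliseconds.
     * This captures only the client-observed round trip; it does not break down where the
     * time is spent (client, network, or service).
     */
    static long timedSendMillis(EventHubProducerClient producer, EventDataBatch batch) {
        long start = System.nanoTime();
        producer.send(batch); // blocks until the service acknowledges the batch
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With the asynchronous client, the measurement would presumably need to wrap completion of the returned `Mono` rather than the method call itself.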
Why is this not a Bug or a feature Request?
I couldn't categorize it as a bug / feature request because I might be missing a few details in my understanding.
Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.