
[QUERY] EventHub Throttle and TU relationship / Latency details / Partition and TU relationship #11034

Closed
2 tasks done
shubhambhattar opened this issue May 11, 2020 · 21 comments
Assignees
Labels
Client (This issue points to a problem in the data-plane of the library.) · customer-reported (Issues that are reported by GitHub users external to the Azure organization.) · Event Hubs · question (The issue doesn't require a change to the product in order to be resolved. Most issues start as that.) · Service Attention (Workflow: This issue is responsible by Azure service team.)

Comments

@shubhambhattar
Contributor

shubhambhattar commented May 11, 2020

Query/Question

I have a few queries:

  • 1 TU is 1 MB / sec or 1000 msgs / sec (I read it somewhere written as 1000 API calls / sec), whichever happens first. Imagine my messages are 500 bytes each; then I would be able to create an EventDataBatch containing ~2000 msgs (less than 2000 but close to it) and send it to EventHub with one send call, eventProducerClient.send(eventDataBatch). In this case, the size of eventDataBatch will be ~1 MB (less than 1 MB but close to it) and I am making 1 API call (but sending ~2000 msgs in that call). Will my request be throttled?

    Or, put another way: if I know that my per-message size is < 1 KB, should I still limit the eventDataBatch to only 1000 messages (thereby utilizing only half of the 1 MB / sec)?

    And if the requests are being throttled, how is the application supposed to know about this? There is only a WARNING log. I raised the relevant BUG here: [BUG] EventHubProducerClient is being throttled but not informing the calling application. #11003

  • Is there a way to know how long the EventHub SDK takes to push my Event (or EventDataBatch) to EventHub? I currently have no latency information from the SDK. I am calculating it in my own code right now, like this:

      try (com.codahale.metrics.Timer.Context ignored = latency.time()) { // dependency io.dropwizard.metrics:metrics-graphite:4.1.7
          eventHubProducerClient.send(eventDataBatch);
      }
    

    Is this how this is supposed to be done? Also, what is the expected latency while pushing data (one Event / EventDataBatch) to EH?

  • I am trying out the EventHub SDK Consumer and Producer in a sample application where I consume from EventHub A (32 partitions, loads of data available, reading from EventPosition.earliest() and not storing checkpoints) and push the messages unmodified to another EventHub B with 5 partitions. Since each partition can only be maxed out with 1 TU, it should be pointless to have more than 5 TU on EventHub B. However, if I enable Auto-Inflate (with max TU allowed at 20) and keep my Consumer and Producer running, it inflates my EventHub to 20 TU and I can see a significant gain in performance (more than double compared to keeping TU at 5).

    I am not able to understand this, because no partition should be able to utilize the 15 extra TU being allocated by the Auto-Inflate feature. Just to point it out: EventHub B is the only EventHub in that namespace, so the producer EventHub namespace overall has only 5 partitions.
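The arithmetic behind the first bullet can be sketched as follows. This is a back-of-the-envelope model assuming the documented limits of 1 TU = 1 MB/sec ingress or 1000 events/sec, whichever comes first; the class and method names are illustrative, and the actual service-side accounting may differ:

```java
// Back-of-the-envelope TU math (illustrative model, not the real accounting):
// 1 TU = 1 MB/sec ingress OR 1000 events/sec, whichever limit is hit first.
public class TuMath {
    static long maxEventsPerSecond(long eventSizeBytes, int tus) {
        long byBytes = (tus * 1_000_000L) / eventSizeBytes; // 1 MB/sec per TU
        long byCount = tus * 1_000L;                        // 1000 events/sec per TU
        return Math.min(byBytes, byCount);                  // the tighter limit binds
    }

    public static void main(String[] args) {
        // 500-byte events: bytes alone would allow ~2000/sec,
        // but the 1000 events/sec limit binds first.
        System.out.println(maxEventsPerSecond(500, 1));  // 1000
        // 2000-byte events: the byte limit (500/sec) binds first.
        System.out.println(maxEventsPerSecond(2000, 1)); // 500
    }
}
```

Under this model, sub-1-KB messages always hit the event-count limit before the byte limit, which is exactly the concern raised in the bullet above.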

Why is this not a Bug or a feature Request?
I couldn't categorize it as bug / feature request because I might be missing a few details in my understanding.

Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Query Added
  • Setup information Added
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels May 11, 2020
@shubhambhattar shubhambhattar changed the title [QUERY] EventHub Throttle and TU relationship [QUERY] EventHub Throttle and TU relationship / Latency details / Partition and TU relationship May 11, 2020
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Event Hubs labels May 11, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label May 11, 2020
@joshfree
Member

@srnagar @conniey PTAL

@srnagar
Member

srnagar commented May 13, 2020

Is there a way to know the time EventHub SDK takes to push my Event (or EventDataBatch) to EventHub?

Event Hubs has support for OpenTelemetry tracing. For more details, you can take a look at Azure Core Tracing library. Here's a sample of enabling tracing for publishing an event.

@serkantkaraca and @JamesBirdsall could you please take a look at the other two questions @shubhambhattar has posted above?

@serkantkaraca
Member

serkantkaraca commented May 13, 2020

Regarding the batch throttling question: the service doesn't throttle the first message, so you can send 2000 messages in a batch with 1 TU just fine. The next send attempt, however, will be throttled as expected.

Regarding the per-partition throttling question: the service doesn't enforce throttling per partition. 1 MB/sec per partition is just a design recommendation. Depending on various factors, such as network latency and speed and service- and client-side resource state, clients can send more than 1 MB/sec of traffic to each partition just fine.
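Since the send after a throttled one is the one that gets rejected (and #11003 notes the client only logs a WARNING), the usual client-side reaction is retry with backoff. The SDK has its own built-in retry policy; the sketch below is a generic, self-contained illustration of the pattern, with no Azure SDK types involved:

```java
import java.util.concurrent.Callable;

// Generic exponential-backoff retry sketch. The exception caught here stands
// in for whatever transient "server busy" error the client surfaces; the real
// SDK ships its own retry policy, so this only illustrates the pattern.
public class BackoffRetry {
    static <T> T withRetry(Callable<T> op, int maxAttempts, long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;       // give up after maxAttempts
                long delay = baseDelayMs << (attempt - 1); // base, 2x, 4x, ... ms
                Thread.sleep(delay);
            }
        }
    }
}
```

A caller would wrap the send in `withRetry(() -> { producer.send(batch); return null; }, 5, 100)`, accepting that a throttled second send simply waits out the backoff.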

@shubhambhattar
Contributor Author

@serkantkaraca
For the first question: the limitation of 1000 msgs seems odd then, because I'll never be able to fully utilize my TU and will always have to provision more if the message size is less than 1 KB (which it is in my case).

For the second question: so it means that the behavior when #TU > #partitions (across the namespace) will vary, and I just got lucky that in my case performance improved?

@shubhambhattar
Contributor Author

@srnagar Thanks for letting me know about the Tracing library. I'll check that.

@shubhambhattar
Contributor Author

@serkantkaraca @JamesBirdsall Continuing my above comment (here), I also found this on FAQ section of eventhub: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-faq#what-are-event-hubs-throughput-units

where it states that:

Throughput in Event Hubs defines the amount of data in megabytes or the number (in thousands) of 1-KB events that ingress and egress through Event Hubs.

Does this mean that the 1000 message limit is for those messages which are 1 KB in size?

Also, I don't know if this is the desired behavior, but EventDataBatch can actually store more than 1000 events.

Any more clarification on this would be appreciated.
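On the "more than 1000 events" observation: the EventDataBatch cap is byte-based (the maximum AMQP message size), not count-based, so a batch of small events can legitimately hold far more than 1000 of them. A simplified, self-contained model of that size-capped tryAdd behavior (not the real SDK class; the real EventDataBatch also counts per-event AMQP framing overhead, so its effective capacity is smaller):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a size-capped batch, illustrating why the cap is on
// bytes rather than event count. Not the real EventDataBatch: the SDK class
// also accounts for per-event AMQP framing overhead.
public class SizeLimitedBatch {
    private final int maxBytes;
    private final List<byte[]> events = new ArrayList<>();
    private int currentBytes = 0;

    public SizeLimitedBatch(int maxBytes) { this.maxBytes = maxBytes; }

    // Mirrors the tryAdd pattern: returns false when the event won't fit.
    public boolean tryAdd(byte[] event) {
        if (currentBytes + event.length > maxBytes) return false;
        events.add(event);
        currentBytes += event.length;
        return true;
    }

    public int getCount() { return events.size(); }
}
```

Filling a 1,000,000-byte batch with 300-byte events this way yields over 3000 events per batch, well past the 1000-events figure, which is consistent with the count limit applying to per-second throughput accounting rather than to batch capacity.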

@srnagar srnagar assigned serkantkaraca and unassigned srnagar May 20, 2020
@srnagar srnagar added the Service Attention Workflow: This issue is responsible by Azure service team. label May 20, 2020
@ghost

ghost commented May 20, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.

@srnagar srnagar removed the Client This issue points to a problem in the data-plane of the library. label May 20, 2020
@shubhambhattar
Contributor Author

https://imgur.com/a/2Ztbakw

@serkantkaraca Also noticed this interesting trend: if I leave the application running for a long time, the number of messages being pushed decreases and throttling drops to 0. The new numbers still don't fit the constraints (I am pushing data into an EventHub with 16 partitions and 20 TU across the namespace, and the namespace has < 20 partitions), but I am still not able to push more than 13K messages / sec.

@serkantkaraca
Member

Sorry, it seems I didn't get notifications for new answers in this thread and just saw them now.

With 20 TU, you should be able to push at minimum 20K messages per second. This could be a client-side issue that needs investigation. Can you try a couple of things?

  • See if you can reproduce with another SDK, like .NET.
  • Try increasing the number of clients. This will scale the traffic out over more connections.
  • Check the "bytes in" and "messages in" metrics on the namespace dashboard in the Azure portal and see if the metrics show any anomalies.

@shubhambhattar
Contributor Author

@serkantkaraca

  • Can only work with Java SDK :(
  • Increased the number of producers to 2. The same trend is visible.
  • The same trend is visible there also. Please check this: https://imgur.com/a/GN34UCi

@serkantkaraca
Member

serkantkaraca commented Jun 4, 2020

Can you measure with a single client first and see if its performance degrades over time? It seems the publisher traffic drops instantly rather than trending down over time. I also wonder if the publishers get stuck and stop sending completely. Are you able to chart traffic per client? See if any of the clients stopped sending completely.

@shubhambhattar
Contributor Author

@serkantkaraca This comment was actually for a single client, and the performance did degrade over time. Yes, the traffic drops suddenly, but if you notice in the graph, at exactly the same point the throttling also stops. And everything kind of achieves stability at around 13K (which should be at least 16K, as I have 16 partitions and 20 TU).

I didn't find any trace of the publisher getting stuck in general; the whole application just keeps sending less and less traffic as time passes.

@shubhambhattar
Contributor Author

These days I haven't found an instance where the application stops sending completely.

@shubhambhattar
Contributor Author

@serkantkaraca I can still see the trend. I restarted my client (only a single client this time) and let it run for 2 days, and the graph keeps going down as the days pass.

But I believe this is the cause of this issue, as I'm consuming from one EventHub and pushing to another.

I'll hold this for some time, until the linked issue is resolved and start my experiment again after that.

@serkantkaraca
Member

serkantkaraca commented Jun 23, 2020

Sorry for the late response. I seem to have missed notifications in my inbox regarding your replies.

Since this is still being investigated, can you try a couple more things that can help pinpoint where the slowdown is happening?

  1. Add a monitor to your client to calculate the events-per-minute rate. When the rate drops below a certain point, recreate the EH client in the same process and switch to the new client. See if throughput improves.

  2. Try running with a new namespace in some other region, if possible.

  3. Try sending with Service Bus Explorer, which uses the .NET client. See if you can reproduce the same slowdown there.
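Suggestion (1) above can be sketched as a small windowed rate monitor around the send loop. This is a self-contained illustration with invented names; reacting to a low rate (e.g. by recreating the EH client) is left to the caller:

```java
// Minimal events-per-window rate monitor, as in suggestion (1).
// The caller records each successful send and periodically checks the rate;
// when it drops below a threshold, it can recreate the EH client.
public class RateMonitor {
    private final long windowMs;
    private long windowStart;
    private long count = 0;
    private double lastRatePerSec = -1; // -1 until a window completes

    public RateMonitor(long windowMs, long nowMs) {
        this.windowMs = windowMs;
        this.windowStart = nowMs;
    }

    // Record one sent event at time nowMs; rolls the window once it elapses.
    public void record(long nowMs) {
        if (nowMs - windowStart >= windowMs) {
            lastRatePerSec = count * 1000.0 / (nowMs - windowStart);
            windowStart = nowMs;
            count = 0;
        }
        count++;
    }

    // Rate measured over the last completed window, or -1 if none yet.
    public double lastRatePerSec() { return lastRatePerSec; }
}
```

In the send loop, `monitor.record(System.currentTimeMillis())` after each send and a periodic check of `lastRatePerSec()` against a threshold is enough to trigger the client-recreation step.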

@joshfree joshfree added the Client This issue points to a problem in the data-plane of the library. label Jul 15, 2020
@srnagar
Member

srnagar commented Aug 19, 2020

@shubhambhattar Have you tried out above suggestions from @serkantkaraca ?

@shubhambhattar
Contributor Author

shubhambhattar commented Aug 24, 2020

@srnagar @serkantkaraca Sorry for the delay in response. This testing has been on and off lately. I can give you the current updates.

I have test code running in the EastUS region, and the TU set on the namespace is 20. There's only one EH, with 32 partitions, and some test data is being pushed continuously.

https://imgur.com/a/wJcH0F9

Regarding point (1), details are available in the above image. The WARN and ERROR logs are below:

2020-08-15 01:08:08,971 [single-1] WARN  c.a.c.a.i.h.ConnectionHandler - onTransportError hostname[some-namespace.servicebus.windows.net:5671], connectionId[MF_e9cc2e_1597253463109], error[Connection reset by peer]
2020-08-15 01:08:08,972 [single-1] ERROR c.a.c.a.i.ReactorConnection - connectionId[MF_e9cc2e_1597253463109] Error occurred in connection handler.
Connection reset by peer, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-15 01:09:08,977 [single-1] WARN  c.a.m.e.i.EventHubConnectionProcessor - Retry #1. Transient error occurred. Retrying after 4511 ms.
Connection reset by peer, errorContext[NAMESPACE: some-namespace.servicebus.windows.net]
2020-08-22 20:19:52,216 [single-1] WARN  c.a.c.a.i.ReactorSender - entityPath[dummy], linkName[dummy], deliveryTag[9080090bed6d44f4a27ac0369654e4e2]: Delivery rejected. [Rejected{error=Error{condition=amqp:internal-error, description='The service was unable to process the request; please retry the operation. For more information on exception types and proper exception handling, please refer to http://go.microsoft.com/fwlink/?LinkId=761101 Reference:09c23d8c-1dc9-414b-ba56-3c4a91c43500, TrackingId:6581a5f100009ae5000055315f3735bc_G6_B3, SystemTracker:some-namespace:eventhub:dummy~9215, Timestamp:2020-08-22T20:19:52', info=null}}]

In the above case, messages are being pushed in batches where each batch's maximum size is 1000000 bytes. A random byte[] array is created and pushed into the batch until the batch reaches the maximum size. The graph above accounts for all the messages in the batch (it's the number of messages pushed per second, not the number of batches pushed per second).

(2) and (3) couldn't be done.

@serkantkaraca
Member

@shubhambhattar, thanks for providing new test data. Can you send me your test namespace so I can check service-side metrics and failures? You can reach me at serkar@microsoft.com.

@serkantkaraca
Member

Service-side metrics are also showing 15K events/sec ingress. The failures should be intermittent; you can ignore them for now. Better if we focus on your performance concerns. During which part of the testing time frame did you observe degraded performance?

@shubhambhattar
Contributor Author

@serkantkaraca With the newer SDK, I didn't observe any significant degradation in performance (the producer is almost constantly sending at 15K events / sec, each event 300 bytes in size, pushed in batches). I did observe degradation in the consumer rather than the producer, and for that I've opened #14652.

@serkantkaraca
Member

@shubhambhattar, so we are good to close this issue and track the new issue only?

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023