Perhaps longer latencies for Netherite for medium-sized payloads #384
Comments
I have done some more tests on this and my original hypothesis seems correct: Netherite is taking longer to start orchestrators and activities.
Our code has to pass this information down from orchestrators to activities because it operates on Cosmos data that we've cobbled together from various document types into a materialized object at the higher levels, and that object is then used lower down in, let's say, 10 places. Ergo, fetching it back from Cosmos would be 10x more expensive than fetching it once. And for better or worse, we're not able to change that now, as we're expected to go live in less than a month. @sebastianburckhardt you added blob storage to avoid transmitting large messages across Event Hubs for me. Is there a configuration element we can tweak so that logic helps us in our quest for lower latency here? As a first step, I'm sure you'll ask me how "large" our messages are at max and on average, so I'll turn that back to you: is there something you know of that could be used to report that info with little fuss? 🙏
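Purely as a sketch of what I could do on my side to get those numbers, assuming the inputs round-trip through Newtonsoft.Json the way the Durable extension serializes them by default; `LogPayloadSize` is a hypothetical helper of mine, not an existing API:

```csharp
using System.Text;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class PayloadMetrics
{
    // Hypothetical helper: serialize the input roughly the way the framework would
    // and log its UTF-8 size just before handing it to an orchestrator/activity.
    public static void LogPayloadSize(ILogger log, string label, object input)
    {
        string json = JsonConvert.SerializeObject(input);
        int bytes = Encoding.UTF8.GetByteCount(json);
        log.LogInformation("{Label} payload is {Bytes} bytes ({Mb:F2} MB)",
            label, bytes, bytes / (1024.0 * 1024.0));
    }
}
```

Calling something like this right before each `CallActivityAsync` / `StartNewAsync` would at least give us max and average sizes per flow without touching the backend.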
I wouldn't mind if this were answered more quickly.
Hey @ericleigh007: sorry for the delay, it's been busy :-).
This is an interesting theory; at a glance I see the logic, but I can't immediately confirm that this is the culprit. I do know that @sebastianburckhardt implemented a related optimization here (#275), but scanning through the PR, I don't see any top-level configuration knobs.
Good guess. It may be that we already have telemetry for this - can you provide me with your app name, some affected instance IDs, and a time range in UTC that I can use to dig deeper?
Thank you. I will collect these tomorrow and send them via your Gmail, if that's fine.
Thanks @ericleigh007: Yes, that's fine, go ahead. Just please tag me here once sent so that I can check my email in case it did not land in the inbox.
Hello @davidmrdavid, I have sent some information to your Gmail which I hope is of help.
So @davidmrdavid and @sebastianburckhardt, I know you said there are no knobs to turn externally, but I wonder if you can point me to something "internal" (I can build the source, you know) so that I can maybe find a better setting for what we're doing. My hypothesis is that the tradeoff point isn't appropriate for our workload: maybe it gives Event Hubs priority for the payload, letting things take a few seconds to transfer. I'd probably rather use the blobs earlier, IF that won't blow our blob backend throughput limit. I don't think it should, since I believe the old backend's throughput hit a wall because of the sheer number of operations (counting scans) against storage queues. Give me some ideas, please, thanks!
Thanks @ericleigh007: Just confirming I've seen the email btw. I'll look to carve out time to see the data.
Just for transparency, Sebastian is currently out of office so unfortunately he won't be able to assist in the immediate term, and he's indeed the SME here who'd more readily be able to point you to something internal. But I'm nonetheless hoping we can make a dent here in the meantime. I do think that studying this PR (https://github.com/microsoft/durabletask-netherite/pull/275/files) seems like our best bet. In there, I see the creation of two new utility classes used to deal with large messages:
I think there are performance knobs in each that we could tweak. For example, see durabletask-netherite/src/DurableTask.Netherite/TransportLayer/EventHubs/BlobBatchSender.cs, lines 44 to 50 (commit 0e34292).
In particular, the variable shown in durabletask-netherite/src/DurableTask.Netherite/TransportLayer/EventHubs/EventHubsSender.cs, lines 93 to 95 (commit f8b5634), looks relevant. So if you're trying to opt into the "blob optimization" earlier (I think that was your suggestion), then perhaps you can make that threshold smaller. On the BlobReceiver side, I see there's some concurrency control in the semaphore (AsynchronousStorageReadMaxConcurrency) that limits the amount of Azure Storage concurrency, which perhaps could be naively increased as well.
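To illustrate the two ideas, here is a hypothetical sketch with made-up names; the real fields and defaults live in the BlobBatchSender/BlobBatchReceiver files linked above, so treat this only as a picture of the shape of the knobs:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the knobs discussed above, not the actual Netherite code.
public class BlobOffloadSketch
{
    private readonly int maxEventHubsPayloadBytes; // lower => offload to blobs sooner
    private readonly SemaphoreSlim readThrottle;   // caps concurrent blob downloads

    public BlobOffloadSketch(int maxEventHubsPayloadBytes = 512 * 1024,
                             int maxConcurrentBlobReads = 10)
    {
        this.maxEventHubsPayloadBytes = maxEventHubsPayloadBytes;
        this.readThrottle = new SemaphoreSlim(maxConcurrentBlobReads);
    }

    // Send side: decide whether a serialized batch goes through Event Hubs directly
    // or is written to a blob and referenced by a small event instead.
    public bool ShouldOffloadToBlob(int serializedBatchBytes)
        => serializedBatchBytes > this.maxEventHubsPayloadBytes;

    // Receive side: bound how many blob reads are allowed to run in parallel.
    public async Task<byte[]> ReadBlobThrottledAsync(Func<Task<byte[]>> download)
    {
        await this.readThrottle.WaitAsync();
        try { return await download(); }
        finally { this.readThrottle.Release(); }
    }
}
```

Lowering the byte threshold corresponds to opting into the blob path earlier; raising the semaphore count corresponds to bumping AsynchronousStorageReadMaxConcurrency.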
David, thanks. I was wondering if these blobs from the blob sender/receiver COULD go to a different storage account, which might give us faster throughput without affecting the single storage account used for the backend's other operations.
@ericleigh007: It may be technically possible to have a storage account just for these blobs, but I worry that's a bigger change than it sounds, and even if I could provide it in a private package, I'm not sure I could make it official in the main branch. That said, I can definitely try to provide you with a private package that tweaks some of the perf knobs I described above. To avoid mixing too many variables, though, I'd like to first make a bit more progress here (#383 (comment)), and once that error seems resolved (I think we're close), I'll look to provide some tweaks to these settings.
@ericleigh007: I think it is perhaps too early to fine-tune the parameters without better insight into what is causing the delays. I have taken a look at the telemetry for the two-day period and the orchestration you mentioned:
It seems that 99% of all orchestrations finish within 4.26 seconds. The interesting question is what happens to the others? There is a pretty long tail (max execution time is about 15 seconds). Importantly though, these 'slow' orchestrations are not a constant occurrence over the two days observed. Rather, they all originate from just three distinct time periods, of lengths 20 minutes, 7 minutes, and 6 seconds respectively. I still have to dig deeper to find out what is causing the slowness during those times, but I would be surprised if it is related to the EH throughput. I suspect it is due to bugs or other faults that cause partitions to restart.
My estimate is around 16MB for what we call the Collecting path (so orchestrators starting with Collecting...); smaller for the other flows, but I think 16MB IS a decent estimate for "frequent" use. We also have one or two orchestrators that take larger hunks of data, and one that is maybe as large as 50MB sometimes, which is a return of 25,000 "items" in a single Service Bus trigger. I also did some quick experimentation with "hidden" compression/decompression, and I wondered why the current persistence stores don't use this, or at least offer it as an option.
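What I experimented with is roughly this kind of app-level wrapper (a rough sketch, assuming the payload round-trips through JSON; the helper names are mine, not anything in the framework):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using Newtonsoft.Json;

public static class PayloadCompression
{
    // Hypothetical app-level compression: pass the base64 string as the
    // orchestrator/activity input and inflate it again on the other side.
    public static string Compress(object payload)
    {
        byte[] raw = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(payload));
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        return Convert.ToBase64String(output.ToArray());
    }

    public static T Decompress<T>(string compressed)
    {
        using var input = new MemoryStream(Convert.FromBase64String(compressed));
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return JsonConvert.DeserializeObject<T>(reader.ReadToEnd());
    }
}
```

Base64 adds roughly 33% on top of the compressed bytes, which gives back some of the gzip savings, but for repetitive JSON the net result is usually still a big reduction.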
@sebastianburckhardt: we don't have stress tests with large inputs such as the ones described above, do we? I suppose it would be easy to modify them to generate and use several MB of random data? |
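Something along these lines might do as a starting point for generating the payloads (hypothetical helper, seeded so runs are reproducible):

```csharp
using System;
using System.Text;

public static class StressPayload
{
    // Hypothetical helper: build an N-megabyte pseudo-random string to use as an
    // orchestration/activity input in a stress test.
    public static string RandomPayload(int megabytes, int seed = 42)
    {
        var rng = new Random(seed);
        const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        int length = megabytes * 1024 * 1024;
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            sb.Append(alphabet[rng.Next(alphabet.Length)]);
        }
        return sb.ToString();
    }
}
```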
We have switched to Netherite for our production workloads and we are quite pleased with the results in our testing.
However, it appears that for some payload sizes, the latency of calling orchestrators and activities is greater than with the same code on the default backend, and we'd like to explore whether there are ways to bring this more in line.
Our main codebase is centered on change feeds from Cosmos DB; the change feed immediately calls an orchestrator that then calls activities based on the state of the document. I have collected some basic A/B data for the time the change feed function takes, and it appears the timing is longer with Netherite.
This is a simple test in which I wasn't able to properly isolate all the factors, but I wanted to know whether any of you, especially @sebastianburckhardt, might know of a reason. The graph below illustrates the longer latencies. The blue line represents the request duration of our main change feed with the standard backend, while the red line is the duration when Netherite is used. The Netherite durations are spread over a wider range and are, overall, at least twice the duration of the standard backend, it would seem. The data is normalized so that the scale is the percentage of total requests.
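For reference, the normalization is essentially the following (a sketch only; the 250 ms bucket width and the way durations are collected are illustrative, not exactly what we ran):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LatencyHistogram
{
    // Bucket request durations (ms) and express each bucket as a percentage of
    // total requests, so the two backends can be overlaid on the same scale.
    public static Dictionary<int, double> Normalize(IEnumerable<double> durationsMs, int bucketMs = 250)
    {
        var durations = durationsMs.ToList();
        return durations
            .GroupBy(d => (int)(d / bucketMs) * bucketMs)
            .ToDictionary(g => g.Key, g => 100.0 * g.Count() / durations.Count);
    }
}
```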
Thanks!