Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large amounts of warnings from FASTER.KV #383

Closed
ericleigh007 opened this issue Apr 18, 2024 · 13 comments · Fixed by #397
Closed

Large amounts of warnings from FASTER.KV #383

ericleigh007 opened this issue Apr 18, 2024 · 13 comments · Fixed by #397

Comments

@ericleigh007
Copy link

We have transitioned to Netherite in our production program and for the most part things are going well, although we don't have anywhere near as much experience as with the standard backend.

We have found in some cases, we experience a good number of errors and warnings. The following seem to be generated from FASTER, not Netherite itself.

Part00 query 52eaf82R1809 attempt 2024-04-18T09:25:04.4847863Z enumeration failed with exception System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at FASTER.core.FasterKV`2.HandleOperationStatus[Input,Output,Context](FasterExecutionContext`3 opCtx, PendingContext`3& pendingContext, OperationStatus operationStatus, AsyncIOContext`2& request)
   at DurableTask.Netherite.Faster.FasterKV.ReadOnQuerySessionAsync(String instanceId, CancellationToken cancellationToken) in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 59
   at DurableTask.Netherite.Faster.FasterKV.QueryEnumeratedStates(EffectTracker effectTracker, PartitionQueryEvent queryEvent, IEnumerator`1 enumerator, Int32 pageSize, TimeSpan timeBudget, DateTime attempt)+MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 1047

Part05.35545933428    query 52eaf82R1796 attempt 2024-04-18T09:25:00.1713018Z enumeration failed with exception System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at FASTER.core.FasterKV`2.HandleOperationStatus[Input,Output,Context](FasterExecutionContext`3 opCtx, PendingContext`3& pendingContext, OperationStatus operationStatus, AsyncIOContext`2& request)
   at DurableTask.Netherite.Faster.FasterKV.ReadOnQuerySessionAsync(String instanceId, CancellationToken cancellationToken) in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 59
   at DurableTask.Netherite.Faster.FasterKV.QueryEnumeratedStates(EffectTracker effectTracker, PartitionQueryEvent queryEvent, IEnumerator`1 enumerator, Int32 pageSize, TimeSpan timeBudget, DateTime attempt)+MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 1047

We've received these with Part02, Part05, Part00, and Part13 as the prefix, over a short period, and then the errors were not seen again.

image

What can be the cause of this please?

-thank you

@TedHartMS
Copy link

This is coming from within the Dictionary when the FASTER session is adding a pending IO context to it. Could two threads be using the session at the same time?

@ericleigh007
Copy link
Author

We're just using Netherite as a backend; no knowledge of how it is used, or what we, at the user level could be doing that would cause that.

We appreciate any hints, thanks.

@TedHartMS
Copy link

Whoops, I meant that for @sebastianburckhardt

@ericleigh007
Copy link
Author

Hi @davidmrdavid not to give you more work, but we see this one a lot, and at warning level, so unless we want to hide "warning" logs from Netherite, it will amp up our logs considerably.

Furthermore, "perhaps" it is something that should cause concern ;)

Thank you.

@davidmrdavid
Copy link
Member

not to give you more work,

the job is the job :) .

@TedHartMS: regarding this -

This is coming from within the Dictionary when the FASTER session is adding a pending IO context to it. Could two threads be using the session at the same time?

The calling code on our end seems to be here (and it matches eric's stack trace above):

async ValueTask<FasterKV<Key, Value>.ReadAsyncResult<EffectTracker, Output, object>> ReadOnQuerySessionAsync(string instanceId, CancellationToken cancellationToken)
{
await availableQuerySessions.WaitAsync();
try
{
bool success = this.idleQuerySessions.TryTake(out var session);
this.partition.Assert(success, "available sessions must be larger than or equal to semaphore count");
try
{
var result = await this.querySessions[session].ReadAsync(TrackedObjectKey.Instance(instanceId), token: cancellationToken).ConfigureAwait(false);
return result;
}
finally
{
this.idleQuerySessions.Add(session);
}
}
finally
{
availableQuerySessions.Release();
}
}

From what I can see - the sessions are obtained from a ConcurrentBag, which I expect should guarantee only 1 session can be used per thread. So, unless I'm missing something, I don't think there's multiple threads on the same session; or do you see something different? Are there alternative explanations? If not - we can look to add some more logs to help diagnose.

@ericleigh007
Copy link
Author

ericleigh007 commented May 8, 2024

Update; becoming more critical as we are getting something like 50k messages in 30 minutes in production related to this.
All warnings, but how many warnings == 1 error? ;)
Is this something that is going to be investigated somewhat soon?

@davidmrdavid can this possibly be fixed by the upgrade of faster core?

@davidmrdavid
Copy link
Member

Hi @ericleigh007: I sync'ed with @TedHartMS and may have an explanation and fix for this error. The details are here: #397

In short - this may be a Netherite bug, an issue with how it's calling FASTER. Please see the details in the PR, which I'll try to get reviewed soon.

@davidmrdavid
Copy link
Member

@ericleigh007: I'll wait for the CI to pass in #397 and, if it does, I'll look to issue you with a private release. I have a suspicion that the change I've made in that PR is correct, but possibly not as performant (which may be convenient in your case if you're having throttling issues?). In any case - if we notice a significant perf drop, I have ideas for how to address that without affecting correctness, though the PR will be more complex (which is fine).

I'll keep you posted.

@davidmrdavid
Copy link
Member

@ericleigh007: I have published the PR above as a private package here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.5.2-clientQueryFix.1

To install it, you'll need to add the following key to your nuget.config:

<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

I suspect the PR may not be as performant as the original implementation (though that suffered from race conditions, we think). Would you be willing to give it a try and let us know if at least you see this error go away? From there, if performance is a concern, I can refactor the PR into something more complex that should bring any lost perf back.

@ericleigh007
Copy link
Author

ericleigh007 commented May 10, 2024

IN OUR production environment, we still get the IndexOutOfRange exception. However in this smaller environment where I can do the tests, we have not been able to duplicate that one.

In the test, the scale is smaller, we only get the ObjectDisposedException.

The changes in packages in the smaller environment are shown below:

Package BEFORE AFTER
DurableTask.Netherite 1.4.2 1.5.2-clientQueryFix.1
DurableTask.Netherite.AzureFunctions 1.5.2 1.5.2
Webjobs.Extension.DurableTask 2.13.1 2.13.2

BEFORE 25k traces with severity level warning
AFTER 7k traces with severity level warning

I have decided to send you the traces in your email, as they have internal information such as names of the resources in them.

@davidmrdavid
Copy link
Member

Thanks @ericleigh007:

So, if I understand correctly, without the private package you get ~25k warnings and with the package you get ~7k; but given the difference in traffic between the two environments, we can't be certain that reduction in warnings is due to the private package. Do I have that right?

Also, is the ObjectDisposedException the error reported here: #396 or is it another issue? I just don't see the ObjectDisposedException keyword in the trace you've shared.

@davidmrdavid
Copy link
Member

Hey @ericleigh007 - I took a look at those ObjectDisposedExceptions - I have made a change to my PR here (#397) that I think should ensure the semaphore is disposed correctly. I'm building a private package now to share with you, I'll keep you posted in this thread.

@davidmrdavid
Copy link
Member

@ericleigh007: here's a new package: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.5.2-clientQueryFix.2

same instructions as before, but the version is now 1.5.2-clientQueryFix.2. Please let me know if the ObjectDisposedException continues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants