Large amounts of warnings from FASTER.KV #383

ericleigh007 · 2024-04-18T20:05:36Z

We have transitioned to Netherite in our production program and for the most part things are going well, although we don't have anywhere near as much experience as with the standard backend.

We have found in some cases, we experience a good number of errors and warnings. The following seem to be generated from FASTER, not Netherite itself.

Part00 query 52eaf82R1809 attempt 2024-04-18T09:25:04.4847863Z enumeration failed with exception System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at FASTER.core.FasterKV`2.HandleOperationStatus[Input,Output,Context](FasterExecutionContext`3 opCtx, PendingContext`3& pendingContext, OperationStatus operationStatus, AsyncIOContext`2& request)
   at DurableTask.Netherite.Faster.FasterKV.ReadOnQuerySessionAsync(String instanceId, CancellationToken cancellationToken) in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 59
   at DurableTask.Netherite.Faster.FasterKV.QueryEnumeratedStates(EffectTracker effectTracker, PartitionQueryEvent queryEvent, IEnumerator`1 enumerator, Int32 pageSize, TimeSpan timeBudget, DateTime attempt)+MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 1047

Part05.35545933428    query 52eaf82R1796 attempt 2024-04-18T09:25:00.1713018Z enumeration failed with exception System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at FASTER.core.FasterKV`2.HandleOperationStatus[Input,Output,Context](FasterExecutionContext`3 opCtx, PendingContext`3& pendingContext, OperationStatus operationStatus, AsyncIOContext`2& request)
   at DurableTask.Netherite.Faster.FasterKV.ReadOnQuerySessionAsync(String instanceId, CancellationToken cancellationToken) in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 59
   at DurableTask.Netherite.Faster.FasterKV.QueryEnumeratedStates(EffectTracker effectTracker, PartitionQueryEvent queryEvent, IEnumerator`1 enumerator, Int32 pageSize, TimeSpan timeBudget, DateTime attempt)+MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 1047

We've received these with Part02, Part05, Part00, and Part13 as the prefix, over a short period, and then the errors were not seen again.

What can be the cause of this please?

-thank you

The text was updated successfully, but these errors were encountered:

TedHartMS · 2024-04-18T21:49:05Z

This is coming from within the Dictionary when the FASTER session is adding a pending IO context to it. Could two threads be using the session at the same time?

ericleigh007 · 2024-04-18T21:58:26Z

We're just using Netherite as a backend; no knowledge of how it is used, or what we, at the user level could be doing that would cause that.

We appreciate any hints, thanks.

TedHartMS · 2024-04-19T05:04:03Z

Whoops, I meant that for @sebastianburckhardt

ericleigh007 · 2024-05-03T21:23:20Z

Hi @davidmrdavid not to give you more work, but we see this one a lot, and at warning level, so unless we want to hide "warning" logs from Netherite, it will amp up our logs considerably.

Furthermore, "perhaps" it is something that should cause concern ;)

Thank you.

davidmrdavid · 2024-05-03T23:16:14Z

not to give you more work,

the job is the job :) .

@TedHartMS: regarding this -

This is coming from within the Dictionary when the FASTER session is adding a pending IO context to it. Could two threads be using the session at the same time?

The calling code on our end seems to be here (and it matches eric's stack trace above):

durabletask-netherite/src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs

Lines 50 to 71 in f8b5634

    
           async ValueTask<FasterKV<Key, Value>.ReadAsyncResult<EffectTracker, Output, object>> ReadOnQuerySessionAsync(string instanceId, CancellationToken cancellationToken) 
        
           { 
        
               await availableQuerySessions.WaitAsync(); 
        
               try 
        
               { 
        
                   bool success = this.idleQuerySessions.TryTake(out var session); 
        
                   this.partition.Assert(success, "available sessions must be larger than or equal to semaphore count"); 
        
                   try 
        
                   { 
        
                       var result = await this.querySessions[session].ReadAsync(TrackedObjectKey.Instance(instanceId), token: cancellationToken).ConfigureAwait(false); 
        
                       return result; 
        
                   } 
        
                   finally 
        
                   { 
        
                       this.idleQuerySessions.Add(session); 
        
                   } 
        
               } 
        
               finally 
        
               { 
        
                   availableQuerySessions.Release(); 
        
               } 
        
           }

From what I can see - the sessions are obtained from a ConcurrentBag, which I expect should guarantee only 1 session can be used per thread. So, unless I'm missing something, I don't think there's multiple threads on the same session; or do you see something different? Are there alternative explanations? If not - we can look to add some more logs to help diagnose.

ericleigh007 · 2024-05-08T11:16:58Z

Update; becoming more critical as we are getting something like 50k messages in 30 minutes in production related to this.
All warnings, but how many warnings == 1 error? ;)
Is this something that is going to be investigated somewhat soon?

@davidmrdavid can this possibly be fixed by the upgrade of faster core?

davidmrdavid · 2024-05-08T22:12:44Z

Hi @ericleigh007: I sync'ed with @TedHartMS and may have an explanation and fix for this error. The details are here: #397

In short - this may be a Netherite bug, an issue with how it's calling FASTER. Please see the details in the PR, which I'll try to get reviewed soon.

davidmrdavid · 2024-05-09T00:05:12Z

@ericleigh007: I'll wait for the CI to pass in #397 and, if it does, I'll look to issue you with a private release. I have a suspicion that the change I've made in that PR is correct, but possibly not as performant (which may be convenient in your case if you're having throttling issues?). In any case - if we notice a significant perf drop, I have ideas for how to address that without affecting correctness, though the PR will be more complex (which is fine).

I'll keep you posted.

davidmrdavid · 2024-05-09T04:36:27Z

@ericleigh007: I have published the PR above as a private package here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.5.2-clientQueryFix.1

To install it, you'll need to add the following key to your nuget.config:

<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />

I suspect the PR may not be as performant as the original implementation (though that suffered from race conditions, we think). Would you be willing to give it a try and let us know if at least you see this error go away? From there, if performance is a concern, I can refactor the PR into something more complex that should bring any lost perf back.

ericleigh007 · 2024-05-10T19:58:47Z

IN OUR production environment, we still get the IndexOutOfRange exception. However in this smaller environment where I can do the tests, we have not been able to duplicate that one.

In the test, the scale is smaller, we only get the ObjectDisposedException.

The changes in packages in the smaller environment are shown below:

Package	BEFORE	AFTER
DurableTask.Netherite	1.4.2	1.5.2-clientQueryFix.1
DurableTask.Netherite.AzureFunctions	1.5.2	1.5.2
Webjobs.Extension.DurableTask	2.13.1	2.13.2

BEFORE 25k traces with severity level warning
AFTER 7k traces with severity level warning

I have decided to send you the traces in your email, as they have internal information such as names of the resources in them.

davidmrdavid · 2024-05-10T22:29:32Z

Thanks @ericleigh007:

So, if I understand correctly, without the private package you get ~25k warnings and with the package you get ~7k; but given the difference in traffic between the two environments, we can't be certain that reduction in warnings is due to the private package. Do I have that right?

Also, is the ObjectDisposedException the error reported here: #396 or is it another issue? I just don't see the ObjectDisposedException keyword in the trace you've shared.

davidmrdavid · 2024-05-15T07:02:44Z

Hey @ericleigh007 - I took a look at those ObjectDisposedExceptions - I have made a change to my PR here (#397) that I think should ensure the semaphore is disposed correctly. I'm building a private package now to share with you, I'll keep you posted in this thread.

davidmrdavid · 2024-05-15T08:27:59Z

@ericleigh007: here's a new package: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.5.2-clientQueryFix.2

same instructions as before, but the version is now 1.5.2-clientQueryFix.2. Please let me know if the ObjectDisposedException continues

microsoft-github-policy-service bot added the Needs: Triage 🔍 label Apr 18, 2024

davidmrdavid added needs investigation and removed Needs: Triage 🔍 labels Apr 19, 2024

ericleigh007 mentioned this issue May 8, 2024

Error in orchestrator: Netherite backend failed to start: Operations per second is over the account limit. #396

Open

davidmrdavid mentioned this issue May 8, 2024

Complete FASTER sessions before releasing them during query processing #397

Merged

davidmrdavid mentioned this issue May 15, 2024

Perhaps longer latencies for Netherite for medium-sized payloads #384

Open

davidmrdavid closed this as completed in #397 Jun 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large amounts of warnings from FASTER.KV #383

Large amounts of warnings from FASTER.KV #383

ericleigh007 commented Apr 18, 2024

TedHartMS commented Apr 18, 2024

ericleigh007 commented Apr 18, 2024

TedHartMS commented Apr 19, 2024

ericleigh007 commented May 3, 2024

davidmrdavid commented May 3, 2024

ericleigh007 commented May 8, 2024 •

edited

Loading

davidmrdavid commented May 8, 2024

davidmrdavid commented May 9, 2024

davidmrdavid commented May 9, 2024

ericleigh007 commented May 10, 2024 •

edited

Loading

davidmrdavid commented May 10, 2024

davidmrdavid commented May 15, 2024

davidmrdavid commented May 15, 2024

Large amounts of warnings from FASTER.KV #383

Large amounts of warnings from FASTER.KV #383

Comments

ericleigh007 commented Apr 18, 2024

TedHartMS commented Apr 18, 2024

ericleigh007 commented Apr 18, 2024

TedHartMS commented Apr 19, 2024

ericleigh007 commented May 3, 2024

davidmrdavid commented May 3, 2024

ericleigh007 commented May 8, 2024 • edited Loading

davidmrdavid commented May 8, 2024

davidmrdavid commented May 9, 2024

davidmrdavid commented May 9, 2024

ericleigh007 commented May 10, 2024 • edited Loading

davidmrdavid commented May 10, 2024

davidmrdavid commented May 15, 2024

davidmrdavid commented May 15, 2024

ericleigh007 commented May 8, 2024 •

edited

Loading

ericleigh007 commented May 10, 2024 •

edited

Loading