Improve error handling for FASTER IO completion callbacks #349
…s prior to callback.
Some minor questions
}
catch (Exception ex)
{
    this.BlobManager.StorageTracer?.FasterStorageError($"{nameof(CancelAllRequests)} for access id={id} failed during FASTER completion callback", ex);
Would it make sense to re-throw here? Aren't we swallowing the exception?
There is no one to meaningfully catch exceptions here, and the partition has already been terminated (that is why the requests are being cancelled). So at this point it is just about tracing.
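To illustrate the reply above, here is a minimal sketch of the "trace, don't re-throw" pattern when invoking completion callbacks during cancellation. All types here (`CompletionCallback`, `PendingRequest`, the `traceError` delegate) are hypothetical stand-ins, not the Netherite or FASTER types, and the sketch reflects the code as it stood at this point in the review.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-ins for illustration only; not the actual Netherite/FASTER types.
delegate void CompletionCallback(uint errorCode, uint numBytes, object context);

record PendingRequest(CompletionCallback Callback, uint NumBytes, object Context);

class Demo
{
    static readonly ConcurrentDictionary<long, PendingRequest> pending = new();

    // Cancels every pending request. An exception thrown by a callback is traced
    // and swallowed, because nothing upstream could meaningfully handle it.
    static void CancelAllRequests(Action<string, Exception> traceError)
    {
        foreach (long id in pending.Keys)
        {
            if (pending.TryRemove(id, out PendingRequest request))
            {
                try
                {
                    // A nonzero error code signals failure to the callback.
                    request.Callback(uint.MaxValue, request.NumBytes, request.Context);
                }
                catch (Exception ex)
                {
                    traceError($"{nameof(CancelAllRequests)} for access id={id} failed during completion callback", ex);
                }
            }
        }
    }

    static void Main()
    {
        pending[1] = new PendingRequest((e, n, c) => throw new InvalidOperationException("callback failed"), 16, "ctx");
        pending[2] = new PendingRequest((e, n, c) => Console.WriteLine($"callback ok, error code {e}"), 16, "ctx");

        // The failing callback for id=1 does not prevent id=2 from being cancelled.
        CancelAllRequests((msg, ex) => Console.WriteLine($"TRACE: {msg}: {ex.Message}"));
    }
}
```

Re-throwing inside the loop would abort the cancellation of the remaining requests, which is one practical reason to trace and continue.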
.ContinueWith((Task t) =>
{
    if (this.pendingReadWriteOperations.TryRemove(id, out ReadWriteRequestInfo request))
    {
        if (t.IsFaulted)
        {
            this.BlobManager?.StorageTracer?.FasterStorageProgress($"StorageOpReturned AzureStorageDevice.ReadAsync id={id} (Failure)");
            request.Callback(uint.MaxValue, request.NumBytes, request.Context);
        }
        else
        {
            this.BlobManager?.StorageTracer?.FasterStorageProgress($"StorageOpReturned AzureStorageDevice.ReadAsync id={id}");
            request.Callback(0, request.NumBytes, request.Context);
        }
    }
}, TaskContinuationOptions.ExecuteSynchronously);
// we are not awaiting this task because it uses FASTER's callback mechanism
// when the access is completed.
I can't immediately see why it's safe to remove this logic. Mind elaborating? Perhaps it's related to your comment that this task isn't awaited, but I did not fully follow that either.
The logic was not removed, just moved elsewhere. These callbacks now happen right after the access completes. It should be functionally mostly the same, but I prefer to avoid the tricky task continuations and write out straight-line code where possible.
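A rough sketch of what "straight-line code" can look like in place of a `ContinueWith` continuation: await the access inside a `try`/`catch`, then invoke the callback directly. The names `ReadFromStorageAsync`, `ReadAndCompleteAsync`, and `CompletionCallback` are hypothetical and only illustrate the shape; this is not the actual change in the PR.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical callback shape for illustration: (error code, byte count, context).
delegate void CompletionCallback(uint errorCode, uint numBytes, object context);

class Demo
{
    // Hypothetical stand-in for the storage access; fails on demand.
    static async Task ReadFromStorageAsync(bool fail)
    {
        await Task.Delay(10);
        if (fail)
        {
            throw new InvalidOperationException("storage read failed");
        }
    }

    // Straight-line version: the callback runs right after the access completes,
    // with a nonzero error code if the access threw.
    static async Task ReadAndCompleteAsync(long id, bool fail, uint numBytes, CompletionCallback callback, object context)
    {
        uint errorCode = 0;
        try
        {
            await ReadFromStorageAsync(fail);
        }
        catch (Exception ex)
        {
            Console.WriteLine($"read id={id} failed: {ex.Message}");
            errorCode = uint.MaxValue;
        }

        callback(errorCode, numBytes, context);
    }

    static async Task Main()
    {
        CompletionCallback cb = (e, n, c) => Console.WriteLine($"callback: error code {e}, {n} bytes");
        await ReadAndCompleteAsync(1, fail: false, 512, cb, "ctx");
        await ReadAndCompleteAsync(2, fail: true, 512, cb, "ctx");
    }
}
```

The demo awaits the task only so the program does not exit early; a caller that relies purely on the callback for completion, as the code comment in the diff describes, would not need to await it.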
src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs
if (this.underLease)
{
    this.SingleWriterSemaphore.Release();
}
nit: please add a comment on why this is important to do at this point
Also, do we not have a forceful semaphore release if it hangs for too long?
> please add comment on why this is important to do at this point

ok

> do we not have a forceful semaphore release if it hangs for too long?

The hang check is somewhere else. We don't need it here since the entire object (including the semaphore) gets disposed when the partition is terminated. Also, I have not seen Azure Storage hanging recently.
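For readers unfamiliar with the pattern under discussion, the following is a minimal sketch of a conditional single-writer semaphore that is released on every exit path. It assumes the semaphore is acquired only when `underLease` is set and that release happens in a `finally` block; neither is confirmed by the diff excerpt, and `WriteAsync` is a hypothetical helper.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Demo
{
    // Allows at most one writer at a time.
    static readonly SemaphoreSlim singleWriterSemaphore = new SemaphoreSlim(1, 1);

    // Hypothetical helper: serialize the write only when operating under a lease,
    // and always release the semaphore, even if the write fails.
    static async Task WriteAsync(bool underLease, Func<Task> write)
    {
        if (underLease)
        {
            await singleWriterSemaphore.WaitAsync();
        }

        try
        {
            await write();
        }
        finally
        {
            // Without this release, a failed write would block every later writer.
            if (underLease)
            {
                singleWriterSemaphore.Release();
            }
        }
    }

    static async Task Main()
    {
        try
        {
            await WriteAsync(true, () => Task.FromException(new InvalidOperationException("write failed")));
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine($"first write failed: {ex.Message}");
        }

        // The second write can still acquire the semaphore because the first one released it.
        await WriteAsync(true, () => Task.CompletedTask);
        Console.WriteLine("second write completed");
    }
}
```

The sketch has no timeout on `WaitAsync`, which matches the reply above: hang detection is handled elsewhere, and disposing the owning object covers the termination case.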
src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs
…nonzero error code.
Sorry for the churn, but I did a substantial refactor of this. The error handling path is simplified compared to before. A significant change is that I am no longer calling cancellation callbacks into FASTER at all. I noticed that those cancellation callbacks were causing a lot of exceptions, and it turns out they are not needed.
…ion is terminated
After running more tests I added two more changes:
…ues with hanging dispose calls
# Conflicts: # src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs
Two changes: