Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: recover from trimmed chunk #2771

Merged
merged 1 commit into from
Jun 17, 2024
Merged

fix: recover from trimmed chunk #2771

merged 1 commit into from
Jun 17, 2024

Conversation

dordsor21
Copy link
Member

  • It's theoretically possible for the section FULL to return a null layer due to race condition with a trim operation
  • Locally cache result and if null, recover
  • I just had the error from Cannot load from char array because "getSection" is null #1592 again
  • This seems to have stopped the error, but adding logging did not log, so possibly some bigger bytecode changes?
  • Oh well

 - It's theoretically possible for the section FULL to return a null layer due to race condition with a trim operation
 - Locally cache result and if null, recover
 - I just had the error from #1592 again
 - This seems to have stopped the error, but adding logging did not log, so possibly some bigger bytecode changes?
 - Oh well
@dordsor21 dordsor21 requested a review from a team as a code owner June 8, 2024 15:46
@github-actions github-actions bot added the Bugfix This PR fixes a bug label Jun 8, 2024
@SirYwell
Copy link
Member

SirYwell commented Jun 9, 2024

Why is it possible for the chunk to get trimmed mid-operation?

@dordsor21
Copy link
Member Author

Why is it possible for the chunk to get trimmed mid-operation?

The same get chunk is used when processing as for future edits on that chunk so if/when the get chunk is trimmed elsewhere it's possible that the thread accessing the layer will not have visibility of the chunk having been trimmed (or reset). It's attempted to avoid this by setting the layer to EMPTY before setting the layer to null, but we cannot necessarily guarantee order of operation on that when compiled, and also cannot guarantee that it will have changed in whatever memory/cache level the chunk getting the layer is reading from.

It seems to be an error that goes away by itself and comes back occasionally which implies to me an order of operation (whether that's thread race or when it's written to X memory level) "issue"

@SirYwell
Copy link
Member

SirYwell commented Jun 9, 2024

The same get chunk is used when processing as for future edits on that chunk so if/when the get chunk is trimmed elsewhere it's possible that the thread accessing the layer will not have visibility of the chunk having been trimmed (or reset). It's attempted to avoid this by setting the layer to EMPTY before setting the layer to null, but we cannot necessarily guarantee order of operation on that when compiled, and also cannot guarantee that it will have changed in whatever memory/cache level the chunk getting the layer is reading from.

It seems to be an error that goes away by itself and comes back occasionally which implies to me an order of operation (whether that's thread race or when it's written to X memory level) "issue"

We can guarantee the order of operations by using the proper synchronization/memory model primitives. Looking through the code, most places that write EMPTY into the sections seem to synchronize on the owing object already. But in some places it synchronizes on the corresponding sectionLocks entry instead or there is no synchronization at all - maybe it would help to clean that up in some way? It even might be fine to drop any synchronization and use VarHandle and fences instead, not sure though.

@dordsor21
Copy link
Member Author

Yeah I looked at altering the synchronisation, however, we cannot perform any locking on the chunk object when attempting to access the layer for a get operation, as if the layer isn't loaded from the LevelChunkSection yet, there we do want to lock onto something, in this case being the lock we insert to the chunk - not an issue on paper (as they remove all locks and so we create our own semaphore tied to ThreadLocal), but potentially an issue on spigot? I'd have to look into where we lock onto that more I think

@NotMyFault NotMyFault requested a review from a team June 10, 2024 18:10
@dordsor21 dordsor21 requested a review from SirYwell June 10, 2024 18:10
Copy link
Member

@SirYwell SirYwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay for now, but I guess we need to rework the mechanics at some point

@dordsor21 dordsor21 merged commit 6a54c5b into main Jun 17, 2024
9 checks passed
@dordsor21 dordsor21 deleted the fix/recover-trimmed-chunk branch June 17, 2024 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bugfix This PR fixes a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants