Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable tx timeouts, add tx debug logging, static DLL pattern, fix docs #3512

Merged
merged 1 commit into from
Jan 23, 2022

Conversation

HebaruSan
Copy link
Member

@HebaruSan HebaruSan commented Jan 11, 2022

Problems

I've been investigating several transaction-related issues lately.

  • Something about trying to enlist in multiple transactions (see Registry already enlisted with tx error during install #3502)
  • The manually installed DLL scan, which happens in a tx (see Manually-installed ModName.Unity.dll detected as ModName's DLL #3506)
    • Registry.DllPattern is compiled for every call to RegisterDll, which is not great because this regex will be evaluated on every DLL in GameData
    • A comment says the regex works great for things like GameData/Foo/Foo-1.2.dll, but that's wrong, it's for GameData/Foo/Foo.1.2.dll, the former would detect the identifier as Foo-1
    • EnlistTransaction is called in RegisterDll even if one of its validation checks decides we don't need to make any changes
  • Cloning an instance can throw an exception if it takes more than 10 minutes
  • Installing mods can throw the same exception if it takes more than 10 minutes
Unhandled exception:
System.Transactions.TransactionException: Failed to roll back. ---> System.IO.IOException: The directory is not empty.

   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.Directory.DeleteHelper(String fullPath, String userPath, Boolean recursive, Boolean throwOnTopLevelDirectoryNotFound, WIN32_FIND_DATA& data)
   at System.IO.Directory.Delete(String fullPath, String userPath, Boolean recursive, Boolean checkHost)
   at ChinhDo.Transactions.FileManager.Operations.CreateDirectory.Rollback()
   at ChinhDo.Transactions.FileManager.TxEnlistment.Rollback(Enlistment enlistment)
   --- End of inner exception stack trace ---
   at ChinhDo.Transactions.FileManager.TxEnlistment.Rollback(Enlistment enlistment)
   at System.Transactions.VolatileEnlistmentAborting.EnterState(InternalEnlistment enlistment)
   at System.Transactions.TransactionStateAborted.EnterState(InternalTransaction tx)
   at System.Transactions.EnlistableStates.Timeout(InternalTransaction tx)
   at System.Transactions.Bucket.TimeoutTransactions()
   at System.Transactions.BucketSet.TimeoutTransactions()
   at System.Transactions.TransactionTable.ThreadTimer(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.TimerQueueTimer.CallCallback()
   at System.Threading.TimerQueueTimer.Fire()
   at System.Threading.TimerQueue.FireNextTimers() 

Causes

TransactionManager has a severe design flaw: Its MaximumTimeout property is 10 minutes, limits all transactions, and can't be changed by application code through normal means. When you start a transaction, a background thread is started that sleeps for 10 minutes and then tries to abort the transaction.

If you have a 1-hour operation that you want to have atomic rollback capabilities and you instantiate TransactionScope with a timeout of 1 hour plus padding, you're out of luck, all you get is 10 minutes. 🤦 And you won't find out about this until your code fails in production. 🤦 🤦 And you have no way to fix this problem once it's discovered 🤦 🤦 🤦, other than editing a machine.config XML file in your .NET Framework folder 🤦 🤦 🤦 🤦, which would be completely unrealistic to attempt across CKAN's entire user base.

When the background thread tries to abort the transaction, the ChinhDo file manager tries to restore the disk to how it was before the transaction started, deleting new files and restoring deleted ones. However, this cannot succeed on Windows because the new files will still be locked by the foreground thread while the original transaction is in progress! So the (misguided, inappropriate) attempt to abort fails, and the whole application crashes. Needless to say this is not what we would ever want, especially when the original operation is actually working fine, chugging along towards its completion.

Changes

  • log.Debug calls are added so we can track how and when we enlist and de-enlist with transactions (this is what I added for the debug component uploaded in Registry already enlisted with tx error during install #3502)
  • The manually installed DLL regex is now static, so it will not be compiled for every DLL in GameData
    • The comment now mentions the right file name format
    • The logic for translating a path to an identifier is refactored into GameInstance.DllPathToIdentifier and shared with Netkan's PluginsValidator to reduce code duplication
    • Registry.RegisterDll now only calls EnlistTransaction if all of its validation checks pass and it is going to make a change
  • Since the transaction timeout can't be disabled directly, we set it to 49 days (the maximum allowed in .NET Core) by using reflection to set private properties of TransactionManager. This will ensure that cloning an instance and installing mods will not arbitrarily fail after 10 minutes.
    • "But what if we really do end up in a deadlock or an infinite loop and we should abort?"
      Even if this happened, attempting to abort a ChinhDo-based tx from another thread will always fail on Windows. The transaction timeout simply can't solve such a problem. We would need to identify the deadlock or infinite loop and fix it through normal bug reporting means.

Fixes #3227.
Fixes #3294.
Fixes #3455.
Fixes #3483.

@HebaruSan HebaruSan added Bug Something is not working as intended Core (ckan.dll) Issues affecting the core part of CKAN Pull request Windows Issues specific for Windows labels Jan 11, 2022
@HebaruSan HebaruSan requested a review from DasSkelett January 11, 2022 18:56
@HebaruSan

This comment has been minimized.

@HebaruSan

This comment has been minimized.

Copy link
Member

@DasSkelett DasSkelett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried purposefully causing a timeout by setting timeoutMs to a lower value (and bypassing the MaximumTimeout < timeout check), but couldn't get it to work.
On the other hand, I also didn't get a timeout after 10 minutes, so I guess it does work.

@DasSkelett DasSkelett merged commit 6a3eba7 into KSP-CKAN:master Jan 23, 2022
@HebaruSan HebaruSan deleted the fix/tx-timeouts branch January 23, 2022 18:49
@HebaruSan
Copy link
Member Author

I tried purposefully causing a timeout by setting timeoutMs to a lower value (and bypassing the MaximumTimeout < timeout check), but couldn't get it to work. On the other hand, I also didn't get a timeout after 10 minutes, so I guess it does work.

In case you tried it on Linux, I'd expect different behavior because of differences in filesystem locking. If you tried it on Windows, then I dunno. 🤷

@DasSkelett
Copy link
Member

Did test on Linux, I expected the transaction thingy to still try to rollback the transaction, just not failing like on Windows.
But I guess there are more differences between the implementations.

@willflatt

This comment was marked as off-topic.

@HebaruSan
Copy link
Member Author

HebaruSan commented Apr 28, 2022

@willflatt Try a build that actually has this fix (Hubble was released 239 days before this change was merged):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment