WIP - Improved HA stability #9854
@timw if OrientDB supported database aliases, full database restoration would be easier: a database could be accessed through its alias name, and swapping the alias between databases would be the most effective way to restore a database. In distributed mode, the restored database should be propagated across the nodes. What are your thoughts about this? Another thing: Kind regards, Joao
Hi @timw, Your observations and fixes are spot on, and this work looks amazing. There are additional refactors that we are slowly working on on our side to make everything more stable, such as unifying the ODistributedDatabaseImpl lifecycle with the OSharedContext lifecycle, which should help make opening/closing/installing a database work more reliably! As for this pull request specifically, it has a lot of changes inside, and it would probably be better to apply some of them one by one to reduce risk and complexity, so could you please split this PR into multiple PRs? I will list some commits that I saw in this PR that could be merged straight away in an independent PR: 1: If you could open a PR for each of these commits, I could merge them straight away; for the rest of the commits, which require additional checking and testing, we can proceed later with review and merges. After merging some of the listed commits this PR will also become smaller. Regards
Happy to do some separate PRs for the independent bits (part of the reason for this WIP was to have that discussion).
800c193 to 1642367
@tglman - I've created the separate PRs now.
de16b79 to 8b6c929
I've rebased on the head of 3.1.x, with the separately merged commits removed.
8b6c929 to 36d255f
Split loading of enabled server plugins and starting of plugins to allow presence of a distributed server manager to be detected prior to network listeners being established and storage open. This allows guarding of constructs that require the distributed plugin to be present and running, which currently experience a race condition between the network listeners starting and the distributed plugin fully starting.
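The ordering described above can be illustrated with a minimal sketch. All class, interface and method names here are hypothetical and are not OrientDB's server API; the sketch only shows plugins being loaded (and the distributed manager detected) before listeners and storage come up, with the plugins started afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative, not OrientDB's actual classes.
final class TwoPhasePluginStartupSketch {
  interface Plugin {
    boolean isDistributedManager();
    void start();
  }

  private final List<Plugin> loaded = new ArrayList<>();
  final List<String> events = new ArrayList<>();
  private boolean distributedPluginEnabled;

  // Phase 1: register enabled plugins without starting them, and record
  // whether a distributed server manager is among them.
  void loadPlugins(List<Plugin> enabled) {
    for (Plugin plugin : enabled) {
      loaded.add(plugin);
      if (plugin.isDistributedManager()) {
        distributedPluginEnabled = true;
      }
    }
    events.add("loaded");
  }

  // Between the phases: listeners and storage come up already knowing
  // whether distributed-only constructs need to be guarded.
  void openListenersAndStorage() {
    events.add(distributedPluginEnabled ? "listeners:distributed-guarded" : "listeners:standalone");
  }

  // Phase 2: start the plugins that were loaded in phase 1.
  void startPlugins() {
    for (Plugin plugin : loaded) {
      plugin.start();
    }
    events.add("started");
  }

  boolean isDistributedPluginEnabled() {
    return distributedPluginEnabled;
  }
}
```

Calling loadPlugins, openListenersAndStorage and startPlugins in that order means the flag is already set when the listeners open, even though the distributed plugin has not finished starting.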
The current usages of openNoAuthenticate include cases (like DB delta/full syncs) that need to bypass not only auth checks but distributed online status.
Errors in unbound tasks in executors that are launched from common points (e.g. OrientDBEmbedded#execute) are hard to trace. This change allows a task ID to be associated with each execution, which will be reported on any exception, and if debug logging is enabled, a full stack trace identifying the launching call site will be attached.
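As a rough illustration of the idea (not the class added in this PR), the sketch below wraps submitted tasks so that any failure carries a task ID, and attaches the submitting call site only when tracing is enabled. The names TracingExecutorSketch and TracedTaskException and the boolean tracing flag are assumptions made for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: names are illustrative, not the PR's actual classes.
final class TracingExecutorSketch {
  private static final AtomicLong NEXT_ID = new AtomicLong();

  /** Failure wrapper that carries the ID of the task that threw. */
  static final class TracedTaskException extends RuntimeException {
    final String taskId;

    TracedTaskException(String taskId, Throwable failure) {
      super("Task [" + taskId + "] failed: " + failure, failure);
      this.taskId = taskId;
    }
  }

  private final ExecutorService delegate;
  private final boolean tracing; // stands in for "debug logging is enabled"

  TracingExecutorSketch(ExecutorService delegate, boolean tracing) {
    this.delegate = delegate;
    this.tracing = tracing;
  }

  <T> Future<T> submit(String name, Callable<T> task) {
    final String taskId = name + "-" + NEXT_ID.incrementAndGet();
    // Capturing a stack trace is not free, so only do it when tracing.
    final Throwable callSite =
        tracing ? new Throwable("Task [" + taskId + "] submitted from") : null;
    return delegate.submit(() -> {
      try {
        return task.call();
      } catch (Exception e) {
        TracedTaskException traced = new TracedTaskException(taskId, e);
        if (callSite != null) {
          traced.addSuppressed(callSite); // stack of the launching call site
        }
        throw traced;
      }
    });
  }
}
```

With tracing off, the only per-task overhead in this sketch is the wrapper lambda and the ID; the call-site stack is captured only when the flag is set.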
This improves logging and tracing consistency compared with general use of new Thread().
This prevents storage tasks that require the distributed status to be online from accessing distributed lifecycle objects that have not yet been set up (which shows up as NPEs during execution).
This avoids accesses to uninitialised distributed state during initial database setup from cluster.
Attempt to cancel a task before it executes before falling back to waiting on it. Also removes double logging of the execution exception, and avoids the problem where get cannot be called after cancel.
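A minimal sketch of this cancel-first pattern using the standard java.util.concurrent.Future API follows. The class name, the Outcome enum and the use of System.err in place of a logger are assumptions for illustration, and the PR's actual shutdown logic may differ.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical sketch: not the code from this PR.
final class CancelThenAwaitSketch {
  enum Outcome { CANCELLED, COMPLETED, FAILED }

  static Outcome cancelOrAwait(Future<?> task) throws InterruptedException {
    // cancel(false) never interrupts a running task. For a standard FutureTask
    // it returns true while the task is pending or running, false once done.
    if (task.cancel(false) || task.isCancelled()) {
      // get() on a cancelled future throws CancellationException, so skip it.
      return Outcome.CANCELLED;
    }
    try {
      // cancel() failed, which typically means the task has already completed.
      task.get();
      return Outcome.COMPLETED;
    } catch (ExecutionException e) {
      // Single place where the failure is reported (avoids logging it twice).
      System.err.println("Task failed: " + e.getCause());
      return Outcome.FAILED;
    }
  }
}
```

The point of the ordering is that get() is only reached when the cancellation attempt did not succeed, so the CancellationException path is never hit.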
Provide tracing overrides to aid in tracking async errors.
Allows registering live updates to succeed when distributed plugin not online.
View update uses distributed state, which can break if view update occurs during a distributed state change, breaking the update loop.
ODistributedDatabaseImpl construction registered the instance, leaking the this reference, and shut down the previous instance if present. The previous instance may not have been constructed fully however, so shutdown could NPE, resulting in the construction of the current instance aborting with uninitialised state, which would then be picked up by other threads finding it registered in the message service. This change externalises the construction into an atomic operation in the message service, and makes the state in the distributed database impl final. The warning about needing registration because of use "further in the call chain" appears to be spurious.
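The shape of this fix can be sketched as follows, under the assumption that the registry is a ConcurrentHashMap. MessageServiceSketch and DistributedDb are hypothetical stand-ins for the message service and ODistributedDatabaseImpl, and are not their real APIs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: stand-ins for the message service and the distributed database.
final class MessageServiceSketch {
  static final class DistributedDb {
    final String name; // set once in the constructor
    private volatile boolean shutdown;

    DistributedDb(String name) {
      this.name = name; // the constructor does not register "this" anywhere
    }

    void shutdown() {
      shutdown = true;
    }

    boolean isShutdown() {
      return shutdown;
    }
  }

  private final ConcurrentMap<String, DistributedDb> databases = new ConcurrentHashMap<>();

  // Construct, retire the previous instance and publish the new one as a single
  // atomic step, so other threads only ever see a fully constructed instance.
  DistributedDb registerDatabase(String name) {
    return databases.compute(name, (key, previous) -> {
      // If construction throws, the existing mapping is left unchanged.
      DistributedDb fresh = new DistributedDb(key);
      if (previous != null) {
        previous.shutdown();
      }
      return fresh;
    });
  }

  DistributedDb getDatabase(String name) {
    return databases.get(name);
  }
}
```

Because the new instance is built before the old one is shut down, a constructor failure leaves the previously registered instance in place and untouched.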
…eated. If the plugin isn't online, initialisation of newly created database will fail, resulting in a partially initialised database that will break when used (usually because the schema hasn't been loaded).
…ce conditions on startup.
36d255f to 61e3af5
@tglman - sorry, I didn't notice your last update, and got distracted with other work for a while so only just checked it today. It would also be good to get some guidance on what to do with the remaining work in this PR, once those other minor items are removed. We've been running this in production for some months now, and have had no issues or outages, so are pretty confident on stability.
Hi, Thank you for creating the specific PRs. I merged some of them and also ported the changes to 3.2.x (all the previously merged changes have already been ported to 3.2.x as well). I see 3 main sets of changes left in this PR:
For the executors, what is the scope of this specialized executor? Looking at the code, it seems to add additional logging to correlate an error with the source caller; am I getting it right? This is cool but not free, so I'm in favour of it, just maybe make it possible to turn the tracing off. I'm also happy to have this in 3.2.x, even though in that version all the executors are in the OrientDB context and should be turned off with it, and all the Orient global executors have been removed; I guess it is not a problem to use this tracing executor there as well.
For the view changes, for 3.1.x specifically I think we could merge them, but in 3.2.x there has been a big refactor of that logic which should already resolve the problem these changes fix, so I think there is no need to port the view changes to 3.2.x.
For the plugin loading change, I see it tries to make sure that detection of the distributed plugin is done earlier with the "distributedPluginEnabled" flag. I understand why this is done: I have had more than a few problems with database creation on initialization and with detection of the distributed environment at startup. This could be OK in 3.1.x, but here too I think we managed to solve the problem with a more structural approach in 3.2.x, so I do not know if this change is worth porting.
Thank you for this work anyway, it is impressively good!
I'll try to find some time to fix up the tracing executors soon - as you note we can avoid the callable construction when tracing is disabled (which is already detected in the
For the view and lifecycle changes, I'll need to re-test against the current 3.2 head to be sure (I see a lot of changes in the database/session/distributed area that I'll have to understand as well), but at the time of creating this PR my load test app could reliably break 3.2 in a lot of the same ways that 3.1 broke.
Given we're pretty stable on our 3.1 fork for now, the best way forward might be to re-do the stress testing on 3.2 and then port the still relevant fixes over to 3.2 and look at that in a separate PR.
Hi, I manually ported the executor tracing to 3.2.x and added a global configuration to enable it for debugging purposes. The rest of this PR has already been solved in 3.2.x, so I'm going to close this. Regards
Thanks for that.
What does this PR do?
This is an in-progress set of changes we're working on to increase the stability of OrientDB in HA/distributed deployments.
We're currently testing these fixes in pre-production and then moving them to production, but are opening them up now for discussion to see if there are potentially better ways to fix these issues, and allow broader testing to see if they introduce other issues (as we don't exhaustively test all features/APIs).
Motivation
We have encountered multiple outages in production due to various stability issues related to distributed operation.
These issues are numerous and somewhat interacting, and required the development of a dedicated stress test suite to replicate in development.
A full list of the unique issues encountered during testing is too long to enumerate, covering many (many) different exceptions, but a general overview of the core issues follows:
HA STATUS -servers query to effectively do what the enterprise agent APIs can achieve.
Related issues
There's a patch in this PR that artificially widens the distributed plugin startup time - we found that this allows easier reproduction of the production issues we observe. The cause of this is that the distributed plugin makes TCP connections to remote nodes during startup, which in our production case is cross AZ in AWS and thus has higher latencies than a local data-centre, which increases the window in which some of the startup issues occur.
During testing/resolving these issues, multiple enhancements were made to logging/tracing:
The stress test tool we have developed is now available at [https://github.com/indexity-io/orientdb-stress]. It's open source licensed, but we've kept it distinct from OrientDB as it needs to run across multiple versions/patches and that doesn't work well in-tree. It currently requires some of the patches in this branch to run successfully.
Additional Notes
There is an additional class of issues that this branch does not currently fix, which is related to storage closures during database installation while in-flight transactions have already opened the database. This causes transaction errors due to closed in-memory and on-disk structures, and often leads to cascading database installs, failed updates and (in rarer situations) lost updates.
We have some fixes designed for this issue, but are debating whether it's worth developing them further as they are not observed with the enterprise agent deployed (the full database sync in the enterprise edition does not close storage for backup/remote install, and so does not encounter these problems).
I've ported these changes to 3.2 and tested to the point that I'm fairly confident that they can be reproduced in that branch and solve the same issues - 3.2 already had some changes that 3.1 did not have that try to address some of these issues, but fail under stress testing without the fixes in this branch. I've paused that work for now until the 3.1 changes can be discussed and made stable.
There are additional issues in 3.2 that will need to be addressed (creating databases currently fails in a distributed cluster soon after startup) that cause problems for the stress test tool, and 3.2 also suffers from the issues with database storage closure under load (in particular I've observed lost updates on some occasions).
Checklist
[x] I have run the build using mvn clean package command
[x] My unit tests cover both failure and success scenarios