-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Errors/Broken operations during rolling upgrade of clusters from 1.3 to 2.0 #2228
Comments
Upon investigation,
|
Looks like this was just discovered few days back - #2168 |
Hi @saikaranam-amazon, thank you for filing this issue. Currently, the best option is likely to do a full-restart update to migrate the major versions. |
Hi @saikaranam-amazon are you also running the tarball installation? Thank you. |
@scrawfor99 - Tested using docker containers with one node on 1.3 and other on 2.0 OS versions.
|
You ran the containers separately or with a compose file? I do not know if it matters but I am trying to pinpoint the cause of the issue as well as what starts are impacted. The original discussion seemed to suggest that maybe Kubernetes install was not impacted. |
[Update] These issues are not related |
I am posting some curl requests for easy reuse--I am not sure the best way of working on diagnosing the root cause so have been trying to reproduce the steps to get the error. I do not know if these are useful to anyone else if not you can ignore them: Create new index with two primary: curl -X PUT --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1' -H 'Content-Type: application/json' -d '{"settings": {"index": {"number_of_shards": 3,"number_of_replicas": 1}}, "mappings": {"properties": {"age": {"type": "integer"} } }, "aliases":{"sample-alias1": {}} } Search from node 0: curl -X GET --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1/_search' -H 'Content-Type: application/json' -d '{"from": 0}' |
They are different issues. BWC failures(#2221) fails on upgrading from 2.4.0 to 3.0.0, which is new. This issue is about failing rolling upgrade from 1.x to 2.x. |
This appears to be the commit that introduced the version update system: #454 If it cannot upgrade from 1.x to 2.x it may have to do with the switch from v6 security versioning to v7 as I believe that also occurred during that change? |
Update: Looking further it seems like the swap from 1.3 to 2.0 that introduced the naming convention change form openDistro to OpenSearch is likely causing the search which starts on the 1.3 node to then throw a class not found exception on the 2.0 node since opendistroforeleasticsearch.security.user.User became opensearch.security.user.User. So I think that it starts on the 1.3 node, operates search then passes the request to the 2.0 node without acknowledging the refactor between the versions. This would also align with rolling upgrades working for 1.x->1.x and 2.x->2.x but not for 1.x->2.x since this is when the refactor took place. I believe a quick work around may be to just add a catch for the ClassNotFoundException that just retries the search request using the new naming convention and combines the results from the 1.3 nodes with the 2.0 nodes. @cliu123 what do you think? I am not sure what the commands to go through the steps above ^ ( I have not yet had a chance to watch the demo video of how the requests are processed) to test this or where the search calls come from (this is where you will want to add the catch) but I hope that this points people in the right direction since I am 80% sure this is the cause. |
@scrawfor99 Thanks for the analysis. Here is my thinking based on code read. It seems like 1.3 is always operating in ODFE backward compatibility mode while "sending" data in the write path as it doesn't know whether the destination node is ES or OpenSearch node. (link). public ObjectStreamClass replace(final ObjectStreamClass desc) {
final String name = desc.getName();
if (name.startsWith(OS_PACKAGE)) {
return nameToDescriptor.computeIfAbsent(name, s -> {
SpecialPermission.check();
// we can't modify original descriptor as it is cached by ObjectStreamClass, create clone
final ObjectStreamClass clone = AccessController.doPrivileged(
(PrivilegedAction<ObjectStreamClass>)() -> SerializationUtils.clone(desc)
);
DescriptorNameSetter.setName(clone, s.replace(OS_PACKAGE, ODFE_PACKAGE));
return clone;
});
}
return desc;
}
} To support 1.3 to 2.x rolling upgrade, 2.x will have to understand 1.3 classes. I understand that we can handle ClassNotFoundException, but do you think we can avoid this additional try/catch if we handle in the descriptor itself (like 1.3 was doing)? protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
ObjectStreamClass desc = super.readClassDescriptor();
final String name = desc.getName();
if (name.startsWith(ODFE_PACKAGE)) {
desc = ObjectStreamClass.lookup(Class.forName(name.replace(ODFE_PACKAGE, OS_PACKAGE)));
logger.debug("replaced descriptor name from [{}] to [{}]", name, desc.getName());
}
return desc;
}
} One question - it seems like ideally we should be able to support rolling upgrade from ODFE to 2.x too and only this issue seems to be the only wrinkle. Should we consider supporting ODFE to 2.x rolling upgrade as well as many users may want to directly upgrade to 2.x? Thoughts? |
@krishna-ggk Here's some findings from a cursory look at this issue: Upgrade from ODFE to OS 2.x is not supported due to an issue with JDK 17 and illegal reflective access. See relevant PR here: #1691. setiah goes in length on the issue here: #1653 (comment) and vrozov mentions technical limitations of the solution mentioned by nknize here: #1259 (comment)
I see that the 1.3 node writes this for backwards compatibility with ODFE nodes. I'm able to reproduce with a |
@cwperks Thanks for bringing focus on this and the same applies to OS 1.x to OS 2.x upgrades too now. It appears that any reflective based solution is then ruled out. |
@krishna-ggk We may be able to conditionally apply the ODFE re-write logic depending on the mixture of nodes in the cluster for 1.3. ClusterInfoHolder (https://github.com/opensearch-project/security/blob/1.3/src/main/java/org/opensearch/security/configuration/ClusterInfoHolder.java#L86) has a mechanism for determining node versions within the cluster and it appears the logic to serialize the user object is located here ( security/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java Line 236 in f431ec2
|
The other potential solution is to avoid reflection to avoid the illegal reflective access with JDK 17 if possible. |
My understanding of this issue is that it is being worked on but at the same time there is unlikely to be a trivial solution. I do not know that a fix for 2.4.0 is necessarily in the making so much as an early patch should the solution be as complicated as it appears. Correct me if I am wrong @cliu123 @cwperks |
I'll be looking into this issue. Looks like the problem we are running into is related to this code: security/src/main/java/org/opensearch/security/auditlog/impl/AbstractAuditLog.java Lines 592 to 598 in 672f4d7
Audit logs get passed around, and we are seeing this issue with interop between 1.3 and 2.X because the thread context is sent over the wire with an object with an updated namespace. Because that fancy code in Next steps:
[1]
|
Thanks @peternied. One question - while I understand that the current failure was for an instance of |
@krishna-ggk You are correct to worry that there might be other sources of failure. We are at the initial steps of this investigation and we will need an automated way to validate the cluster before. |
This is not going to be a part of 2.4.0. I've removed the 2.4.0 lable. |
I've created a way to test this scenario and others like it, but haven't been able to look into this bug itself |
@peternied I will look into this today and see if there's a possibility of getting the min node version from the ClusterInfoHolder to conditionally apply the serialization logic that replaces the package name with opendistro package name. If the min node in the cluster is OS 1, then no need to perform the rewrite logic. If the min node in the cluster is ODFE, then apply the rewrite logic. |
@cwperks This issue will happen even when min node in the cluster is OS 1. As reported in #2168 the cluster was originally setup with OS 1.x and never upgraded from ODFE. Also attaching the patch which fixed the rolling upgrade. |
@ronniepg Understood and that's what this PR is targeting: #2268 There is logic in 1.3 to keep backwards compatibility with ODFE that will always rewrite package names from |
Thanks the reference to #2268. So with this fix, the OS 1.x cluster will need to be upgraded to the latest 1.x version which has this fix before rolling upgrade to 2.x. The patch which i attached earlier was for OS 2.x. ODFE nodes and OS 1 nodes will still be able to talk to each other since like you mentioned OS 1.3 does both the serialization/deserialization. Once the cluster is fully upgraded to OS 1, then it can be rolling upgraded to OS 2. I guess the fix in OS 1.x code would be preferred to not keep any of 'opendistro' in the 2.x code base. |
Closing this as it was addressed in #2268 |
What is the bug?
Errors/Broken search results during rolling upgrade of clusters from 1.3 to 2.0
How can one reproduce the bug?
Notice that the failed shard count is 1
What is the expected behavior?
Rolling upgrade of clusters should complete without any issues.
What is your host/environment?
The text was updated successfully, but these errors were encountered: