-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix weird panama integration failure #96715
Fix weird panama integration failure #96715
Conversation
Pinging @elastic/es-search (Team:Search) |
@ChrisHegarty do you think this "fix" is better used within the gradle starup scripts or something? I don't know how we can eagerly load a module and call the necessary privileged methods. |
This change makes me wonder if we really want: #96617 These weird privilege checks concern me a bit from the Painless context. Also, I am not sure how we can test this thing reliably. Since multiple tests utilize the same JVM processes, the module may already be loaded and cached. How could we test this? |
I have tried adding the following to
Within the
|
// 1. log4j needs privileged access to get environment variable values to print out the preferred bit size | ||
// 2. loopBound proved by the module makes a privileged call `jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK` | ||
// this is apparently cached as once its loaded, we don't run into issues again | ||
private static final float LOADING_VECTOR_MODULE = VectorUtil.dotProduct(new float[257], new float[257]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not pretty, but necessary. I think the change is ok. If the problem is causing sufficient failures it might be best to push the change as is. However, I would like to revisit this issue with the intention of eventually being in a position to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to a static init block, and not retain the computed value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping a reference to the float array does not prevent the module from being "unloaded".
I would just initialize the VectorUtil class and save a class reference in a static final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually computing a vector product should not be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, computation is needed because the issue is inside the JDK's VectorIntrinsics class, which is not initialised by just loading any API classes. You kinda need to operate on it :-(.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialising VectorUtil would work if VectorUtilPanamaProvider had something like this:
static {
int l = AccessController.doPrivileged((PrivilegedAction<Integer>) () ->
IntVector.fromArray(IntVector.SPECIES_128, new int[] { 1, 2, 3, 4 }, 0).length());
}
IntVector.fromArray is one way to trigger the initialisation of the JDK's VectorIntrinsics class.
Which is what I would propose, if Lucene wants to workaround this too ?
Oh, we have log4j permission problems here too. :-( The permission check flowing out from the JDK is a bug. I'll get that fixed upstream. |
This is a bit surprising to me. I was hoping that it would be sufficient. If you get time @benwtrent, then maybe try running with debug enabled to hopefully identify the exact codebase that denies the permission, e.g.
Also, if there is a relatively straightforward way to reproduce, then I will investigate too. But again, if this is causing much noise please feel free to merge, and we can continue to investigate later. |
I am going to create a rally track to reproduce. The trick is that the first time the panama module is loaded is when its used in a script. All our rally tracks index the vectors. The module is loaded at index time. |
OK, trying |
@ChrisHegarty so you have steps to reproduce. On a completely new node/cluster (never have ran vector search before). Do the following console commands: You may need to increase dims if your preferredBitWidth is larger.
|
Thank you @benwtrent |
Thanks for the reproducer @benwtrent - I managed to reproduce the issue. FWIW, here are the permissions that we would need to grant to scripts to allow this to work:
I'm not overly concerned if we grant scripts these permissions, but we should check with @rjernst. diff --git a/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy b/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
index fd26a66e3439..d644cda2b1f1 100644
--- a/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
+++ b/server/src/main/resources/org/elasticsearch/bootstrap/untrusted.policy
@@ -14,4 +14,8 @@ grant {
// needed IndyInterface selectMethod (setCallSiteTarget)
// TODO: clean this up / only give it to engines that really must have it
permission java.lang.RuntimePermission "getClassLoader";
+
+ // needed if scripts trigger the initial loading of VectorUtil
+ permission java.lang.RuntimePermission "getenv.*";
+ permission java.util.PropertyPermission "jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", "read";
}; |
FTR - the JDK bug https://bugs.openjdk.org/browse/JDK-8309727 |
The way I see it, we have two options here:
At this point, while crummy, I favour the former (option 1), as it is least impactful. |
// 1. log4j needs privileged access to get environment variable values to print out the preferred bit size | ||
// 2. loopBound proved by the module makes a privileged call `jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK` | ||
// this is apparently cached as once its loaded, we don't run into issues again | ||
private static final float LOADING_VECTOR_MODULE = VectorUtil.dotProduct(new float[257], new float[257]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to a static init block, and not retain the computed value?
For the other issue. Make sure that log4j is initialized early on startup. This is nothing we can prevent in Lucene. |
(may happen with mmap dir, too). |
Oh yeah. It kinda surprises me that we've not already initialised log4j at this point. Even logging something from DenseVectorFieldMapper clinit / init would work. |
With my workaround in apache/lucene#12362 we could just initialize the Lucene |
I have confirmed @uschindler 's fix. Thank you! I will update this PR once we have updated our Lucene version to a release that contains his fix. Pausing work on this for now. |
We have a few other classes that we eagerly initialise, early on in startup, see |
The code there looks fine. I like the It is only important to initialize logging before that, haven't closely looked into the code! |
Combining apache/lucene#12362 with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine. I cannot approve PR.
Maybe you should also initialize MMapDircetory.class as it could also cause log4j issues. But that one is initialized very early, and it is unlikely that painless does it indirect. On the other hand: as this is done before security manager is installed, you could remove the permission to access sun.misc classes :-) |
The issue:
VectorUtil will Load the Panama Vector incubator module if users have it enabled. However, if the first time its loaded is in a restricted security context, it will fail. There are two issues:
The fix:
Eagerly load the module if supported through combining apache/lucene#12362 with
org.elasticsearch.bootstrap.Elasticsearch#ensureInitialized.
Logging details & stack traces
Logger failure
JDK leak:
In painless here is the trace a user would see: