Improve ByteBufferGuard in Java 11 [LUCENE-8780] #9824
Uwe Schindler (@uschindler) (migrated from JIRA) Patch: LUCENE-8780.patch
.../org/apache/lucene/store/ByteBufferGuard.java | 42 ++++++++++++----------
.../org/apache/lucene/store/TestMmapDirectory.java | 13 +++----
2 files changed, 31 insertions(+), 24 deletions(-)
diff --git a/lucene/core/src/java/org/apache/lucene/store/ByteBufferGuard.java b/lucene/core/src/java/org/apache/lucene/store/ByteBufferGuard.java
index 95fa17d..e3ee12d 100644
--- a/lucene/core/src/java/org/apache/lucene/store/ByteBufferGuard.java
+++ b/lucene/core/src/java/org/apache/lucene/store/ByteBufferGuard.java
@@ -17,8 +17,9 @@
package org.apache.lucene.store;
import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
-import java.util.concurrent.atomic.AtomicInteger;
/**
* A guard that is created for every {@link ByteBufferIndexInput} that tries on best effort
@@ -26,8 +27,11 @@ import java.util.concurrent.atomic.AtomicInteger;
* of this is used for the original and all clones, so once the original is closed and unmapped
* all clones also throw {@link AlreadyClosedException}, triggered by a {@link NullPointerException}.
* <p>
- * This code tries to hopefully flush any CPU caches using a store-store barrier. It also yields the
- * current thread to give other threads a chance to finish in-flight requests...
+ * This code tries to hopefully flush any CPU caches using a full fence (volatile write) and
+ * <em>eventually</em> see the state change using opaque reads. It also yields the current thread
+ * to give other threads a chance to finish in-flight requests...
+ *
+ * @see <a href="http://gee.cs.oswego.edu/dl/html/j9mm.html">Doug Lea: Using JDK 9 Memory Order Modes</a>
*/
final class ByteBufferGuard {
@@ -43,11 +47,19 @@ final class ByteBufferGuard {
private final String resourceDescription;
private final BufferCleaner cleaner;
- /** Not volatile; see comments on visibility below! */
- private boolean invalidated = false;
-
- /** Used as a store-store barrier; see comments below! */
- private final AtomicInteger barrier = new AtomicInteger();
+ @SuppressWarnings("unused")
+ private volatile boolean invalidated = false;
+
+ /** Used to access the volatile variable with different memory semantics
+ * (volatile write for barrier with memory_order_seq_cst semantics, opaque reads with memory_order_relaxed semantics): */
+ private static final VarHandle VH_INVALIDATED;
+ static {
+ try {
+ VH_INVALIDATED = MethodHandles.lookup().findVarHandle(ByteBufferGuard.class, "invalidated", boolean.class);
+ } catch (ReflectiveOperationException e) {
+ throw new ExceptionInInitializerError(e);
+ }
+ }
/**
* Creates an instance to be used for a single {@link ByteBufferIndexInput} which
@@ -63,15 +75,9 @@ final class ByteBufferGuard {
*/
public void invalidateAndUnmap(ByteBuffer... bufs) throws IOException {
if (cleaner != null) {
- invalidated = true;
- // This call should hopefully flush any CPU caches and as a result make
- // the "invalidated" field update visible to other threads. We specifically
- // don't make "invalidated" field volatile for performance reasons, hoping the
- // JVM won't optimize away reads of that field and hardware should ensure
- // caches are in sync after this call. This isn't entirely "fool-proof"
- // (see #8462 discussion), but it has been shown to work in practice
- // and we count on this behavior.
- barrier.lazySet(0);
+ // This call should flush any CPU caches and as a result make
+ // the "invalidated" field update visible to other threads:
+ VH_INVALIDATED.setVolatile(this, true);
// we give other threads a bit of time to finish reads on their ByteBuffer...:
Thread.yield();
// finally unmap the ByteBuffers:
@@ -82,7 +88,7 @@ final class ByteBufferGuard {
}
private void ensureValid() {
- if (invalidated) {
+ if (cleaner != null && (boolean) VH_INVALIDATED.getOpaque(this)) {
// this triggers an AlreadyClosedException in ByteBufferIndexInput:
throw new NullPointerException();
}
diff --git a/lucene/core/src/test/org/apache/lucene/store/TestMmapDirectory.java b/lucene/core/src/test/org/apache/lucene/store/TestMmapDirectory.java
index 3d6e532..17e13f1 100644
--- a/lucene/core/src/test/org/apache/lucene/store/TestMmapDirectory.java
+++ b/lucene/core/src/test/org/apache/lucene/store/TestMmapDirectory.java
@@ -22,7 +22,7 @@ import java.nio.file.Path;
import java.util.Random;
import java.util.concurrent.CountDownLatch;
-import org.junit.Ignore;
+//import org.junit.Ignore;
/**
* Tests MMapDirectory
@@ -43,12 +43,12 @@ public class TestMmapDirectory extends BaseDirectoryTestCase {
assumeTrue(MMapDirectory.UNMAP_NOT_SUPPORTED_REASON, MMapDirectory.UNMAP_SUPPORTED);
}
- @Ignore("This test is for JVM testing purposes. There are no guarantees that it may not fail with SIGSEGV!")
+ //@Ignore("This test is for JVM testing purposes. There are no guarantees that it may not fail with SIGSEGV!")
public void testAceWithThreads() throws Exception {
- for (int iter = 0; iter < 10; iter++) {
+ final Random random = random();
+ for (int iter = 0; iter < 300; iter++) {
Directory dir = getDirectory(createTempDir("testAceWithThreads"));
IndexOutput out = dir.createOutput("test", IOContext.DEFAULT);
- Random random = random();
for (int i = 0; i < 8 * 1024 * 1024; i++) {
out.writeInt(random.nextInt());
}
@@ -58,11 +58,12 @@ public class TestMmapDirectory extends BaseDirectoryTestCase {
final byte accum[] = new byte[32 * 1024 * 1024];
final CountDownLatch shotgun = new CountDownLatch(1);
Thread t1 = new Thread(() -> {
+ final Random rnd = random();
try {
shotgun.await();
- for (int i = 0; i < 10; i++) {
+ for (int i = 0; i < 15000; i++) {
clone.seek(0);
- clone.readBytes(accum, 0, accum.length);
+ clone.readBytes(accum, 0, rnd.nextInt(accum.length) + 1);
}
} catch (IOException | AlreadyClosedException ok) {
// OK
Uwe Schindler (@uschindler) (migrated from JIRA) I created a pull request for easier review and perf testing (easy checkout of the branch from my GitHub repo): apache/lucene-solr#658
Uwe Schindler (@uschindler) (migrated from JIRA) That's the result after 20 runs of wikimediumall with 6 searcher threads (with ParallelGC) on Mike's lucenebench:
The total runtime of each run did not change, always approx. 280s per run, patched and unpatched. Not sure how to interpret this.
Michael Sokolov (@msokolov) (migrated from JIRA) I don't have a good theory, but I was curious so I ran a few tests, and one thing I saw is that if you limit to a single searcher thread, you see only the negative side of this distribution, or at least it becomes more negative.
Dawid Weiss (@dweiss) (migrated from JIRA) Every time I read about those memory ordering models I seem to get it, and two weeks later I'm confused again... So I don't know if what you propose is right or wrong. The volatile mode always seemed to me to regulate relationships among other volatile reads/writes (and not to specify what happens to other variable reads/writes not covered by volatiles)... perhaps you can piggyback fences on top of that, but maybe you can't. We could fire an e-mail to the concurrency dev list so that more knowledgeable folks take a peek and let us know? Just saying... Also: "until the JVM optimized away the plain read access to the boolean" – when did that happen, do you have a reference to an OpenJDK JIRA issue?
Uwe Schindler (@uschindler) (migrated from JIRA) Hi, in general the new code is a bit slower, which is more visible with shorter runtimes, so it looks like optimizing away the VarHandle simply takes longer, and this degrades average performance. I don't know how long the benchmark runs as warmup.
That's the same for me. But our use case seems to be pretty well explained in the above quotes by Doug Lea. It clearly says that a volatile write is stronger than an opaque read; the volatile write is only eventually visible to the opaque read (but never to a plain read, if it was optimized away). I was already talking with the Hotspot guys: in short, there is no way to make ByteBufferGuard "safe" unless you use a volatile read or an acquire/release fence (which can also be implemented with VarHandles). When using that, I was unable to crash my VM, but the slowdown is definitely there. Also, manually putting in fences does not help.
The main difference between a plain read and an opaque read (as used here) is not that the produced CPU instructions are different; the difference is only how the optimizer may optimize the code. With plain reads, a simple spin loop will never end, as the Hotspot compiler clearly sees that the value can't change, and because the read is not volatile, opaque or anything stronger, it will remove it for this thread. This does not only happen in spin loops, it also happens in our code if it's executed often enough. This is not a bug, it's intended. I tuned the testcase to crash always: just insert a loop of 10,000 reads BEFORE you unmap, then it fails every time (in Java 8). With opaque reads it also did this, but not reproducibly, so there is an improvement.
IMHO, the reason why we see a slowdown for some queries could come from that: Hotspot is removing the "if (invalidated)" check, as it cannot change in the current thread. With opaque reads it can't do that, so the code is slower in the long term. I will try to get some assembly output later, I just have to install hsdis.dll.
To make it bulletproof, we have to wait for official support in the ByteBuffer API to unmap (https://bugs.openjdk.java.net/browse/JDK-4724038); with our checks we cannot make it safe. Another approach would be to expand the SIGSEGV signal handler checks in the JVM to throw an InternalError (they do this now if the operating system truncates the file and so the mapped area changes). I don't know why they don't generally do a SIGSEGV check and, if it happens inside a DirectBuffer, just throw an Exception... (possibly delayed as in the truncated-file case). So we have to decide what we want to do as a workaround:
Robert Muir (@rmuir) (migrated from JIRA) I imagine any slowdown only impacts stuff doing lots of tiny reads vs. reading big byte[]. Seems like we just decide on the correct tradeoff. Personally I lean towards more aggressive safety: the user is using Java, they don't expect SIGSEGV. Just like they don't expect to run out of mmaps because it's tied to GC, and they don't expect all their searches bottlenecked on one file handle due to stupid synchronization on a relative file pointer we don't even care about.
Uwe Schindler (@uschindler) (migrated from JIRA) I did a second patch that uses AtomicBoolean instead of VarHandles; the underlying code is the same. Here is version #2: apache/lucene-solr@master...uschindler:jira/LUCENE-8780-v2 The effect was worse, so it's not an option. But this brings me to the conclusion: the actual calls using the other memory modes are not the problem. The VarHandles and AtomicBooleans are also correctly optimized away, but it looks like, because of the complexity of the optimizations at the lowest level, it takes much longer until the code gets fast, and some optimizations are not applied at all (you cannot remove opaque reads, because the memory model eventually makes changes from other threads visible). Here are the results from the AtomicBoolean:
As said before, removing the null check does not matter at all; it just makes the variance on short-running tests less evident, but the average is identical.
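For illustration only, here is a minimal sketch of how the AtomicBoolean-based variant (v2) described above might look; it is inferred from the description, not taken from the linked branch, and the class and method names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical condensed sketch of the AtomicBoolean variant (v2).
final class AtomicBooleanGuardSketch {
  private final AtomicBoolean invalidated = new AtomicBoolean(false);

  void invalidate() {
    invalidated.set(true); // plain set() is a volatile write (full fence)
  }

  void ensureValid() {
    // getOpaque() (since Java 9) eventually observes the write above,
    // but avoids the ordering cost of a volatile read on the hot path.
    if (invalidated.getOpaque()) {
      throw new NullPointerException(); // mapped to AlreadyClosedException by the caller
    }
  }
}
```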
In #8462 we added ByteBufferGuard to protect MMapDirectory from crashing the JVM with SIGSEGV when you close and unmap the mmapped buffers of an IndexInput while another thread is accessing it. The idea was to do a volatile write access to flush the caches (to trigger a full fence) and set a non-volatile boolean to true. All accesses would check the boolean and stop the caller from accessing the underlying ByteBuffer. This worked most of the time, until the JVM optimized away the plain read access to the boolean (you can easily see this after some runtime of our by-default ignored testcase).
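For reference, a condensed sketch of that original guard, based on the lines removed in the patch above (simplified; the real class also handles the cleaner and the actual buffer unmapping):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Condensed from the removed ("-") lines of the patch above; simplified for illustration.
final class OldByteBufferGuardSketch {
  /** Not volatile; the JIT may eventually optimize plain reads of this flag away. */
  private boolean invalidated = false;

  /** Used only as a store-store barrier. */
  private final AtomicInteger barrier = new AtomicInteger();

  void invalidate() {
    invalidated = true;
    barrier.lazySet(0); // store-store barrier, hoping to make the write visible to readers
  }

  void ensureValid() {
    if (invalidated) { // plain read: not guaranteed to ever observe the write
      throw new NullPointerException(); // triggers AlreadyClosedException in ByteBufferIndexInput
    }
  }
}
```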
With master on Java 11 we can improve the whole thing. Using VarHandles you can use any access type when reading or writing the boolean. After reading Doug Lea's explanation <http://gee.cs.oswego.edu/dl/html/j9mm.html> and some testing, I was no longer able to crash my JDK (even after running for minutes while unmapping byte buffers).
The approach is the same: we do a full-fenced write (standard volatile write) when we unmap, then we yield the thread (to let in-flight reads in other threads finish) and then unmap all byte buffers.
On the test side (read access), instead of a plain read we use the new "opaque read". Opaque reads are the same as plain reads; they only have different ordering requirements. The main difference is explained by Doug like this: "For example in constructions in which the only modification of some variable x is for one thread to write in Opaque (or stronger) mode, X.setOpaque(this, 1), any other thread spinning in while(X.getOpaque(this)!=1){} will eventually terminate. Note that this guarantee does NOT hold in Plain mode, in which spin loops may (and usually do) infinitely loop – they are not required to notice that a write ever occurred in another thread if it was not seen on first encounter." And that's what we want to have: we don't want to do volatile reads, but we want to prevent the compiler from optimizing away our read of the boolean. So we want it to "eventually" see the change. Because of the much stronger volatile write, the cache effects should be visible even faster (like in our Java 8 approach; we have only improved the read side now).
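A minimal, self-contained illustration of that quote (a hypothetical demo class, not part of the patch): the spinning thread uses an opaque read through a VarHandle and is therefore guaranteed to eventually observe the volatile write; with a plain field read the loop could spin forever once the JIT hoists the read.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OpaqueSpinDemo {
  // Backing field; the VarHandle below controls the actual access mode used.
  private volatile boolean flag = false;

  private static final VarHandle FLAG;
  static {
    try {
      FLAG = MethodHandles.lookup().findVarHandle(OpaqueSpinDemo.class, "flag", boolean.class);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    OpaqueSpinDemo demo = new OpaqueSpinDemo();
    Thread spinner = new Thread(() -> {
      // Opaque read: required to eventually see the write from the main thread.
      // A plain (non-volatile, non-VarHandle) read could be hoisted out of the loop.
      while (!(boolean) FLAG.getOpaque(demo)) {
        Thread.onSpinWait();
      }
      System.out.println("observed flag change");
    });
    spinner.start();
    Thread.sleep(100);
    FLAG.setVolatile(demo, true); // full-fence write, as in the patch
    spinner.join();
  }
}
```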
The new code is much slimmer (theoretically we could also use an AtomicBoolean and its new getOpaque() method, but I wanted to prevent extra method calls, so I used a VarHandle directly). It's set up like this:
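The setup from the patch above, condensed into a small sketch (the real ensureValid() additionally checks cleaner != null, and the real invalidateAndUnmap() also yields and unmaps the buffers):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Condensed from the ByteBufferGuard changes in the patch above.
final class ByteBufferGuardSketch {
  @SuppressWarnings("unused")
  private volatile boolean invalidated = false;

  /** Volatile write for the barrier (seq_cst), opaque reads (relaxed) on the hot path. */
  private static final VarHandle VH_INVALIDATED;
  static {
    try {
      VH_INVALIDATED = MethodHandles.lookup()
          .findVarHandle(ByteBufferGuardSketch.class, "invalidated", boolean.class);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  void invalidate() {
    VH_INVALIDATED.setVolatile(this, true); // full fence; makes the change visible
  }

  void ensureValid() {
    if ((boolean) VH_INVALIDATED.getOpaque(this)) { // opaque read: eventually sees the write
      throw new NullPointerException(); // triggers AlreadyClosedException in ByteBufferIndexInput
    }
  }
}
```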
I had to tune our test a bit, as the VarHandles make it take longer until it actually crashes (since the optimizations kick in later). I also used a Random for the read lengths to prevent the optimizer from removing all the ByteBuffer reads. When we commit this, we can disable the test again (it takes approx. 50 secs on my machine).
I'd still like to see the differences between the plain read and the opaque read in production, so maybe @mikemccand or @rmuir can do a comparison with the nightly benchmarker?
Have fun, maybe @dweiss has some ideas, too.
Migrated from LUCENE-8780 by Uwe Schindler (@uschindler), updated Apr 29 2019
Attachments: LUCENE-8780.patch