-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix LongGCDisruption to be aware of log4j2 #20348
Conversation
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in elastic#20235 This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)). On top of that we have some new, evil and nasty tests.
@jasontedor all yours :) |
retest this please |
try { | ||
suspendedThreads = ConcurrentHashMap.newKeySet(); | ||
|
||
final String currentThreadNamme = Thread.currentThread().getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: currentThreadNamme
-> currentThreadName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bleskes. I left some comments.
@@ -35,7 +37,7 @@ | |||
|
|||
private static final Pattern[] unsafeClasses = new Pattern[]{ | |||
// logging has shared JVM locks - we may suspend a thread and block other nodes from doing their thing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the indentation of this comment is off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -77,6 +132,14 @@ public TimeValue expectedTimeToHeal() { | |||
|
|||
@SuppressWarnings("deprecation") // stops/resumes threads intentionally | |||
@SuppressForbidden(reason = "stops/resumes threads intentionally") | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this Javadoc comment is misplaced, I think that it has to be above the annotations to be considered a Javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
} else { | ||
throw new IllegalStateException("can't disrupt twice, call stopDisrupting() first"); | ||
} | ||
} | ||
|
||
private String stackTrace(Thread thread) { | ||
String result = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about return Arrays.stream(thread.currentThread().getStackTrace()).map(Object::toString).collect(Collectors.joining("\n"));
since StackTraceElement#toString
includes all the information that you're including, with some smarts for handling things like line numbers that do not exist, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx! changed
try { | ||
stoppingThread.join(getStoppingTimeoutInMillis()); | ||
} catch (InterruptedException e) { | ||
Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we still marching on after being interrupted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to throw an exception and stop the underlying thread
if (success == false) { | ||
// resume threads if failed | ||
resumeThreads(suspendedThreads); | ||
suspendedThreads = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to indicate higher up that the disruption failed to be applied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're here, it's because an exception was thrown, so I think we're good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, silly me. 😄
final CountDownLatch underLock = new CountDownLatch(1); | ||
final CountDownLatch pauseUnderLock = new CountDownLatch(1); | ||
final LockedExecutor lockedExecutor = new LockedExecutor(); | ||
final AtomicLong ops = new AtomicLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see the point of this variable, it's incremented but otherwise never read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to give the thread something to do
if (stoppingThread.isAlive()) { | ||
logger.warn("failed to stop node [{}]'s threads within [{}] millis. Stopping thread stack trace:\n {}" | ||
, disruptedNode, getStoppingTimeoutInMillis(), stackTrace(stoppingThread)); | ||
stoppingThread.interrupt(); // best effort; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where the interrupt flag is ever checked on the stopping thread, what am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing :) I added a check
@Override | ||
protected Pattern[] getUnsafeClasses() { | ||
return new Pattern[]{ | ||
Pattern.compile("LockedExecutor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"LockedExecutor"
-> LockedExecutor.class.getSimpleName()
?
@Override | ||
protected Pattern[] getUnsafeClasses() { | ||
return new Pattern[]{ | ||
Pattern.compile("LockedExecutor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"LockedExecutor"
-> LockedExecutor.class.getSimpleName()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
} | ||
} | ||
|
||
public void testNotBlockingUnsafeStackTraces() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test could use some comments explaining what is happening here, it's quite tricky. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added 🏫
Thx @jasontedor I addressed all points |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
retest this please |
Thx @jasontedor |
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235 This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)). On top of that we have some new, evil and nasty tests.
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235 This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)). On top of that we have some new, evil and nasty tests.
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235
This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).
On top of that we have some new, evil and nasty tests.