Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LongGCDisruption to be aware of log4j2 #20348

Merged
merged 5 commits into from
Sep 15, 2016

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Sep 6, 2016

LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.

LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in elastic#20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.
@bleskes bleskes added >bug >test Issues or PRs that are addressing/adding tests v5.0.0-beta1 labels Sep 6, 2016
@bleskes
Copy link
Contributor Author

bleskes commented Sep 6, 2016

@jasontedor all yours :)

@bleskes bleskes changed the title Fix LongGCDisruption to be aware of log4j Fix LongGCDisruption to be aware of log4j2 Sep 6, 2016
@jasontedor
Copy link
Member

retest this please

try {
suspendedThreads = ConcurrentHashMap.newKeySet();

final String currentThreadNamme = Thread.currentThread().getName();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: currentThreadNamme -> currentThreadName

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bleskes. I left some comments.

@@ -35,7 +37,7 @@

private static final Pattern[] unsafeClasses = new Pattern[]{
// logging has shared JVM locks - we may suspend a thread and block other nodes from doing their thing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the indentation of this comment is off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -77,6 +132,14 @@ public TimeValue expectedTimeToHeal() {

@SuppressWarnings("deprecation") // stops/resumes threads intentionally
@SuppressForbidden(reason = "stops/resumes threads intentionally")
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this Javadoc comment is misplaced, I think that it has to be above the annotations to be considered a Javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

} else {
throw new IllegalStateException("can't disrupt twice, call stopDisrupting() first");
}
}

private String stackTrace(Thread thread) {
String result = "";
Copy link
Member

@jasontedor jasontedor Sep 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about return Arrays.stream(thread.currentThread().getStackTrace()).map(Object::toString).collect(Collectors.joining("\n")); since StackTraceElement#toString includes all the information that you're including, with some smarts for handling things like line numbers that do not exist, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx! changed

try {
stoppingThread.join(getStoppingTimeoutInMillis());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we still marching on after being interrupted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to throw an exception and stop the underlying thread

if (success == false) {
// resume threads if failed
resumeThreads(suspendedThreads);
suspendedThreads = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to indicate higher up that the disruption failed to be applied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're here, it's because an exception was thrown, so I think we're good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, silly me. 😄

final CountDownLatch underLock = new CountDownLatch(1);
final CountDownLatch pauseUnderLock = new CountDownLatch(1);
final LockedExecutor lockedExecutor = new LockedExecutor();
final AtomicLong ops = new AtomicLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see the point of this variable, it's incremented but otherwise never read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to give the thread something to do

if (stoppingThread.isAlive()) {
logger.warn("failed to stop node [{}]'s threads within [{}] millis. Stopping thread stack trace:\n {}"
, disruptedNode, getStoppingTimeoutInMillis(), stackTrace(stoppingThread));
stoppingThread.interrupt(); // best effort;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where the interrupt flag is ever checked on the stopping thread, what am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing :) I added a check

@Override
protected Pattern[] getUnsafeClasses() {
return new Pattern[]{
Pattern.compile("LockedExecutor")
Copy link
Member

@jasontedor jasontedor Sep 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"LockedExecutor" -> LockedExecutor.class.getSimpleName()?

@Override
protected Pattern[] getUnsafeClasses() {
return new Pattern[]{
Pattern.compile("LockedExecutor")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"LockedExecutor" -> LockedExecutor.class.getSimpleName()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}
}

public void testNotBlockingUnsafeStackTraces() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test could use some comments explaining what is happening here, it's quite tricky. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added 🏫

@bleskes
Copy link
Contributor Author

bleskes commented Sep 14, 2016

Thx @jasontedor I addressed all points

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jasontedor
Copy link
Member

retest this please

@bleskes bleskes merged commit 8469c98 into elastic:master Sep 15, 2016
@bleskes bleskes deleted the gc_disrupt_log4j branch September 15, 2016 06:50
@bleskes
Copy link
Contributor Author

bleskes commented Sep 15, 2016

Thx @jasontedor

bleskes added a commit that referenced this pull request Sep 15, 2016
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.
bleskes added a commit that referenced this pull request Sep 15, 2016
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235

This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).

On top of that we have some new, evil and nasty tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test Issues or PRs that are addressing/adding tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants