Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add async/await support to TestKit and migrate tests to Async api to reduce racy failures #4072

Closed

Conversation

IgorFedchenko
Copy link
Contributor

@IgorFedchenko IgorFedchenko commented Nov 29, 2019

This PR is pretty big from the changed files count point of view. But the core idea is very simple:
we are blocking execution thread with Thread.Sleep each time when AwaitAssert or AwaitCondition is used. This is what #3854 is about. When lots of tests in multiple assemblies are running in parallel jobs on single machine, each test sometimes blocks execution thread. And there might be moments when too much threads are sleeping, which could lead to thread starvation, leak of free memory and so on.

So the idea is that until all Akka.NET tests are using blocking testkit API, each test that uses blocking api may become racy, because OS is becoming slow, Thread.Sleep takes much more time to awake then requested, etc.

Migration to async API should increase overall test performance pretty much, because this is the same as migration of ASP.NET Core server from blocking connections handling to async/await operations (in our case tests are the clients that taking resources from the system).

To make migration simpler, I marked sync method versions as obsolete (with compiler error enabled), and then fixed all compiler errors in solution. Well, almost all - there are MNTR tests, that are still using sync API right now (disabled compiler error for Obsolete methods), but I will update them later (and will add async specs handling to MNTR itself) if CI will show that there are less or none of racy failures after update.

Before merging this PR, I think I will remove Obsolete attributes from sync API methods, because there is nothing that much wrong with them, and they are pretty handy sometimes. At the moment, they are marked as Obsolete for simpler detection.

Close #3854
Close #3774

@IgorFedchenko
Copy link
Contributor Author

Some other racy spec issues may be closed if this PR will help.

@IgorFedchenko
Copy link
Contributor Author

I have issues with tests that are using EventFilter locally. Will try to fix this - ideally, all tests should be green in this PR (at least non-MNTR tests for now).

@IgorFedchenko
Copy link
Contributor Author

IgorFedchenko commented Nov 30, 2019

Now some tests are timing out. Locally I have interesting behavior: tests from Akka.Persistence.Tests namespace are running well under debugger (they are running one-by-one in that case), but when running without debugger in parallel (like 8-9 concurrent executions) they are blocking forever (until test runner will shut them down).

Under dev branch all is running fine of course.

Well, I might roll changes back or create separate PR with only Async API implementation and try to resolve each racy test separately... But still want to make this work as a whole.

UPDATE: It does not matter what tests are running under Akka.Persistence.Test - when there are 8+ parallel test executions, it is handing forever.

@IgorFedchenko
Copy link
Contributor Author

IgorFedchenko commented Nov 30, 2019

Need to make sure that I do not use AwaitSomethingAsync(...).Wait() instead of AwaitSomethingMethod(...) - this may cause deadlocks. Or consider using ConfigureAwait(false) for this scenarios, but using sync version should be cleaner.

@IgorFedchenko
Copy link
Contributor Author

Now here are the failures:

  1. On unit tests (only under Windows, both Full and Core frameworks):
  • Akka.Remote.Tests.RemotingSpec.Ask_does_not_deadlock , failed due to timeout while waiting for Result of Ask() task.
  1. MNTR tests:
  • Akka.Remote.Tests.MultiNode.TransportFailSpec.TransportFail_should_reconnect on the "first" node

I was just going to push update with Async api for IEventFilterApplier interface - let's see if/what tests will fail now.

@IgorFedchenko
Copy link
Contributor Author

Now failures changed.

MNTR under Windows:

  • A_restarted_quarantined_system_should_not_crash_the_other_system (both nodes)

Full Framework:

  • Akka.Remote.Tests.RemotingSpec.Ask_does_not_deadlock

.NET Core (Windows):

  • Akka.Persistence.Tests.AtLeastOnceDeliverySpec.AtLeastOnceDelivery_must_warn_about_unconfirmed_messages

Interesting that again all random failures are only under Windows.

@IgorFedchenko
Copy link
Contributor Author

So, looks like this update does not completely resolve racy tests issues.
Well, Ask_does_not_deadlock test for example is not affected by this changes at all (it did not use any of the updated methods from test kit).

So the challenge to fix all racy tests still exists, but this update should be helpful anyway.

Let me remove Obsolete attributes from sync method versions, and PR will be ready for review.

@IgorFedchenko
Copy link
Contributor Author

Have also implemented async API for IEventFilterApplier, and added some usages in tests to verify that they are working.
Did not update all sync calls in tests - @Aaronontheweb let me know if you think this would be helpful.

@Aaronontheweb
Copy link
Member

So the challenge to fix all racy tests still exists, but this update should be helpful anyway.

Have some thoughts on that...

Let me remove Obsolete attributes from sync method versions, and PR will be ready for review.

Sounds like a good idea.

@Aaronontheweb
Copy link
Member

Some thoughts on the racy unit tests - we've made huge efforts in the past to try to stamp these out, and heuristically there are a small number of things that can cause these flaky tests.

First, the flip rate report for all failed tests in our suite going back 30 days - https://dev.azure.com/dotnet/Akka.NET/_test/analytics?definitionId=84&contextType=build

You need to exclude the API approval specs, which dominate the top of this list, from the set of "racy" specs. These tests are doing their job when they fail - they're designed to alert the core development team when a public API change is being proposed so we don't accidentally sneak in a breaking change.

Other than those specs, we have the following classes of specs which fail often, in descending order:

1 - ClusterClient MNTR specs, the absolute worst offender.
2 - Akka.Tests.IO.TcpIntegrationSpec.Should_fail_writing_when_buffer_is_filled and Akka.Remote.Tests.RemotingSpec.Ask_does_not_deadlock - two racy I/O related specs, take up the number two seat.
3. Akka.Streams.Tests.Dsl.FlowSplitWhenSpec.SplitWhen_must_fail_substream_if_materialized_twice
4. A TON of Akka.Cluster.Sharding specs, including MNTR and normal specs
5. A hodge-podge of Akka.IO, persistence, and streams specs
6. Some MNTR specs
7. A ton of Akka.Streams and AtLeastOnceDelivery actor specs that all have 1-2 failures each.

What's killing us with our test suite is that it's large enough, concurrent enough, and dead-line sensitive enough that it doesn't take much to cause a single test to fail.

I'm a little reluctant to pull in these changes right away since they don't seem to address some of the underlying causes of the failures, but we do want the async API for testing and we want to use it.

The problem we need to solve initially is getting the suite to pass. I'm going to follow-up after lunch here with a decision tree for debugging and fixing these racy tests.

@IgorFedchenko
Copy link
Contributor Author

@Aaronontheweb

I'm a little reluctant to pull in these changes right away since they don't seem to address some of the underlying causes of the failures, but we do want the async API for testing and we want to use it.

To make async API available I can make separate PR that will just use those commits where I did API changes. And in this PR we can continue work on racy tests... Or just close it and start separate issue (if we do not have single one that aggregates the problem).

@Aaronontheweb
Copy link
Member

To make async API available I can make separate PR that will just use those commits where I did API changes. And in this PR we can continue work on racy tests... Or just close it

That's a great idea - let's do that.

@Aaronontheweb
Copy link
Member

cc @IgorFedchenko related: #3786

@IgorFedchenko
Copy link
Contributor Author

cc @IgorFedchenko related: #3786

Yup, looks like after making separate PR with async API addition, need to check out that issue and try to make some PRs for it. Also, very interesting how did you fix some of them.

@Aaronontheweb
Copy link
Member

@IgorFedchenko added my decision tree to here: #3786 (comment)

@IgorFedchenko
Copy link
Contributor Author

@IgorFedchenko added my decision tree to here: #3786 (comment)

Thanks, will take a look at this.

In the meanwhile - here is my PR with only async API addition: #4075 .

@IgorFedchenko
Copy link
Contributor Author

@Aaronontheweb So, there is no point in this PR anymore I guess? Core API updates were pushed to separate PR, and just updating most of the tests to async API does not solve the problem.

If so, feel free to close it. Will do farther work on #3786 and that separate PR (if requires any).

@Aaronontheweb
Copy link
Member

@IgorFedchenko changing some of the tests to use the async API, piecemeal, might help - but we should look at the tests one at a time. We would have never learned that without this PR though, so this work is definitely not wasted and we may even need it again. You did a great job here.

@IgorFedchenko IgorFedchenko deleted the async-await-in-testkit branch December 16, 2019 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants