Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

Merged
merged 2 commits into from
Apr 17, 2020

Conversation

cendhu
Copy link
Contributor

@cendhu cendhu commented Apr 15, 2020

Type of change

  • Bug fix

Description

The CouchDB creates an index in the background and responds to the REST API immediately.
Currently, we wait only for 2 seconds before failing the test. Within the 2 seconds, every 100 milliseconds, CouchDB is queried to ensure that the passed index is being built and can be used for the query. On a first success, the test is passed.

As we have only 2 vCPUs, 4 GB of RAM to run CI, it may not be sufficient enough for CouchDB to build an index within the 2 seconds. Further, if we run tests in parallel, the CouchDB could face CPU contention too. Note that the CouchDB consumes 3x higher CPU utilization as compared to the goleveldb.

Hence, we increase the wait time to 6 seconds and the query time to 1 second.

Additional details

It is bad to do the fix with trail and error. However, these errors are quite hard to regenerate at the local development environment as the resource allocations varies significantly (like 8 vCPUs, 32 GB ram, SSD vs 2 vCPU, 4 GB ram, SDD).

Related issues

FAB-17755

Signed-off-by: senthil <cendhu@gmail.com>
@cendhu cendhu requested a review from a team as a code owner April 15, 2020 06:08
@cendhu cendhu changed the title [LEDGER] couchDB UT flakes: increase waitFor couchdb index creation [LEDGER] fix couchDB UT flakes: increase waitFor couchdb index creation Apr 15, 2020
@cendhu
Copy link
Contributor Author

cendhu commented Apr 15, 2020

@btl5037 Let's rerun the test jobs for some number of times to gain some confidence. After that, we can merge if an increase in the wait time fixes the problem permanently. It is hard to reproduce the error in the local development environment due to differences in the resource allocation.

@lindluni
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lindluni
Copy link
Contributor

Still failing unfortunately :(

@cendhu
Copy link
Contributor Author

cendhu commented Apr 15, 2020

Yes 😞

Is it possible to get resource utilization of the VM? It would be good to see how much CPU, disk, and memory are utilized? If it is too much overloaded, fixing a waitFor time is hard. At least in my mac book pro, when the test is running, all cores are utilized fully though I have 8 cores.

@cendhu
Copy link
Contributor Author

cendhu commented Apr 16, 2020

@btl5037 While you collect the resource utilization profile, to test my hypothesis on the overloaded situation, I have added statecouchdb to the serial package list. With this change, the test shouldn't fail as 2 dedicated CPU core should be good enough. Let's see what happens and then we can continue our analysis.

@lindluni
Copy link
Contributor

@btl5037 While you collect the resource utilization profile, to test my hypothesis on the overloaded situation, I have added statecouchdb to the serial package list. With this change, the test shouldn't fail as 2 dedicated CPU core should be good enough. Let's see what happens and then we can continue our analysis.

I did try to use Microsoft Diagnostics, and there wasn't any useful information, I will go back and test the other method we discussed today. I got sidetracked with a million other things yesterday.

@cendhu
Copy link
Contributor Author

cendhu commented Apr 16, 2020

/azp run

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 1092 in repo hyperledger/fabric

@lindluni
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lindluni
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lindluni
Copy link
Contributor

This seems to be working so far

when tests are run in parallel, CouchDB related test may not
get adequate time to successfully test asynchronous operations.

Signed-off-by: senthil <cendhu@gmail.com>
@cendhu cendhu force-pushed the fix-couchdb-flakes branch from 45f28c6 to d243d25 Compare April 17, 2020 11:16
@cendhu cendhu changed the title [LEDGER] fix couchDB UT flakes: increase waitFor couchdb index creation [LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) Apr 17, 2020
@cendhu
Copy link
Contributor Author

cendhu commented Apr 17, 2020

Yes. It would add 22 extra seconds to test but saves much more time (as we may not need to rerun tests often). If there is no objection from maintainers, this PR can be merged and we can check whether the problem reappears.

@lindluni lindluni merged commit e70cf33 into hyperledger:master Apr 17, 2020
@@ -662,7 +662,7 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) {
}
return true
}
assert.Eventually(t, queryUsingIndex, 2*time.Second, 100*time.Millisecond, "error executing query with sort")
Copy link
Contributor

@sykesm sykesm Apr 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the assertion failure that I'm seeing in CI. I see that there was another change for TestHandleChaincodeDeploy but I don't know if that was due to an actual failure.

2020-04-17T14:17:09.8628824Z --- FAIL: TestHandleChaincodeDeployErroneousIndexFile (2.30s)
2020-04-17T14:17:09.8629124Z     statecouchdb_test.go:39: Initializing TestVDBEnv
2020-04-17T14:17:09.8629382Z     statecouchdb_test.go:741: 
2020-04-17T14:17:09.8629650Z         	Error Trace:	statecouchdb_test.go:741
2020-04-17T14:17:09.8629922Z         	Error:      	Condition never satisfied
2020-04-17T14:17:09.8630206Z         	Test:       	TestHandleChaincodeDeployErroneousIndexFile
2020-04-17T14:17:09.8630518Z         	Messages:   	error executing query with sort
2020-04-17T14:17:09.8630796Z     statecouchdb_test.go:86: Cleaningup TestVDBEnv

Focusing on this test, and to be fair, I'm not really sure what the intent is...

The name implies we want to see what happens when a bad index file is used, asserts there's an error creating the indexes, and then there's a query that actually depends on the creation of the index from indexSizeSortName.json.

The problem is that ProcessIndexesForChaincodeDeploy is iterating over the map of indexData and returning on the first error. As map iteration is non-deterministic, there's a 50/50 shot this the first index is the bad one. When that happens, the second index isn't created and the assertion in eventually will never succeed.

So, to me, the fix would be to make this deterministic. One way is to process the "good" index first, assert no errors, and then do the bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My local change, for example, successfully runs 100 iterations of the test:

✓ [11:30:35] [~/workspace/fabric/src/github.com/hyperledger/fabric]
[sykesm@trapped] (master *)$ go test -run TestHandleChaincodeDeployErroneousIndexFile -failfast -count 100 ./core/ledger/kvledger/txmgmt/statedb/statecouchdb/
ok  	github.com/hyperledger/fabric/core/ledger/kvledger/txmgmt/statedb/statecouchdb	33.235s
diff --git a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
index 58530ca19..435091b00 100644
--- a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
+++ b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
@@ -718,18 +718,21 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) {
        batch.Put("ns1", "key1", []byte(`{"asset_name": "marble1","color": "blue","size": 1,"owner": "tom"}`), version.NewHeight(1, 1))
        batch.Put("ns1", "key2", []byte(`{"asset_name": "marble2","color": "blue","size": 2,"owner": "jerry"}`), version.NewHeight(1, 2))
 
-       badSyntaxFileContent := `{"index":{"fields": This is a bad json}`
        indexData := map[string][]byte{
                "META-INF/statedb/couchdb/indexes/indexSizeSortName.json": []byte(`{"index":{"fields":[{"size":"desc"}]},"ddoc":"indexSizeSortName","name":"indexSizeSortName","type":"json"}`),
-               "META-INF/statedb/couchdb/indexes/badSyntax.json":         []byte(badSyntaxFileContent),
        }
 
        indexCapable, ok := db.(statedb.IndexCapable)
        if !ok {
                t.Fatalf("Couchdb state impl is expected to implement interface `statedb.IndexCapable`")
        }
-       assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
+       assert.NoError(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
 
+       badSyntaxFileContent := `{"index":{"fields": This is a bad json}`
+       indexData = map[string][]byte{
+               "META-INF/statedb/couchdb/indexes/badSyntax.json": []byte(badSyntaxFileContent),
+       }
+       assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
        queryString := `{"selector":{"owner":"fred"}, "sort": [{"size": "desc"}]}`
        queryUsingIndex := func() bool {
                _, err = db.ExecuteQuery("ns1", queryString)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @sykesm for teaching me how to do it correctly. We have to be careful with the maps. This non-determinism was introduced by #966 As you have the fix, can you please revert the changes I made and push yours?

@btl5037 Sorry for the trouble. I had a wrong hypothesis on the resource usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sykesm This could create non-determinism even in the production setup. Some peers would have the index and some wouldn't. Hence, let me discuss this issue in ledger scrum today and push a fix by tonight.

@cendhu cendhu deleted the fix-couchdb-flakes branch April 18, 2020 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants