-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092
Conversation
Signed-off-by: senthil <cendhu@gmail.com>
@btl5037 Let's rerun the test jobs for some number of times to gain some confidence. After that, we can merge if an increase in the wait time fixes the problem permanently. It is hard to reproduce the error in the local development environment due to differences in the resource allocation. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Still failing unfortunately :( |
Yes 😞 Is it possible to get resource utilization of the VM? It would be good to see how much CPU, disk, and memory are utilized? If it is too much overloaded, fixing a waitFor time is hard. At least in my mac book pro, when the test is running, all cores are utilized fully though I have 8 cores. |
@btl5037 While you collect the resource utilization profile, to test my hypothesis on the overloaded situation, I have added |
I did try to use Microsoft Diagnostics, and there wasn't any useful information, I will go back and test the other method we discussed today. I got sidetracked with a million other things yesterday. |
/azp run |
Commenter does not have sufficient privileges for PR 1092 in repo hyperledger/fabric |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
This seems to be working so far |
when tests are run in parallel, CouchDB related test may not get adequate time to successfully test asynchronous operations. Signed-off-by: senthil <cendhu@gmail.com>
45f28c6
to
d243d25
Compare
Yes. It would add 22 extra seconds to test but saves much more time (as we may not need to rerun tests often). If there is no objection from maintainers, this PR can be merged and we can check whether the problem reappears. |
@@ -662,7 +662,7 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) { | |||
} | |||
return true | |||
} | |||
assert.Eventually(t, queryUsingIndex, 2*time.Second, 100*time.Millisecond, "error executing query with sort") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the assertion failure that I'm seeing in CI. I see that there was another change for TestHandleChaincodeDeploy
but I don't know if that was due to an actual failure.
2020-04-17T14:17:09.8628824Z --- FAIL: TestHandleChaincodeDeployErroneousIndexFile (2.30s)
2020-04-17T14:17:09.8629124Z statecouchdb_test.go:39: Initializing TestVDBEnv
2020-04-17T14:17:09.8629382Z statecouchdb_test.go:741:
2020-04-17T14:17:09.8629650Z Error Trace: statecouchdb_test.go:741
2020-04-17T14:17:09.8629922Z Error: Condition never satisfied
2020-04-17T14:17:09.8630206Z Test: TestHandleChaincodeDeployErroneousIndexFile
2020-04-17T14:17:09.8630518Z Messages: error executing query with sort
2020-04-17T14:17:09.8630796Z statecouchdb_test.go:86: Cleaningup TestVDBEnv
Focusing on this test, and to be fair, I'm not really sure what the intent is...
The name implies we want to see what happens when a bad index file is used, asserts there's an error creating the indexes, and then there's a query that actually depends on the creation of the index from indexSizeSortName.json
.
The problem is that ProcessIndexesForChaincodeDeploy
is iterating over the map of indexData
and returning on the first error. As map iteration is non-deterministic, there's a 50/50 shot this the first index is the bad one. When that happens, the second index isn't created and the assertion in eventually will never succeed.
So, to me, the fix would be to make this deterministic. One way is to process the "good" index first, assert no errors, and then do the bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My local change, for example, successfully runs 100 iterations of the test:
✓ [11:30:35] [~/workspace/fabric/src/github.com/hyperledger/fabric]
[sykesm@trapped] (master *)$ go test -run TestHandleChaincodeDeployErroneousIndexFile -failfast -count 100 ./core/ledger/kvledger/txmgmt/statedb/statecouchdb/
ok github.com/hyperledger/fabric/core/ledger/kvledger/txmgmt/statedb/statecouchdb 33.235s
diff --git a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
index 58530ca19..435091b00 100644
--- a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
+++ b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go
@@ -718,18 +718,21 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) {
batch.Put("ns1", "key1", []byte(`{"asset_name": "marble1","color": "blue","size": 1,"owner": "tom"}`), version.NewHeight(1, 1))
batch.Put("ns1", "key2", []byte(`{"asset_name": "marble2","color": "blue","size": 2,"owner": "jerry"}`), version.NewHeight(1, 2))
- badSyntaxFileContent := `{"index":{"fields": This is a bad json}`
indexData := map[string][]byte{
"META-INF/statedb/couchdb/indexes/indexSizeSortName.json": []byte(`{"index":{"fields":[{"size":"desc"}]},"ddoc":"indexSizeSortName","name":"indexSizeSortName","type":"json"}`),
- "META-INF/statedb/couchdb/indexes/badSyntax.json": []byte(badSyntaxFileContent),
}
indexCapable, ok := db.(statedb.IndexCapable)
if !ok {
t.Fatalf("Couchdb state impl is expected to implement interface `statedb.IndexCapable`")
}
- assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
+ assert.NoError(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
+ badSyntaxFileContent := `{"index":{"fields": This is a bad json}`
+ indexData = map[string][]byte{
+ "META-INF/statedb/couchdb/indexes/badSyntax.json": []byte(badSyntaxFileContent),
+ }
+ assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData))
queryString := `{"selector":{"owner":"fred"}, "sort": [{"size": "desc"}]}`
queryUsingIndex := func() bool {
_, err = db.ExecuteQuery("ns1", queryString)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sykesm This could create non-determinism even in the production setup. Some peers would have the index and some wouldn't. Hence, let me discuss this issue in ledger scrum today and push a fix by tonight.
Type of change
Description
The CouchDB creates an index in the background and responds to the REST API immediately.
Currently, we wait only for 2 seconds before failing the test. Within the 2 seconds, every 100 milliseconds, CouchDB is queried to ensure that the passed index is being built and can be used for the query. On a first success, the test is passed.
As we have only 2 vCPUs, 4 GB of RAM to run CI, it may not be sufficient enough for CouchDB to build an index within the 2 seconds. Further, if we run tests in parallel, the CouchDB could face CPU contention too. Note that the CouchDB consumes 3x higher CPU utilization as compared to the goleveldb.
Hence, we increase the wait time to 6 seconds and the query time to 1 second.
Additional details
It is bad to do the fix with trail and error. However, these errors are quite hard to regenerate at the local development environment as the resource allocations varies significantly (like 8 vCPUs, 32 GB ram, SSD vs 2 vCPU, 4 GB ram, SDD).
Related issues
FAB-17755