[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

cendhu · 2020-04-15T06:08:10Z

Type of change

Bug fix

Description

The CouchDB creates an index in the background and responds to the REST API immediately.
Currently, we wait only for 2 seconds before failing the test. Within the 2 seconds, every 100 milliseconds, CouchDB is queried to ensure that the passed index is being built and can be used for the query. On a first success, the test is passed.

As we have only 2 vCPUs, 4 GB of RAM to run CI, it may not be sufficient enough for CouchDB to build an index within the 2 seconds. Further, if we run tests in parallel, the CouchDB could face CPU contention too. Note that the CouchDB consumes 3x higher CPU utilization as compared to the goleveldb.

Hence, we increase the wait time to 6 seconds and the query time to 1 second.

Additional details

It is bad to do the fix with trail and error. However, these errors are quite hard to regenerate at the local development environment as the resource allocations varies significantly (like 8 vCPUs, 32 GB ram, SSD vs 2 vCPU, 4 GB ram, SDD).

Related issues

FAB-17755

Signed-off-by: senthil <cendhu@gmail.com>

cendhu · 2020-04-15T06:10:02Z

@btl5037 Let's rerun the test jobs for some number of times to gain some confidence. After that, we can merge if an increase in the wait time fixes the problem permanently. It is hard to reproduce the error in the local development environment due to differences in the resource allocation.

lindluni · 2020-04-15T15:46:58Z

/azp run

azure-pipelines · 2020-04-15T15:47:09Z

Azure Pipelines successfully started running 1 pipeline(s).

lindluni · 2020-04-15T16:14:21Z

Still failing unfortunately :(

cendhu · 2020-04-15T16:31:03Z

Yes 😞

Is it possible to get resource utilization of the VM? It would be good to see how much CPU, disk, and memory are utilized? If it is too much overloaded, fixing a waitFor time is hard. At least in my mac book pro, when the test is running, all cores are utilized fully though I have 8 cores.

cendhu · 2020-04-16T14:47:13Z

@btl5037 While you collect the resource utilization profile, to test my hypothesis on the overloaded situation, I have added statecouchdb to the serial package list. With this change, the test shouldn't fail as 2 dedicated CPU core should be good enough. Let's see what happens and then we can continue our analysis.

lindluni · 2020-04-16T14:49:16Z

@btl5037 While you collect the resource utilization profile, to test my hypothesis on the overloaded situation, I have added statecouchdb to the serial package list. With this change, the test shouldn't fail as 2 dedicated CPU core should be good enough. Let's see what happens and then we can continue our analysis.

I did try to use Microsoft Diagnostics, and there wasn't any useful information, I will go back and test the other method we discussed today. I got sidetracked with a million other things yesterday.

cendhu · 2020-04-16T16:32:13Z

/azp run

azure-pipelines · 2020-04-16T16:32:18Z

Commenter does not have sufficient privileges for PR 1092 in repo hyperledger/fabric

lindluni · 2020-04-17T02:35:21Z

/azp run

azure-pipelines · 2020-04-17T02:35:32Z

Azure Pipelines successfully started running 1 pipeline(s).

lindluni · 2020-04-17T04:29:01Z

/azp run

azure-pipelines · 2020-04-17T04:29:13Z

Azure Pipelines successfully started running 1 pipeline(s).

lindluni · 2020-04-17T04:49:52Z

This seems to be working so far

when tests are run in parallel, CouchDB related test may not get adequate time to successfully test asynchronous operations. Signed-off-by: senthil <cendhu@gmail.com>

cendhu · 2020-04-17T11:37:58Z

Yes. It would add 22 extra seconds to test but saves much more time (as we may not need to rerun tests often). If there is no objection from maintainers, this PR can be merged and we can check whether the problem reappears.

sykesm · 2020-04-17T15:25:34Z

core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go

@@ -662,7 +662,7 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) {
 		}
 		return true
 	}
-	assert.Eventually(t, queryUsingIndex, 2*time.Second, 100*time.Millisecond, "error executing query with sort")


This is the assertion failure that I'm seeing in CI. I see that there was another change for TestHandleChaincodeDeploy but I don't know if that was due to an actual failure.

2020-04-17T14:17:09.8628824Z --- FAIL: TestHandleChaincodeDeployErroneousIndexFile (2.30s) 2020-04-17T14:17:09.8629124Z statecouchdb_test.go:39: Initializing TestVDBEnv 2020-04-17T14:17:09.8629382Z statecouchdb_test.go:741: 2020-04-17T14:17:09.8629650Z Error Trace: statecouchdb_test.go:741 2020-04-17T14:17:09.8629922Z Error: Condition never satisfied 2020-04-17T14:17:09.8630206Z Test: TestHandleChaincodeDeployErroneousIndexFile 2020-04-17T14:17:09.8630518Z Messages: error executing query with sort 2020-04-17T14:17:09.8630796Z statecouchdb_test.go:86: Cleaningup TestVDBEnv

Focusing on this test, and to be fair, I'm not really sure what the intent is...

The name implies we want to see what happens when a bad index file is used, asserts there's an error creating the indexes, and then there's a query that actually depends on the creation of the index from indexSizeSortName.json.

The problem is that ProcessIndexesForChaincodeDeploy is iterating over the map of indexData and returning on the first error. As map iteration is non-deterministic, there's a 50/50 shot this the first index is the bad one. When that happens, the second index isn't created and the assertion in eventually will never succeed.

So, to me, the fix would be to make this deterministic. One way is to process the "good" index first, assert no errors, and then do the bad.

My local change, for example, successfully runs 100 iterations of the test:

✓ [11:30:35] [~/workspace/fabric/src/github.com/hyperledger/fabric] [sykesm@trapped] (master *)$ go test -run TestHandleChaincodeDeployErroneousIndexFile -failfast -count 100 ./core/ledger/kvledger/txmgmt/statedb/statecouchdb/ ok github.com/hyperledger/fabric/core/ledger/kvledger/txmgmt/statedb/statecouchdb 33.235s

diff --git a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go index 58530ca19..435091b00 100644 --- a/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go +++ b/core/ledger/kvledger/txmgmt/statedb/statecouchdb/statecouchdb_test.go @@ -718,18 +718,21 @@ func TestHandleChaincodeDeployErroneousIndexFile(t *testing.T) { batch.Put("ns1", "key1", []byte(`{"asset_name": "marble1","color": "blue","size": 1,"owner": "tom"}`), version.NewHeight(1, 1)) batch.Put("ns1", "key2", []byte(`{"asset_name": "marble2","color": "blue","size": 2,"owner": "jerry"}`), version.NewHeight(1, 2)) - badSyntaxFileContent := `{"index":{"fields": This is a bad json}` indexData := map[string][]byte{ "META-INF/statedb/couchdb/indexes/indexSizeSortName.json": []byte(`{"index":{"fields":[{"size":"desc"}]},"ddoc":"indexSizeSortName","name":"indexSizeSortName","type":"json"}`), - "META-INF/statedb/couchdb/indexes/badSyntax.json": []byte(badSyntaxFileContent), } indexCapable, ok := db.(statedb.IndexCapable) if !ok { t.Fatalf("Couchdb state impl is expected to implement interface `statedb.IndexCapable`") } - assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData)) + assert.NoError(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData)) + badSyntaxFileContent := `{"index":{"fields": This is a bad json}` + indexData = map[string][]byte{ + "META-INF/statedb/couchdb/indexes/badSyntax.json": []byte(badSyntaxFileContent), + } + assert.Error(t, indexCapable.ProcessIndexesForChaincodeDeploy("ns1", indexData)) queryString := `{"selector":{"owner":"fred"}, "sort": [{"size": "desc"}]}` queryUsingIndex := func() bool { _, err = db.ExecuteQuery("ns1", queryString)

Thanks, @sykesm for teaching me how to do it correctly. We have to be careful with the maps. This non-determinism was introduced by #966 As you have the fix, can you please revert the changes I made and push yours?

@btl5037 Sorry for the trouble. I had a wrong hypothesis on the resource usage.

@sykesm This could create non-determinism even in the production setup. Some peers would have the index and some wouldn't. Hence, let me discuss this issue in ledger scrum today and push a fix by tonight.

increase waitFor couchdb index creation

d8c2b2c

Signed-off-by: senthil <cendhu@gmail.com>

cendhu requested a review from a team as a code owner April 15, 2020 06:08

cendhu changed the title ~~[LEDGER] couchDB UT flakes: increase waitFor couchdb index creation~~ [LEDGER] fix couchDB UT flakes: increase waitFor couchdb index creation Apr 15, 2020

add statecouchdb to serial package list

d243d25

when tests are run in parallel, CouchDB related test may not get adequate time to successfully test asynchronous operations. Signed-off-by: senthil <cendhu@gmail.com>

cendhu force-pushed the fix-couchdb-flakes branch from 45f28c6 to d243d25 Compare April 17, 2020 11:16

cendhu changed the title ~~[LEDGER] fix couchDB UT flakes: increase waitFor couchdb index creation~~ [LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) Apr 17, 2020

cendhu mentioned this pull request Apr 17, 2020

Warn if a docker network may cause issues on macOS #1115

Merged

lindluni approved these changes Apr 17, 2020

View reviewed changes

lindluni merged commit e70cf33 into hyperledger:master Apr 17, 2020

sykesm reviewed Apr 17, 2020

View reviewed changes

sykesm mentioned this pull request Apr 17, 2020

Revert #1092: fix couchDB UT flakes #1118

Merged

cendhu deleted the fix-couchdb-flakes branch April 18, 2020 05:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

cendhu commented Apr 15, 2020

cendhu commented Apr 15, 2020 •

edited

Loading

lindluni commented Apr 15, 2020

azure-pipelines bot commented Apr 15, 2020

lindluni commented Apr 15, 2020

cendhu commented Apr 15, 2020

cendhu commented Apr 16, 2020

lindluni commented Apr 16, 2020

cendhu commented Apr 16, 2020

azure-pipelines bot commented Apr 16, 2020

lindluni commented Apr 17, 2020

azure-pipelines bot commented Apr 17, 2020

lindluni commented Apr 17, 2020

azure-pipelines bot commented Apr 17, 2020

lindluni commented Apr 17, 2020

cendhu commented Apr 17, 2020

sykesm Apr 17, 2020 •

edited

Loading

sykesm Apr 17, 2020

cendhu Apr 17, 2020

cendhu Apr 17, 2020

[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

[LEDGER] fix couchDB UT flakes: increase waitFor and add statecouchdb to serial package list (in UT) #1092

Conversation

cendhu commented Apr 15, 2020

Type of change

Description

Additional details

Related issues

cendhu commented Apr 15, 2020 • edited Loading

lindluni commented Apr 15, 2020

azure-pipelines bot commented Apr 15, 2020

lindluni commented Apr 15, 2020

cendhu commented Apr 15, 2020

cendhu commented Apr 16, 2020

lindluni commented Apr 16, 2020

cendhu commented Apr 16, 2020

azure-pipelines bot commented Apr 16, 2020

lindluni commented Apr 17, 2020

azure-pipelines bot commented Apr 17, 2020

lindluni commented Apr 17, 2020

azure-pipelines bot commented Apr 17, 2020

lindluni commented Apr 17, 2020

cendhu commented Apr 17, 2020

sykesm Apr 17, 2020 • edited Loading

Choose a reason for hiding this comment

sykesm Apr 17, 2020

Choose a reason for hiding this comment

cendhu Apr 17, 2020

Choose a reason for hiding this comment

cendhu Apr 17, 2020

Choose a reason for hiding this comment

cendhu commented Apr 15, 2020 •

edited

Loading

sykesm Apr 17, 2020 •

edited

Loading