-
Notifications
You must be signed in to change notification settings - Fork 660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix data race on tx.Stats #373
Conversation
Fixes: etcd-io#213 Signed-off-by: Wei Fu <fuweid89@gmail.com>
The change looks good to me. Could you please update the workflows/tests.yaml to add race test using matrix? It can prevent such kind of issue from happening again in future. Reference: https://github.com/etcd-io/etcd/blob/main/.github/workflows/tests.yaml |
Signed-off-by: Wei Fu <fuweid89@gmail.com>
mark it draft for testing. |
Signed-off-by: Wei Fu <fuweid89@gmail.com>
662922b
to
e878ba8
Compare
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The test log: ================== WARNING: DATA RACE Write at 0x00c000388370 by goroutine 37: go.etcd.io/bbolt/internal/btesting.MustOpenDBWithOption() /home/runner/work/bbolt/bbolt/internal/btesting/btesting.go:52 +0x1c7 go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo.func1() /home/runner/work/bbolt/bbolt/db_test.go:685 +0x1f4 go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo·dwrap·18() /home/runner/work/bbolt/bbolt/db_test.go:714 +0x47 Previous write at 0x00c000388370 by goroutine 83: go.etcd.io/bbolt/internal/btesting.MustOpenDBWithOption() /home/runner/work/bbolt/bbolt/internal/btesting/btesting.go:52 +0x1c7 go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo.func1() /home/runner/work/bbolt/bbolt/db_test.go:685 +0x1f4 go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo·dwrap·18() /home/runner/work/bbolt/bbolt/db_test.go:714 +0x47 Goroutine 37 (running) created at: go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo() /home/runner/work/bbolt/bbolt/db_test.go:714 +0x724 testing.tRunner() /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1259 +0x22f testing.(*T).Run·dwrap·21() /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1306 +0x47 Goroutine 83 (running) created at: go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo() /home/runner/work/bbolt/bbolt/db_test.go:714 +0x724 testing.tRunner() /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1259 +0x22f testing.(*T).Run·dwrap·21() /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1306 +0x47 ================== Signed-off-by: Wei Fu <fuweid89@gmail.com>
cbc6940
to
4e98e8f
Compare
.github/workflows/tests.yaml
Outdated
;; | ||
linux-amd64-unit-test-4-cpu-race) | ||
CPU=4 ENABLE_RACE=true make test-simulate | ||
CPU=4 ENABLE_RACE=true make test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea about the cpu usage when enable -race
.
It seems that the standard github action runner doesn't allow user to run high cpu usage even if it provides two vcore to us. It will shutdown the runner or kill the process with SIGTERM.
- ubuntu-latest: jobs fail with error code 143 actions/runner-images#6680 (comment)
- The runner has received a shutdown signal. actions/runner-images#6709
I didn't see it before. maybe we can enable -race
later?
REF:
- https://github.com/fuweid/bbolt/actions/runs/3827981238/jobs/6513066220 (runner sends SIGTERM to
go test
- https://github.com/fuweid/bbolt/actions/runs/3828126044/jobs/6513369814 (shutdown the runner)
- https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Etcd is running unit tests with 'race': https://github.com/etcd-io/etcd/actions/runs/3821152800/jobs/6500015229
I would try...
If it would not work, I would try running the --short
tests alone (that should be quick enough to not trigger high CPU usage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I was trying to use new go version, use taskset -c 0
to limit it in one core and limit it with one go queue.
But it all failed. Maybe it hits some limitation in the underlay of runner...
Side note: It always fails when test for array freelist type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mental model is that we should be deprecating array-freelist...
I'm not aware about adventages over the map-driven implementation.
So I would be OK for not testing in --race the array one.
Makefile
Outdated
|
||
.PHONY: race fmt test lint | ||
.PHONY: fmt test test-simulate lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it add value to have both here: "test" and "test-simulate" ?
It seems to me that test-simulate is subset of the 'test'.
If we cannot afford running all tests in the --race mode, probably --short and 'test-simulate' in the race mode is offering a cross-cutting coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Since we still need to support array-freelist
although I agree to deprecate it, I would suggest to only run test with flag --short
for array-freelist
.
FYI.
- https://github.com/etcd-io/etcd/blob/e73f55d4e94666c99558baa2fd4e365aeaca4dc4/scripts/test.sh#L100
- https://github.com/etcd-io/etcd/blob/e73f55d4e94666c99558baa2fd4e365aeaca4dc4/tests/e2e/etcd_config_test.go#L33
Lines 34 to 36 in b654ce9
if testing.Short() { t.Skip("skipping test in short mode.") }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted the test-simulate
since it is duplicate.
And I split the make test
into make test-freelist-hashmap
and make test-freelist-array
so that we can use EXTRA_TESTFLAGS=-short
for array
.
@@ -672,27 +673,27 @@ func (tx *Tx) Page(id int) (*PageInfo, error) { | |||
// TxStats represents statistics about the actions performed by the transaction. | |||
type TxStats struct { | |||
// Page statistics. | |||
PageCount int // number of page allocations | |||
PageAlloc int // total bytes allocated | |||
PageCount int64 // number of page allocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would go a step forward:
-
Warn on the tx.Stats() method that direct usage of the fields is deprecated
due to lack of safety on e.g. 32-bit architectures. -
Add to this class 2 methods for each field:
func (t* TxStats)IncrementPageCount(delta int64) {
return atomic...
}
func (t* TxStats)GetPageCount(delta int64) {
return atomic.LoadInt64(...)
}
such that user's can write safe code without depending on the arch-properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point!
The only minor comment is about the name on the Getter method, It isn't a best practice to add the prefix Get
, for example, PageCount
is better than GetPageCount
in above example. FYI. https://go.dev/doc/effective_go#Getters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. But we already have txStats.PageCount
field.
I assume we cannot rename it (within major version) to txStats.pageCount
as it would be breaking change of public API surface.
We also cannot have a field and method with the same name in a single class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. PTAL~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to add comment to notify that we are planning not to expose each field, e.g change PageCount
to pageCount
.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
|
||
.PHONY: race fmt test lint | ||
.PHONY: fmt test test-freelist-hashmap test-freelist-array lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's still redundant... test resolves to both: test-freelist-{...} targets. And no race run by default at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this combination in action and 75% successful rate for the race matrix 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add the following matrix in the follow-up pr.
linux-amd64-unit-test-2-cpu-simulate-race)
CPU=2 ENABLE_RACE=true EXTRA_TESTFLAGS="-test.run='TestSimulate_(100op|1000op)'" make test
;;
os: [ubuntu-latest, windows-latest] | ||
runs-on: ${{ matrix.os }} | ||
target: | ||
- linux-amd64-unit-test-4-cpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better we have the following three targets to cover the no-race cases:
- linux-amd64-unit-test-1-cpu
- linux-amd64-unit-test-2-cpu
- linux-amd64-unit-test-4-cpu
Of course, it can be addressed in a separate PR if there is no any objection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me with a couple of minor comments which can be addressed in a separate PR.
Thank you @fuweid
Fixes: #213
Signed-off-by: Wei Fu fuweid89@gmail.com
RFC: Even if updating the type of Stats interface breaks the package usage, I think it should define the size of field with specific type instead of int. So we can use atomic to update the stats instead of introducing the lock.
cc @ahrtr @ptabor