Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix data race on tx.Stats #373

Merged
merged 7 commits into from
Jan 4, 2023
Merged

fix data race on tx.Stats #373

merged 7 commits into from
Jan 4, 2023

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Jan 2, 2023

Fixes: #213

Signed-off-by: Wei Fu fuweid89@gmail.com


RFC: Even if updating the type of Stats interface breaks the package usage, I think it should define the size of field with specific type instead of int. So we can use atomic to update the stats instead of introducing the lock.

cc @ahrtr @ptabor

Fixes: etcd-io#213

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@ahrtr
Copy link
Member

ahrtr commented Jan 3, 2023

The change looks good to me.

Could you please update the workflows/tests.yaml to add race test using matrix? It can prevent such kind of issue from happening again in future.

Reference: https://github.com/etcd-io/etcd/blob/main/.github/workflows/tests.yaml

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid fuweid marked this pull request as draft January 3, 2023 06:42
@fuweid
Copy link
Member Author

fuweid commented Jan 3, 2023

mark it draft for testing.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid fuweid force-pushed the fix-213 branch 2 times, most recently from 662922b to e878ba8 Compare January 3, 2023 07:26
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The test log:

==================
WARNING: DATA RACE
Write at 0x00c000388370 by goroutine 37:
  go.etcd.io/bbolt/internal/btesting.MustOpenDBWithOption()
      /home/runner/work/bbolt/bbolt/internal/btesting/btesting.go:52 +0x1c7
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo.func1()
      /home/runner/work/bbolt/bbolt/db_test.go:685 +0x1f4
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo·dwrap·18()
      /home/runner/work/bbolt/bbolt/db_test.go:714 +0x47

Previous write at 0x00c000388370 by goroutine 83:
  go.etcd.io/bbolt/internal/btesting.MustOpenDBWithOption()
      /home/runner/work/bbolt/bbolt/internal/btesting/btesting.go:52 +0x1c7
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo.func1()
      /home/runner/work/bbolt/bbolt/db_test.go:685 +0x1f4
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo·dwrap·18()
      /home/runner/work/bbolt/bbolt/db_test.go:714 +0x47

Goroutine 37 (running) created at:
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo()
      /home/runner/work/bbolt/bbolt/db_test.go:714 +0x724
  testing.tRunner()
      /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1306 +0x47

Goroutine 83 (running) created at:
  go.etcd.io/bbolt_test.TestDB_Concurrent_WriteTo()
      /home/runner/work/bbolt/bbolt/db_test.go:714 +0x724
  testing.tRunner()
      /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1259 +0x22f
  testing.(*T).Run·dwrap·21()
      /opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1306 +0x47
==================

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid fuweid force-pushed the fix-213 branch 7 times, most recently from cbc6940 to 4e98e8f Compare January 3, 2023 14:20
;;
linux-amd64-unit-test-4-cpu-race)
CPU=4 ENABLE_RACE=true make test-simulate
CPU=4 ENABLE_RACE=true make test
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea about the cpu usage when enable -race.

It seems that the standard github action runner doesn't allow user to run high cpu usage even if it provides two vcore to us. It will shutdown the runner or kill the process with SIGTERM.

I didn't see it before. maybe we can enable -race later?

@ahrtr

REF:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Etcd is running unit tests with 'race': https://github.com/etcd-io/etcd/actions/runs/3821152800/jobs/6500015229

I would try...
If it would not work, I would try running the --shorttests alone (that should be quick enough to not trigger high CPU usage).

Copy link
Member Author

@fuweid fuweid Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I was trying to use new go version, use taskset -c 0 to limit it in one core and limit it with one go queue.
But it all failed. Maybe it hits some limitation in the underlay of runner...

Side note: It always fails when test for array freelist type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mental model is that we should be deprecating array-freelist...
I'm not aware about adventages over the map-driven implementation.
So I would be OK for not testing in --race the array one.

@fuweid fuweid marked this pull request as ready for review January 3, 2023 14:33
Makefile Outdated

.PHONY: race fmt test lint
.PHONY: fmt test test-simulate lint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it add value to have both here: "test" and "test-simulate" ?
It seems to me that test-simulate is subset of the 'test'.

If we cannot afford running all tests in the --race mode, probably --short and 'test-simulate' in the race mode is offering a cross-cutting coverage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Since we still need to support array-freelist although I agree to deprecate it, I would suggest to only run test with flag --short for array-freelist.

FYI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted the test-simulate since it is duplicate.
And I split the make test into make test-freelist-hashmap and make test-freelist-array so that we can use EXTRA_TESTFLAGS=-short for array.

@@ -672,27 +673,27 @@ func (tx *Tx) Page(id int) (*PageInfo, error) {
// TxStats represents statistics about the actions performed by the transaction.
type TxStats struct {
// Page statistics.
PageCount int // number of page allocations
PageAlloc int // total bytes allocated
PageCount int64 // number of page allocations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go a step forward:

  1. Warn on the tx.Stats() method that direct usage of the fields is deprecated
    due to lack of safety on e.g. 32-bit architectures.

  2. Add to this class 2 methods for each field:

func (t* TxStats)IncrementPageCount(delta int64) {
return atomic...
}

func (t* TxStats)GetPageCount(delta int64) {
return atomic.LoadInt64(...)
}

such that user's can write safe code without depending on the arch-properties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point!

The only minor comment is about the name on the Getter method, It isn't a best practice to add the prefix Get, for example, PageCount is better than GetPageCount in above example. FYI. https://go.dev/doc/effective_go#Getters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But we already have txStats.PageCount field.
I assume we cannot rename it (within major version) to txStats.pageCount as it would be breaking change of public API surface.
We also cannot have a field and method with the same name in a single class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. PTAL~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to add comment to notify that we are planning not to expose each field, e.g change PageCount to pageCount .

fuweid added 2 commits January 4, 2023 20:05
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>

.PHONY: race fmt test lint
.PHONY: fmt test test-freelist-hashmap test-freelist-array lint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's still redundant... test resolves to both: test-freelist-{...} targets. And no race run by default at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this combination in action and 75% successful rate for the race matrix 😂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add the following matrix in the follow-up pr.

     linux-amd64-unit-test-2-cpu-simulate-race)
           CPU=2 ENABLE_RACE=true EXTRA_TESTFLAGS="-test.run='TestSimulate_(100op|1000op)'" make test            
           ;;

os: [ubuntu-latest, windows-latest]
runs-on: ${{ matrix.os }}
target:
- linux-amd64-unit-test-4-cpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better we have the following three targets to cover the no-race cases:

  • linux-amd64-unit-test-1-cpu
  • linux-amd64-unit-test-2-cpu
  • linux-amd64-unit-test-4-cpu

Of course, it can be addressed in a separate PR if there is no any objection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me with a couple of minor comments which can be addressed in a separate PR.

Thank you @fuweid

@ahrtr ahrtr merged commit a938f00 into etcd-io:master Jan 4, 2023
@fuweid fuweid deleted the fix-213 branch January 5, 2023 02:38
@fuweid
Copy link
Member Author

fuweid commented Jan 5, 2023

Thanks @ahrtr @ptabor for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Race in bucket stats
3 participants