Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: server run state management and library usage #1331

Open
cstockton opened this issue May 17, 2017 · 5 comments
Open

proposal: server run state management and library usage #1331

cstockton opened this issue May 17, 2017 · 5 comments

Comments

@cstockton
Copy link

Today the relation of Start, Stop, Serve and Shutdown is not immediately clear for an end user, making it difficult to use GoBGP as a library. I propose the GoBGP team defines the intended responsibility of Start, Stop, Serve and Shutdown. To kick off the discussion I'll create a summary inferred from the current design and the changes I would suggest from the perspective of an end user, we can go from there based on the backwards compatibility guarantees provided by the GoBGP team on what a good compromise is.

GoBGP Server guards against race conditions by synchronizing access to internal state through a single Go routine that executes within Serve(). The two primary types of work Serve receives are management operations and network requests from a listener. All management operations are all synchronized through a single function mgmtOperation(f func() error, checkActive bool) error which accepts a flag to check a single binary state of active which is determined by the global config AS being non-zero for active, inactive otherwise. The only way to become active is to call Start()[1], which creates a management operation to be handled by Serve(). This means ALL (both inactive and active state) management operations will block forever until Serve() is called in a separate goroutine, but Serve() is essentially idle beyond beyond being a proxy for management operations until Start() causes the state to be active.

Once running there are two methods to transition to a inactive sate, Stop() error and Shutdown(). The Stop() error method will notify neighbors it is exiting and update its config to the zero value which causes a inactive state. This leaves the current Serve() loop running which causes the GC to be unable to reclaim any memorys from fields the server still references. This is most fields since Stop() error only clears the Global config and leaves dangling references in all other fields. The Shutdown() function serves a similar purpose to Stop() except it will make a call to os.Exit() after notifying neighbors the event occurred.

After a cursory audit of the existing design, I believe we should remove the coupling between the management operations and network requests processing in the event loop. Once complete we will have the opportunity to improve the public facing API. I suggest we remove the event loop for management operations entirely, instead using a mutex for shared local state. This has the following benefits:

  • Reduces allocations because function closures as well as any captured parameters do not need heap allocated unless they escape.
  • Reduces general scheduler and runtime pressure which should allow the GC more room to breath until some other improvements are made to lower allocations.
  • Primary motivator improves library usage of the API which will naturally make it easier to integrate into the standard Go tooling for benchmarks and profiling. I've discovered several areas I could reduce allocations greatly, having test coverage will allow making safe, measurable changes.

Below is an example of the current AddPath being changed as proposed. The 3 less allocs are due to the memory semantics changing as described above.

// New add path - (not sure about the empty uuid?)
func (s *BgpServer) AddPath(vrfId string, pathList []*table.Path) (uuidBytes []byte, err error) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if err = s.active(); err != nil {
		return
	}
	if err = s.fixupApiPath(vrfId, pathList); err != nil {
		return
	}
	if len(pathList) == 1 {
		pathList[0].AssignNewUUID()
	}
	s.propagateUpdate(nil, pathList)
	return
}

// New AddPath (Mutex):
//   BenchmarkAddPath-8 100000 12473 ns/op 4846 B/op 83 allocs/op
//
// Old AddPath (mgmtOperation):
//   BenchmarkAddPath-8 100000 19350 ns/op 5038 B/op 86 allocs/op
//
func BenchmarkAddPath(b *testing.B) {
	srv := MustStart(b, 1)
	p1 := &table.Policy{Name: "p1"}
	if err := srv.AddPolicy(p1, false); err != nil {
		b.Fatalf(`exp nil err from AddPolicy; got %v`, err)
	}

	log.SetLevel(log.WarnLevel)
	attrs := []bgp.PathAttributeInterface{
		bgp.NewPathAttributeOrigin(0),
		bgp.NewPathAttributeNextHop("10.0.0.1"),
	}
	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		c, d := i%5, i%255
		ip := fmt.Sprintf("10.0.%d.%d", c, d)
		pfx := bgp.NewIPAddrPrefix(32, ip)
		tbl := []*table.Path{table.NewPath(
			nil, pfx, false, attrs, time.Now(), false)}
		if _, err := srv.AddPath("", tbl); err != nil {
			b.Fatalf(`exp nil err from AddPath; got %v`, err)
		}
	}
}

Let me know if a pull request with the proposed changes would be acceptable, thanks!

-Chris

ishidawataru pushed a commit to ishidawataru/gobgp that referenced this issue May 24, 2017
fixes osrg#1331

Signed-off-by: Wataru Ishida <ishida.wataru@lab.ntt.co.jp>
ishidawataru pushed a commit to ishidawataru/gobgp that referenced this issue May 24, 2017
fixes osrg#1331

Signed-off-by: Wataru Ishida <ishida.wataru@lab.ntt.co.jp>
@ishidawataru
Copy link
Member

ishidawataru commented May 25, 2017

@cstockton Thanks for your suggestion.
It seems you are tackling both API usability and performance issue.
I'd like to focus on API usability first.

I think current problem is that BgpServer is managing two services, one is BGP service, and the other is service which starts and stops BGP service (BGP service manager).
Since the lifetime of the two services is different, the meaning of Stop(), Start(), Serve() gets ambiguous.

To avoid a radical change to the public API, I propose to provide a new method New() which starts both BGP service manager and BGP service at the same time.
Also, as a compromise, I propose to change the name Stop() to StopServer(), which is same with the corresponding gRPC API name and make Stop() stop both services.

I prototyped above. Please let me know what you think.

ishidawataru@7befc30

@cstockton
Copy link
Author

@ishidawataru Hello, thanks for the reply. The main reason I chose to use a lock rather than have run state management done over the channel is due to the chick and egg problem it creates currently, for example right now if you call gobgp.Start(cfg) before a call to go Serve() you are caught in a deadlock forever. So although performance is a natural benefit, my main motive is to allow better library usage so GoBGP can benefit from the Go standard tooling. The go test -race flag for example would be a good benefits since currently there are many race conditions in GoBGP that would be easier to debug and prevent future regressions using unit tests. For example "FSM" has goroutines created that will read config state from the GobgpServer, so modifying config or calling Stop() will always race against the FSM.

As for your change- I think the main PRO is that it allows the Serve() goroutine to exit. I think that the main CON is having New() create a goroutine will further compound the difficulty to fix the API as a whole down the road. The only reason this has to be done- is because the state transitions require it as mentioned above, this is why I suggest a lock, the server state can be mutated without a supporting goroutine event loop. But I did take some time to implement the changes and some issues surfaced that made me realize before the public API could be changed in a meaningful way for the GoBGP server, all of the supporting objects it uses must be fixed first. Otherwise making the GoBGP server "stoppable" and clear it's config is a race condition for FSM since it uses those structures, leaving potential for null pointer panics.

In short I would simply suggest that New() means one thing, allocate and initialize a object. Any public facing method that creates goroutines should accept a context.Context and pass this context to any goroutines it creates and block until the context or the work it performs is done.

Without this simple cooperation a service responsible for firing up a series of internal dependencies like the GobgpServer can struggle to properly exit without race conditions or other side effects. So GobgpServer can not accept context in a meaningful way until it is respected by its dependencies, like FSM and Watcher. Once those accept context the API for the GobgpServer can better support a cooperative run state with its dependents.

The tldr here is I think your change is GREAT internally to have Stop() end the event loop from Serve(). But I think leaving everything identical in the API would be better than adding MORE methods that do not take context and add more ambiguity about what is Stopping, shutting down, starting, etc. Just my personal opinion. Thanks for responding regardless.

@cstockton
Copy link
Author

@ishidawataru Would you accept patches that fixed race conditions and in the GoBGP server package? All changes could be made without affecting the public API initially. If not feel free to close this issue and thanks for the response.

@cstockton
Copy link
Author

@ishidawataru I was troubleshooting another race condition this weekend, that caused prefixes to end up being 0.0.0.0 due to a write during path serialization. It also seems the state machine Serve func still has some race conditions as well in master. I would still be happy to work towards some fixes and more use of context in the private API.

==================
WARNING: DATA RACE
Read at 0x00c420270358 by goroutine 68:
  github.com/osrg/gobgp/server.(*FSM).connectLoop()
      github.com/osrg/gobgp/server/fsm.go:392 +0x4ef
  github.com/osrg/gobgp/server.(*FSM).(github.com/osrg/gobgp/server.connectLoop)-fm()
      github.com/osrg/gobgp/server/fsm.go:226 +0x41
  gopkg.in/tomb%2ev2.(*Tomb).run()
      gopkg.in/tomb.v2/tomb.go:163 +0x38

Previous write at 0x00c420270358 by goroutine 10:
  github.com/osrg/gobgp/server.(*FSM).StateChange()
      github.com/osrg/gobgp/server/fsm.go:238 +0x4f6
  github.com/osrg/gobgp/server.(*BgpServer).handleFSMMessage()
      github.com/osrg/gobgp/server/server.go:821 +0x26ff
  github.com/osrg/gobgp/server.(*BgpServer).Serve.func1()
      github.com/osrg/gobgp/server/server.go:201 +0x2f5
  github.com/osrg/gobgp/server.(*BgpServer).Serve()
      github.com/osrg/gobgp/server/server.go:311 +0x741

Goroutine 68 (running) created at:
  gopkg.in/tomb%2ev2.(*Tomb).Go()
      gopkg.in/tomb.v2/tomb.go:159 +0x11b
  github.com/osrg/gobgp/server.NewFSM()
      github.com/osrg/gobgp/server/fsm.go:226 +0x6a6
  github.com/osrg/gobgp/server.NewPeer()
      github.com/osrg/gobgp/server/peer.go:111 +0x7b
  github.com/osrg/gobgp/server.(*BgpServer).addNeighbor()
      github.com/osrg/gobgp/server/server.go:1836 +0x4d0
  github.com/osrg/gobgp/server.(*BgpServer).AddNeighbor.func1()
      github.com/osrg/gobgp/server/server.go:1869 +0x4a
  github.com/osrg/gobgp/server.(*BgpServer).handleMGMTOp()
      github.com/osrg/gobgp/server/server.go:168 +0x63
  github.com/osrg/gobgp/server.(*BgpServer).Serve()
      github.com/osrg/gobgp/server/server.go:300 +0x7c9

Goroutine 10 (running) created at:
  main.main()
      github.com/osrg/gobgp/gobgpd/main.go:136 +0x467

@ishidawataru
Copy link
Member

Are you trying with the latest version?

serialization race should be fixed by the following commit.

cda7c44

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants