proposal: server run state management and library usage #1331
Comments
fixes osrg#1331 Signed-off-by: Wataru Ishida <ishida.wataru@lab.ntt.co.jp>
@cstockton Thanks for your suggestion. I think the current problem is that To avoid a radical change to the public API, I propose to provide a new method I prototyped above. Please let me know what you think.
@ishidawataru Hello, thanks for the reply. The main reason I chose to use a lock rather than have run state management done over the channel is the chicken-and-egg problem it currently creates, for example right now if you call

As for your change: I think the main PRO is that it allows the Serve() goroutine to exit. I think the main CON is that having New() create a goroutine will further compound the difficulty of fixing the API as a whole down the road. The only reason this has to be done is that the state transitions require it, as mentioned above. This is why I suggest a lock: the server state can be mutated without a supporting goroutine event loop.

But I did take some time to implement the changes, and some issues surfaced that made me realize that before the public API can be changed in a meaningful way for the GoBGP server, all of the supporting objects it uses must be fixed first. Otherwise, making the GoBGP server "stoppable" and clearing its config is a race condition for FSM, since it uses those structures, leaving potential for null pointer panics.

In short, I would simply suggest that New() means one thing: allocate and initialize an object. Any public-facing method that creates goroutines should accept a

Without this simple cooperation, a service responsible for firing up a series of internal dependencies, like the GobgpServer, can struggle to exit properly without race conditions or other side effects. So GobgpServer cannot accept context in a meaningful way until it is respected by its dependencies, like FSM and Watcher. Once those accept context, the API for the GobgpServer can better support a cooperative run state with its dependents.

The tldr here is I think your change is GREAT internally to have
@ishidawataru Would you accept patches that fixed race conditions and in the GoBGP server package? All changes could be made without affecting the public API initially. If not feel free to close this issue and thanks for the response.
@ishidawataru I was troubleshooting another race condition this weekend, that caused prefixes to end up being 0.0.0.0 due to a write during path serialization. It also seems the state machine Serve func still has some race conditions as well in master. I would still be happy to work towards some fixes and more use of context in the private API.
Are you trying with the latest version? The serialization race should be fixed by the following commit.
Today the relation of Start, Stop, Serve and Shutdown is not immediately clear for an end user, making it difficult to use GoBGP as a library. I propose the GoBGP team define the intended responsibility of Start, Stop, Serve and Shutdown. To kick off the discussion, I'll summarize what I infer from the current design and the changes I would suggest from the perspective of an end user. We can go from there to find a good compromise, based on the backwards compatibility guarantees provided by the GoBGP team.
GoBGP Server guards against race conditions by synchronizing access to internal state through a single goroutine that executes within `Serve()`. The two primary types of work Serve receives are management operations and network requests from a listener. All management operations are synchronized through a single function, `mgmtOperation(f func() error, checkActive bool) error`, which accepts a flag to check a single binary state: active if the global config AS is non-zero, inactive otherwise. The only way to become active is to call `Start()` [1], which creates a management operation to be handled by `Serve()`. This means ALL management operations (in both the inactive and active states) will block forever until `Serve()` is called in a separate goroutine, but `Serve()` is essentially idle, beyond being a proxy for management operations, until `Start()` causes the state to be active.

Once running, there are two methods to transition to an inactive state, `Stop() error` and `Shutdown()`. The `Stop() error` method will notify neighbors it is exiting and update its config to the zero value, which causes an inactive state. This leaves the current `Serve()` loop running, which prevents the GC from reclaiming any memory from fields the server still references. This is most fields, since `Stop() error` only clears the Global config and leaves dangling references in all other fields. The `Shutdown()` function serves a similar purpose to `Stop()`, except it will make a call to `os.Exit()` after notifying neighbors the event occurred.

After a cursory audit of the existing design, I believe we should remove the coupling between management operations and network request processing in the event loop. Once complete we will have the opportunity to improve the public facing API. I suggest we remove the event loop for management operations entirely, instead using a mutex for shared local state. This has the following benefits:
Below is an example of the current AddPath being changed as proposed. The 3 fewer allocs are due to the memory semantics changing as described above.
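As a rough illustration of the two synchronization styles under discussion (this is not the actual GoBGP `AddPath` code, nor the benchmark from this proposal), the sketch below contrasts a state change funneled through an event-loop goroutine with the same change guarded by a mutex. The types `loopServer` and `mutexServer`, their fields, and the `addPath` methods are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// loopServer serializes state changes through a single goroutine.
// All names here are hypothetical, not GoBGP identifiers.
type loopServer struct {
	ops   chan func()
	paths map[string]bool
}

func newLoopServer() *loopServer {
	return &loopServer{ops: make(chan func()), paths: map[string]bool{}}
}

// serve runs queued operations one at a time until ops is closed.
func (s *loopServer) serve() {
	for f := range s.ops {
		f()
	}
}

// addPath blocks until serve() picks up the operation and finishes it.
// Each call creates a result channel and a closure.
func (s *loopServer) addPath(p string) error {
	done := make(chan error)
	s.ops <- func() {
		s.paths[p] = true
		done <- nil
	}
	return <-done
}

// mutexServer guards the same state with a lock; no goroutine is required.
type mutexServer struct {
	mu    sync.Mutex
	paths map[string]bool
}

func newMutexServer() *mutexServer {
	return &mutexServer{paths: map[string]bool{}}
}

// addPath mutates the state directly while holding the lock.
func (s *mutexServer) addPath(p string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.paths[p] = true
	return nil
}

func main() {
	ls := newLoopServer()
	go ls.serve() // Without this goroutine, ls.addPath would block forever.
	err := ls.addPath("10.0.0.0/24")
	close(ls.ops)
	fmt.Println("event loop:", err, len(ls.paths))

	ms := newMutexServer()
	err = ms.addPath("10.0.0.0/24")
	fmt.Println("mutex:", err, len(ms.paths))
}
```

In the event-loop variant a caller cannot make progress unless `serve()` is already running, while the mutex variant works on a freshly constructed value. That difference is the chicken-and-egg concern raised in this thread.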
Let me know if a pull request with the proposed changes would be acceptable, thanks!
-Chris