[Service] Partial start can cause panic #6507
Comments
Should we call "Shutdown" only on the components that we started?
I know we had spoken about this over Slack, but just to recap:
I don't think we should, mainly because that would imply that the service is managing that state, which I think would overcomplicate the cognitive model and potentially break OpAMP. However, I do think that following the example of the behaviour for
What I would expect as a definition of done:
cc: @tigrannajaryan since this may impact work with OpAMP
@MovieStoreGuy can you expand on this? How can this break OpAMP?
My concern was that if we start mandating that components are not restartable, then OpAMP may have to change how it manages components. However, I believe we should keep restartable components but add in that
We do not necessarily need the component instances to be restartable. The current restart logic is that we shut down all components and create new components before we start them. This seems to be a cleaner solution since it removes the need to rely on component implementations being restartable.
Independently from this, we need to decide who is responsible for tracking that Start was called and didn't fail, so that Shutdown must be called for that component. I am reluctant to make this a component responsibility. If any component implements this state tracking incorrectly, we will have a problem. I would prefer that the Service keeps track of Start-ed components and calls their Shutdown (only if Start succeeded). We can do this in one place in the Service and make sure it works correctly for all components. Not a strong opinion, but this seems to be a better option to me.
@MovieStoreGuy I wanted to check in on the status of this issue as it's been open for a few months. We've run into this quite a few times with our distribution of the collector, especially using the
I agree with this. Restartability could introduce a lot of complexity. I would prefer we stick with this approach for now. The main downside of this approach is that state can be lost, but I think we can solve this in the future by supporting a notion of passing state to subsequent instances of a component. In any case, I think restartability is a separate concern from the original issue.
It seems we've already done a few things to address this. (Clarification added in #6535, and tests that call
However, I'm not convinced we're addressing a partial start, as called out in the original issue. When
As a simple example, consider:
type MyComp struct {
    statefulThingOne
    statefulThingTwo
}
func (c *MyComp) Start(...) error {
    var err error
    c.statefulThingOne = newState(...)
    c.statefulThingTwo, err = mightFail(...) // returns nil, err in case of failure
    return err
}
func (c *MyComp) Shutdown(...) error {
    c.statefulThingOne.Stop()
    c.statefulThingTwo.Stop() // oops, nil pointer
    return nil
}
If an error returned from
If on the other hand, we call
This is a pattern we can use to support partial starts:
I am finding out that the tests we have in otelcontribcol do not do a good job catching all issues with a Shutdown call because they don't account for errors returned on creating the component.
Describe the bug
On service start up, if a component returns an error, the start up process is stopped and then the shutdown method is invoked.
If a component requires Start to be called in order to populate internal values (e.g., storing a reference to a context.CancelFunc), then service.Shutdown will panic and the original error is lost.
Steps to reproduce
If possible, provide a recipe for reproducing the error.
Have two components, one that errors on start up and another that requires Start to be called for the Shutdown method to be valid.
For example, the prometheus receiver requires Start to be called in order for Shutdown to be valid.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/metrics_receiver.go#L307-L312
However, I believe that is a separate concern.
What did you expect to see?
I think there are a few parts to this. The first is to clarify the expected behaviour for component.Component, including whether Shutdown can be called without a call to Start.
I think resolving that expectation will dictate what the following steps would be.
What did you see instead?
As the bug describes, it is possible for a graceful shutdown to panic.
What version did you use?
Version: v0.63.0
However, this impacts all versions I believe
What config did you use?
NA
Environment
OS: Darwin/arm & linux/arm (within a docker container)
Compiler (if manually compiled): go 1.18.4 and go 1.19.2