Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown should only be called if Start was called #9682

Open
djaglowski opened this issue Mar 4, 2024 · 8 comments
Open

Shutdown should only be called if Start was called #9682

djaglowski opened this issue Mar 4, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@djaglowski
Copy link
Member

Context: open-telemetry/opentelemetry-collector-contrib#31476 (comment)

Describe the bug
A failure to start any component results in a call to Shutdown on all components. Since some components never received a Start call to begin with, some expected state may not be present.

What did you expect to see?

We have had some previous problems with Shutdown causing panics because Start was never called. However, if I recall correctly were generally operate under the assumption that a component's Shutdown SHOULD only be called after the same component's Start is called.

What did you see instead?

Due to the behavior here, an error returned by any component results in Shutdown being called on every component.

What version did you use?

v0.95.0

Additional Context

Ideally, Shutdown functions should be resilient enough that this problem does not occur in any case. However, it may not always clear to component authors that they need to protect scenarios where Start was never called on the component.

Potential Solution

When iterating through the list of components to start, keep track of which ones have actually been started. When shutting down, call Shutdown on components which had Start called.

Another minor stability benefit here is that the shutdown order will perfectly correspond to the start order (though reversed). Currently, the order is determined independently via a topological sort. In some cases, this sort may be non-deterministic, so the exact sequence of components is not consistent. This is generally not a problem, but in theory could complicate troubleshooting which relies on component startup order.

@djaglowski djaglowski added the bug Something isn't working label Mar 4, 2024
@VihasMakwana
Copy link
Contributor

@djaglowski I can this up, it you haven't already planned to work on it.

@djaglowski
Copy link
Member Author

Thanks @VihasMakwana. I recommend we wait to hear from @open-telemetry/collector-maintainers before working on this.

@crobert-1
Copy link
Member

I've encountered this functionality of calling Shutdown without Start quite a few times while adding goleak to tests in various packages in core and contrib. The documentation link I always referred to with adding solutions to leaking goroutines was this. Specifically:

	// Shutdown is invoked during service shutdown. After Shutdown() is called, if the component
	// accepted data in any way, it should not accept it anymore.
	//
	// This method must be safe to call:
	//   - without Start() having been called

I totally understand the idea of "we shouldn't have to shutdown something that hasn't been started", but I have concerns if we remove this requirement.

Happy to share more details on my concern if there's more interest in removing this requirement, or if my documentation link is not relevant.

@djaglowski
Copy link
Member Author

Thanks for linking the documentation for this @crobert-1.

I have concerns if we remove this requirement.

To be clear, I'm not suggesting we remove the requirement. I just think that the cited issue demonstrates that due to interactions between components, it may a difficult problem to fully prevent. Therefore, we could choose to prevent some problems by avoiding calls to Shutdown which we know are at best no-ops.

@crobert-1
Copy link
Member

crobert-1 commented Mar 4, 2024

Okay, appreciate the clarification.

Referring back to the goleak work, I did find a number of components that start goroutines in their create methods, before Start is even called. Calling Shutdown on them every time means we don't end up leaking memory in that case. (Side note, I've yet to find a case where the goroutine had to be started there instead of Start, but I've left it since it's not a leak in this case. Also, New is called on every defined component in the configuration, even if it's not in a pipeline. This means that in the case where a component is defined but not in a pipeline, Shutdown is not called, so goroutine will be leaked. I guess I'm just realizing starting goroutines in New should probably not be allowed, but I'll have to investigate some more. NewFactory is called on every valid component, create is not called. From this, NewFactory should never run goroutines, but it's okay in create as long as Shutdown is called on every component that had create called.)

Panics are worse than memory leaks, but just wanted to bring up that we're hitting issues either way. Also, this seems like it's adding complexity for the sake of not requiring components to meet the specification, which doesn't seem right to me.

Interested to hear from others, just some things to consider. 👍

@dmitryax
Copy link
Member

dmitryax commented Mar 7, 2024

That's exactly why we have the lifecycle tests auto-generated for every component. One part of the test is to call Shutdown without calling start. This way we are making sure component meet the requirement. However, it can be skipped with a TODO which is the case for the journald receiver https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/journaldreceiver/metadata.yaml#L15

@djaglowski
Copy link
Member Author

@dmitryax, the difference here is that the lifecycle tests only validate each component in isolation. The problem here was an interaction between two components. I think it's a different kind of problem which we can't very well test for, but could prevent.

@atoulme
Copy link
Contributor

atoulme commented Mar 8, 2024

Please see the original report here: #6507

Lifecycle tests at least make sure we have each component being responsible of its own lifecycle. We can expand this over time. I believe one action we can take to allow for this is to have each component Shutdown method to be tested to run out of order.

The journald receiver is one of the last components that doesn't implement lifecycle tests, as per #27849. I would be happy to spend time to make it compatible with generated tests, and it will become more resilient as result. Between generated tests and go leak tests, we are trying to raise the bar for components so we can eventually try for more complex scenarios, such as the ones introduced with OpAmp which might make the collector able to reload its configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants