Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ECS health check extension #5144

Closed
wants to merge 26 commits into from
Closed

Conversation

skyduo
Copy link
Contributor

@skyduo skyduo commented Sep 7, 2021

Description:
This is a new health check extension feature for AWS/ECS, as we are going to monitor the OT collector health through the return status from endpoint of the new extension, detailed design in the design doc below.

Link to tracking Issue:
open-telemetry/opentelemetry-collector#2573

Testing:
make otelcontribcol and run executable file

Documentation:
https://docs.google.com/document/d/1SpUMsWA2DeaoVazeQ8uEc1Wvu5LphmQU_TjzLmuJ4QM/edit#heading=h.rs1luwizct2w

original pr which has some discussion with @bogdandrutu and previous pr #4451

@skyduo skyduo requested review from a team and owais September 7, 2021 20:27
@gillg
Copy link
Contributor

gillg commented Sep 7, 2021

Hello, open question about this interesting new extension !
What is ECS specific here ? It could be a "smart health check" or something like that to avoid confusion no ? Doesn't it can be used outside ECS just behind a load balancer, or in kubernetes, or in swarm ?

Copy link
Contributor

@hossain-rayhan hossain-rayhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@skyduo
Copy link
Contributor Author

skyduo commented Sep 8, 2021

Hello, open question about this interesting new extension !
What is ECS specific here ? It could be a "smart health check" or something like that to avoid confusion no ? Doesn't it can be used outside ECS just behind a load balancer, or in kubernetes, or in swarm ?

so basically OT already has a health check extension for general usage, and we create this new one because we want to set some specific healthy condition for ECS, that's why it's a ECS health check extension.

@anuraaga
Copy link
Contributor

anuraaga commented Sep 8, 2021

@skyduo This seems to be a copy of #4451? Please avoid fragmenting discussion by opening duplicate PRs, we should probably close this one.

@skyduo
Copy link
Contributor Author

skyduo commented Sep 8, 2021

@skyduo This seems to be a copy of #4451? Please avoid fragmenting discussion by opening duplicate PRs, we should probably close this one.

hey Anuraag! i will close the previous one due to that one has some rebase issue

extension/awsecshealthcheckextension/Makefile Outdated Show resolved Hide resolved
extension/awsecshealthcheckextension/config.go Outdated Show resolved Hide resolved
}

func (hc *ecsHealthCheckExtension) Start(_ context.Context, host component.Host) error {
hc.logger.Info("Starting ECS health check extension", zap.Any("config", hc.config))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to emit the entire config at Info? Maybe have a separate entry with it at Debug if it is useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's fine, just tell the user this extension starts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll restate:

This does not need to emit the entire config at Info? Have a separate entry with it at Debug if it is useful?

extension/awsecshealthcheckextension/extension.go Outdated Show resolved Hide resolved
Comment on lines +98 to +101
hc.exporter.mu.Lock()
defer hc.exporter.mu.Unlock()

return hc.config.ExporterErrorLimit >= len(hc.exporter.exporterErrorQueue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't access internals of the exporter like this. Add an accessor that can handle the locking itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this exporter is embedded in the extension and not used anywhere else, i think it's OK to lock and unlock like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be the current situation, but it is not good practice and it may not always be the case. If it is changed in the future then the person who changes it may not be aware of this use.

extension/awsecshealthcheckextension/extension.go Outdated Show resolved Hide resolved
@anuraaga
Copy link
Contributor

anuraaga commented Sep 8, 2021

In that case, let me copy in my comment. I think there is a recurring theme that people wonder why we are adding this extension since it doesn't have ECS-specific logic at all. We should all work together to improve the default health check instead as a community.

There doesn't seem to be anything ECS-specific in this code. Shouldn't the ability to become not-ready when there are an excessive number of exporter failures be added to the normal healthcheckextension instead of a new extension? This is broadly applicable.
Related is there is interest in rewriting that extension to use a system such as https://github.com/alexliesenfeld/health @jpkrohling @bogdandrutu

@gillg
Copy link
Contributor

gillg commented Sep 12, 2021

We should all work together to improve the default health check instead as a community.

Moreover, healthcheckextension is now on contrib instead of core, so it's probably more acceptable to make some eventual breaking changes to make it more complete and adapted to any situation.
Also I always don't understand what is specific with ECS. ECS use the native docker mechanism for container healthcheck, based on a shell command executed at container (CMD or CMD-SHELL) level. It can be a curl "ping" or anything else. If the helthckeck is used for a LoadBalancer used by ECS, it's just a ping made by the LB to an http target with a certain error thresold. Does someone can precise the exact need for ECS, missed by the base healthcheck ?

@skyduo skyduo force-pushed the newhealthcheck branch 2 times, most recently from 8321cd0 to ff20722 Compare September 16, 2021 16:55
Copy link
Member

@Aneurysm9 Aneurysm9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to echo the sentiments of @anuraaga and others. I don't see why this is positioned as an ECS health check and not an extension to the general health check. This provides a seemingly general capability and should not take the risk of deterring potential users by mislabeling itself.

@alolita
Copy link
Member

alolita commented Sep 17, 2021

@skyduo i agree with @Aneurysm9 @anuraaga and others that this is a good addition to the existing healthcheck extension. I don't think it needs to be AWS ECS specific. We don't want to be in the business of maintaining custom components wherever possible.

@PettitWesley
Copy link

@alolita @Aneurysm9 @anuraaga @skyduo I agree that this should be an addition/extra feature inside the existing health check extension. IIRC, this is what we first proposed when we first ever discussed our need/use case in a SIG meeting. However, the exact use case of making the health check fail if the exporter has errors was said to be very ECS specific by the folks in that meeting. So we were asked to pursue this route. If I remember correct, @tigrannajaryan and @bogdandrutu recommended that approach. I am happy if the community prefers we put all health check functionality in one extension. But want to double check that all stakeholders have weighed in.

@skyduo
Copy link
Contributor Author

skyduo commented Sep 17, 2021

yeah i have no objection on making it a general one, just like Wesley said if everyone agree with it

@alolita
Copy link
Member

alolita commented Sep 19, 2021

@PettitWesley @skyduo - thx for clarifying. Will bring this up for discussion w maintainers.

// initExporter function could register the customized exporter
func (hc *ecsHealthCheckExtension) initExporter() error {
hc.exporter = newECSHealthCheckExporter()
view.RegisterExporter(hc.exporter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be unregistered when the extension is shut down?

@skyduo
Copy link
Contributor Author

skyduo commented Sep 24, 2021

based on the discussion in the SIG meeting, we will move new health check strategy to the existing health check extension.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 2, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@jpkrohling
Copy link
Member

Should this be closed in favor of the other PR?

@bogdandrutu bogdandrutu removed the Stale label Oct 6, 2021
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 14, 2021
@jpkrohling
Copy link
Member

I'm closing this in favor of the other PR.

@jpkrohling jpkrohling closed this Oct 14, 2021
hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this pull request Jun 2, 2022
Because, most likely for the moment only otlp exporter/receiver is using this package, we can remove this after just one version, no third-party deps.

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants