Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ECI-466] Metric Forwarding In Go #40

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

sva91
Copy link
Contributor

@sva91 sva91 commented Feb 12, 2025

What

-Write Metrics Forwarder in Go

Steps to build multi-architecture image

  • Enable buildx : docker buildx create --use
  • Build and push image : docker buildx build --platform linux/amd64,linux/arm64 -t iad.ocir.io/iduhc9hzgn3o/metrics-forwarder-function/metrics-forwarder:0.0.3 . --no-cache --push

Why

  • Onboarding improvements

Testing

  • Unit tests
  • Staging Logs
  • Using the same multi-platform image with function application of ARM and x86 architectures
  • Invoked both the images through connector hub and it works fine

@sva91 sva91 added the eci label Feb 12, 2025
@sva91 sva91 requested a review from a team February 12, 2025 22:30
@sva91 sva91 marked this pull request as ready for review February 12, 2025 22:31
COPY . .

# Enable Go modules & build with optimizations
RUN go mod tidy && \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support X86 as well?
I can't remember if OCI in Japan supports only ARM or only X86?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Japan only supports x86 but I think we said with Kanishk's stack improvements we won't need to worry about it later on

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well... we don't need to worry about it if we build for X86 and have multi arch images.
So I would say here we do need to build it as both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the image to build for both platforms. Also updated the PR description to have instructions on how to build such an image, and the testing I have done to ensure the same image works on function application of both architectures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the multi-arch image build is managed outside the docker file using buildx so that will be a separate step

}

func sendMetricsToDatadog(metricsMessage []byte, client HTTPClient) error {
endpoint := os.Getenv("DD_INTAKE_HOST")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this was the way in the Python version as well, so we can go with this for now...
Should we consider the DD API key a secret?
From our perspective maybe we don't care (i.e. we charge more if someone sends more stuff to us) but from the customer's standpoint?
Anyway not a blocker for this PR by any means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think DD API Key should be considered a secret, but I don't fully follow what changes are you recommending based on that. Are you suggesting that this should not be a configuration for function application, but sourced from somewhere else (OCI vault etc.) ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I am saying, but of course it is not for this PR and maybe not even something we can/should do at all.
I am not recommending any changes today, but I think we should think about it as a team for the future.

Copy link
Contributor Author

@sva91 sva91 Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I definitely agree that this may require some discussion, as well as some experimentation. In my opinion, it is certainly a good idea to add API-KEY and APP-KEY to the customer's vault during initial stack execution, and replicate it in all regions at that time. We can then use it everywhere we need this.
I think this will be really nice for longer term. We do have to think about how to handle key rotations on expiration.

if shouldCompressPayload() {
compressedPayload, err := compressPayload(metricsMessage)
if err != nil {
log.Printf("Error compressing payload: %v", err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not a blocker for this version, but ultimately I think we should report this kind of error to us via the EvP endpoint, so that we can monitor when errors occur in the fleet and fix them proactively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes complete sense, but I think that should be a separate PR with focus on "relaying information from forwarders to Datadog", so that we can think about holistically.

Copy link
Contributor

@bengillmandd bengillmandd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm just had some nitpicks

@sva91 sva91 requested a review from bengillmandd February 18, 2025 19:50
@sva91 sva91 changed the title Metric Forwarding In Go [ECI-466] Metric Forwarding In Go Feb 18, 2025

func CreateDatadogClient(site, apiKey string) DatadogClient {
configuration := datadog.NewConfiguration()
configuration.RetryConfiguration.EnableRetry = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to set max Retries and the retry timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default setting for these values are as follows (Link):

RetryConfiguration: RetryConfiguration{
	EnableRetry:       false,
	BackOffMultiplier: 2,
	BackOffBase:       2,
	HTTPRetryTimeout:  60 * time.Second,
	MaxRetries:        3,
}

If we want to change them, we can override, but they looked fine to me as is.

return err
}
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
log.Printf("Error: Received non-200 response from Datadog: %d", resp.StatusCode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use log or fmt here considering we're using fmt to print the logs in other places. Same in handler.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I experimented with both. Here is how these different log packages are treated in OCI.

  • Log package : These are written to STDERR by default. You will see "src" : "STDERR" in the log json.
  • fmt package : These are written to STDOUT by default. You will see "src" : "STDOUT" in the log json.

If a log is an error, I used the log package, else I used the fmt package.

"metrics-forwarder/internal/formatter"
)

var sendMetricsFunc = client.SendMetricsToDatadog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can call datadog.NewAPIClient since this client is what is used for connection and as var can be re-used if the container is not destroyed.
Internally we can also decide the size of connection pool with this which is again created once

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I fully understand this comment. I am happy to chat more about this.

if status == "success" {
fdk.WriteStatus(out, 200)
} else {
fdk.WriteStatus(out, 500)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

500 status may not be accurate for scenarios like:

  • Invalid API key when getting a 403 response from the endpoint
  • missing API key, site or tenancy OCID (more closer to a 400 error)

Copy link
Contributor Author

@sva91 sva91 Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I am considering the response from function invocations as a pass or fail only. The status code from Datadog itself will be printed in OCI logs.

This particular status code becomes relevant if the function itself is placed behind an API Gateway, and then invoked through an API call. Now, for the client of the function itself, if an internal API call is failing, should we mask that or relay that information as is, could be debated.

Let me know if you want to have a discussion about this.

// Returns:
// - error: An error if the request preparation or API call fails, or if the
// response status code indicates a failure.
func SendMetricsToDatadog(client DatadogClient, metricsMessage []byte) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be a method of DatadogClient ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I fully understand the intent behind making this a function on struct, instead of passing the struct. This way, it helps me test the send function by creating a mock client.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants