-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. #62076
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Just signed the CLA. |
CC: @piosz |
Just for the reference, the original PR where we updated to an image using Fluentd v0.14 and adjusted the config file's parser section is at #59128. As part of the downgrading to v0.12, we had to revert the parser section config change. |
/assign @piosz |
# resource. The format is: | ||
# 'k8s_container.<namespace_name>.<pod_name>.<container_name>'. | ||
"logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"} | ||
# Rename the field 'log' to a more generic field 'message'. This way the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For gke_container
resource, we are sending log
field as text_payload
after time
/severity
/stream
are extracted. For every other resources, we are sending message
field as textPayload if it is the only field.
To assure the same behavior for k8s_container
, we can either rename the log
field to message
in the configuration, or hardcode an exception in the fluent-plugin-google-cloud gem next to the gke_container
resource.
I'm slightly leaning towards handling it via configuration instead of hardcoding it in the gem. The only issue is that when there are additional fields (other than time
/severity
/stream
/log
), the behavior might be different. For the gke_container
resource, we seem to be dropping all other fields (That feel like a bug to me, isn't it?). For the k8s_container
resource, we will keep the jsonPayload structure and retain the additional fields.
WDYT? @bmoyles0117 @igorpeshansky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For gke_container
, the implementation and the configuration on the Kubernetes side were carefully crafted to work together. This, as you observed, was very fragile.
The semantics you propose is simple and straightforward. We can document that the logging agent expects the text payload in a message
field, and rename log
to message
in the configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what Igor said.
/ok-to-test |
args: | ||
- /bin/sh | ||
- -c | ||
- STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't STACKDRIVER_METADATA_AGENT_URL be set via env?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, double-checking - is this the right port?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed STACKDRIVER_METADATA_AGENT_URL
to be configurable via an env. PTAL.
Nice catch regarding to the port. We use 8000
in our testing cluster. The actual metadata agent yaml uses 8799
instead. Fixed.
@@ -366,6 +366,11 @@ data: | |||
</record> | |||
</filter> | |||
|
|||
# Do not collect fluentd's own logs to avoid infinite loops. | |||
<match fluent.**> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean no logs from fluentd will be exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is also the behavior of our default packaging.
It's on our roadmap to support exporting Fluentd logs. But it has to be handled carefully.
The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own log, that triggers infinite loops and leads to agent crashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ling, have you confirmed that this section does what it tries to do? As far as I'm aware, fluent logs are emitted to stdout/stderr, making this tag meaningless. If you've confirmed this behavior works as intended great, but if it doesn't add any value, I would suggest removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this indeed takes effect. I encountered infinite loops and this change fixed it.
A sample ingested fluentd log looks like below:
{
insertId: "9ppb4lg2hn1rtn"
logName: "projects/stackdriver-kubernetes-1337/logs/fluent.info"
receiveTimestamp: "2018-04-04T01:27:05.638928314Z"
resource: {
labels: {
cluster_name: "stackdriver-metadata-e2e"
location: "us-central1-a"
node_name: "gke-stackdriver-metadata-default-pool-c622577e-vlmk"
project_id: "stackdriver-kubernetes-1337"
}
type: "k8s_node"
}
textPayload: "following tail of /var/log/containers/metadata-agent-api-availability-test-p469t_metadata-agent-api-availability-test_metadata-agent-api-availability-test-ef878dbc383e0e59b558103f2acfa446c4ca8d8520695460f5764e5a5719b468.log"
timestamp: "2018-04-04T01:27:04Z"
}
As we can see, Fluentd logs are tagged as fluent.info
, fluent.error
and fluent.warn
.
This is because Fluentd logs are emitted as Fluentd events directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All feedback addressed.
- Grouped fluentd config setups into a setup-fluentd method.
- Made metadata_agent_url configurable via env var.
- Added an env var to control whether to ingest logs against old or new resources.
The contributor guide seems to suggest not to squash unless being suggested by reviewers. So I'll leave these commits separate for now.
# resource. The format is: | ||
# 'k8s_container.<namespace_name>.<pod_name>.<container_name>'. | ||
"logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"} | ||
# Rename the field 'log' to a more generic field 'message'. This way the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
@@ -366,6 +366,11 @@ data: | |||
</record> | |||
</filter> | |||
|
|||
# Do not collect fluentd's own logs to avoid infinite loops. | |||
<match fluent.**> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is also the behavior of our default packaging.
It's on our roadmap to support exporting Fluentd logs. But it has to be handled carefully.
The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own log, that triggers infinite loops and leads to agent crashes.
args: | ||
- /bin/sh | ||
- -c | ||
- STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed STACKDRIVER_METADATA_AGENT_URL
to be configurable via an env. PTAL.
Nice catch regarding to the port. We use 8000
in our testing cluster. The actual metadata agent yaml uses 8799
instead. Fixed.
@@ -0,0 +1,425 @@ | |||
# This ConfigMap is used to ingest logs against old resources like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This file is roughly a copy of fluentd-gcp-configmap.yaml
. The only difference is whether to set up local_resource_id
and enable talking to Metadata Agent. With those, logs are ingested against k8s_container
and k8s_node
. Without those, logs are ingested against gke_container
and gce_instance
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what you've done here.
@@ -51,26 +57,18 @@ data: | |||
pos_file /var/log/gcp-containers.log.pos | |||
tag reform.* | |||
read_from_head true | |||
format none | |||
format multi_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This entire format
change is to revert the changes we made in https://github.com/kubernetes/kubernetes/pull/59128/files#diff-dec77c261fefaa453d67b0c26b4b07c2 to adapt to Fluentd v0.14
syntax. As we downgrade back to v0.12
in this PR, the syntax needs to be reverted as well.
@@ -83,7 +81,19 @@ data: | |||
<match reform.**> | |||
@type record_reformer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This whole record_reformer
section sets up the local_resource_id
for k8s_container
resource. It's present in fluentd-gcp-configmap.yaml
only (not fluentd-gcp-configmap-old.yaml
).
</match> | ||
|
||
# Attach local_resource_id for 'k8s_node' monitored resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This entire record_reformer
section sets up the local_resource_id
for k8s_node
resource. It's present in fluentd-gcp-configmap.yaml
only (not fluentd-gcp-configmap-old.yaml
).
# the necessary resource types when this label is set. | ||
"logging.googleapis.com/k8s_compatibility": "true" | ||
} | ||
# Use Metadata Agent to get monitored resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This enable_metadata_agent true
configuration is present in fluentd-gcp-configmap.yaml
only (not fluentd-gcp-configmap-old.yaml
).
"logging.googleapis.com/k8s_compatibility": "true" | ||
} | ||
# Use Metadata Agent to get monitored resource. | ||
enable_metadata_agent true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This enable_metadata_agent true
configuration is present in fluentd-gcp-configmap.yaml
only (not fluentd-gcp-configmap-old.yaml
).
@@ -37,6 +37,10 @@ spec: | |||
readOnly: true | |||
- name: config-volume | |||
mountPath: /etc/google-fluentd/config.d | |||
args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: STACKDRIVER_METADATA_AGENT_URL
is set up as an environment variable regardless whether we are ingesting logs via old resources (in which case STACKDRIVER_METADATA_AGENT_URL
is not needed) or new resources. Just because there seems to be no need to distinguish the case.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with setting this unconditionally, however, why aren't we setting this as a normal environment variable? Injecting it at this point is actually problematic as it harms the ability to shut down cleanly.
- env:
- name: STACKDRIVER_METADATA_AGENT_URL
value: {{ stackdriver_metadata_agent_url }}
Is what I'm suggesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Changed.
/retest |
@qingling128 Please update release note, so it's clear this is specific to Stackdriver and not some general change in k8s logging (current version is way too specific for the target audience of Kubernetes users). /approve |
@MaciekPytel - Changed the PR title and release note. Should I squash everything into one commit now? |
Oh yes, I completely missed the number of commits. Please squash them. /hold |
…tainer" and "k8s_node" resources.
@MaciekPytel - Squashed. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kawych, MaciekPytel, qingling128, x13n The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks @kawych! @MaciekPytel - PTAL. |
/hold cancel |
@qingling128: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/test pull-kubernetes-e2e-kops-aws |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
Finally made it into the queue (there was some flakiness of the test) and merged. A cherrypick PR has been created as well. |
@qingling128 Thank you for your work on this. What is the issue tracking the fluentd memory leak so we can know when to try to upgrade again? |
@monotek are you aware of any issue tracking the fluentd memory leak referred to here? |
No. |
@StevenACoffman @monotek - I've created an internal ticket to track it. Will report it with Fluentd as well and link it here. (Meant to do it earlier, but got distracted by the cherrypick issues) |
Created fluent/fluentd#1941 to track the memory leak. |
Hmmm... I am not seeing that with latest (only running for a few days) and there have been several leak fixes in versions in between: |
@StevenACoffman - That sounds promising! We are also setting up some soak test for various Fluentd versions including the latest (v1.1). Some additional testing work is needed from our side to bump a major Fluentd version. We might not have resources for it this quarter, but it's definitely on our roadmap. |
What this PR does / why we need it:
Which issue(s) this PR fixes
Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.
Also Fluentd 0.14 has some memory leak issues that caused the e2e tests to be flaky. Downgrading to v0.12.
Special notes for your reviewer:
We never released any previous version with Fluentd v0.14. Only upgraded it very recently. So this downgrading is not visible to users.
Release note: