Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. #62076

Merged
merged 1 commit into from
Apr 6, 2018

Conversation

qingling128
Copy link
Contributor

@qingling128 qingling128 commented Apr 3, 2018

What this PR does / why we need it:

Which issue(s) this PR fixes
Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.
Also Fluentd 0.14 has some memory leak issues that caused the e2e tests to be flaky. Downgrading to v0.12.

Special notes for your reviewer:
We never released any previous version with Fluentd v0.14. Only upgraded it very recently. So this downgrading is not visible to users.

Release note:

Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 3, 2018
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 3, 2018
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 3, 2018
@qingling128
Copy link
Contributor Author

Just signed the CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 3, 2018
@qingling128
Copy link
Contributor Author

CC: @piosz

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 3, 2018
@qingling128
Copy link
Contributor Author

Just for the reference, the original PR where we updated to an image using Fluentd v0.14 and adjusted the config file's parser section is at #59128.

As part of the downgrading to v0.12, we had to revert the parser section config change.

@qingling128
Copy link
Contributor Author

/assign @piosz

@qingling128 qingling128 changed the title Update fluentd_gcp_version to 0.2-1.5.30-1-k8s to include downgrading Fluentd to v0.12. Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12) and remove k8s_compatibility labels. Apr 3, 2018
@qingling128 qingling128 changed the title Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12) and remove k8s_compatibility labels. Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12), remove k8s_compatibility labels and let Logging Agent talk to Metadata Agent. Apr 4, 2018
# resource. The format is:
# 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
"logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
# Rename the field 'log' to a more generic field 'message'. This way the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For gke_container resource, we are sending log field as text_payload after time/severity/stream are extracted. For every other resources, we are sending message field as textPayload if it is the only field.

To assure the same behavior for k8s_container, we can either rename the log field to message in the configuration, or hardcode an exception in the fluent-plugin-google-cloud gem next to the gke_container resource.

I'm slightly leaning towards handling it via configuration instead of hardcoding it in the gem. The only issue is that when there are additional fields (other than time/severity/stream/log), the behavior might be different. For the gke_container resource, we seem to be dropping all other fields (That feel like a bug to me, isn't it?). For the k8s_container resource, we will keep the jsonPayload structure and retain the additional fields.

WDYT? @bmoyles0117 @igorpeshansky

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For gke_container, the implementation and the configuration on the Kubernetes side were carefully crafted to work together. This, as you observed, was very fragile.
The semantics you propose is simple and straightforward. We can document that the logging agent expects the text payload in a message field, and rename log to message in the configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what Igor said.

@x13n
Copy link
Member

x13n commented Apr 4, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 4, 2018
args:
- /bin/sh
- -c
- STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't STACKDRIVER_METADATA_AGENT_URL be set via env?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, double-checking - is this the right port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed STACKDRIVER_METADATA_AGENT_URL to be configurable via an env. PTAL.

Nice catch regarding to the port. We use 8000 in our testing cluster. The actual metadata agent yaml uses 8799 instead. Fixed.

@@ -366,6 +366,11 @@ data:
</record>
</filter>

# Do not collect fluentd's own logs to avoid infinite loops.
<match fluent.**>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean no logs from fluentd will be exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is also the behavior of our default packaging.

It's on our roadmap to support exporting Fluentd logs. But it has to be handled carefully.

The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own log, that triggers infinite loops and leads to agent crashes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ling, have you confirmed that this section does what it tries to do? As far as I'm aware, fluent logs are emitted to stdout/stderr, making this tag meaningless. If you've confirmed this behavior works as intended great, but if it doesn't add any value, I would suggest removing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this indeed takes effect. I encountered infinite loops and this change fixed it.

A sample ingested fluentd log looks like below:

{
 insertId:  "9ppb4lg2hn1rtn"  
 logName:  "projects/stackdriver-kubernetes-1337/logs/fluent.info"  
 receiveTimestamp:  "2018-04-04T01:27:05.638928314Z"  
 resource: {
  labels: {
   cluster_name:  "stackdriver-metadata-e2e"    
   location:  "us-central1-a"    
   node_name:  "gke-stackdriver-metadata-default-pool-c622577e-vlmk"    
   project_id:  "stackdriver-kubernetes-1337"    
  }
  type:  "k8s_node"   
 }
 textPayload:  "following tail of /var/log/containers/metadata-agent-api-availability-test-p469t_metadata-agent-api-availability-test_metadata-agent-api-availability-test-ef878dbc383e0e59b558103f2acfa446c4ca8d8520695460f5764e5a5719b468.log"  
 timestamp:  "2018-04-04T01:27:04Z"  
}

As we can see, Fluentd logs are tagged as fluent.info, fluent.error and fluent.warn.

This is because Fluentd logs are emitted as Fluentd events directly.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 4, 2018
Copy link
Contributor Author

@qingling128 qingling128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All feedback addressed.

  • Grouped fluentd config setups into a setup-fluentd method.
  • Made metadata_agent_url configurable via env var.
  • Added an env var to control whether to ingest logs against old or new resources.

The contributor guide seems to suggest not to squash unless being suggested by reviewers. So I'll leave these commits separate for now.

# resource. The format is:
# 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
"logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
# Rename the field 'log' to a more generic field 'message'. This way the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -366,6 +366,11 @@ data:
</record>
</filter>

# Do not collect fluentd's own logs to avoid infinite loops.
<match fluent.**>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is also the behavior of our default packaging.

It's on our roadmap to support exporting Fluentd logs. But it has to be handled carefully.

The issue right now is that when Fluentd runs into issues ingesting any logs, it emits a Fluentd log entry. When it has issues ingesting its own log, that triggers infinite loops and leads to agent crashes.

args:
- /bin/sh
- -c
- STACKDRIVER_METADATA_AGENT_URL="http://${NODE_NAME}:8000" /usr/sbin/google-fluentd
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed STACKDRIVER_METADATA_AGENT_URL to be configurable via an env. PTAL.

Nice catch regarding to the port. We use 8000 in our testing cluster. The actual metadata agent yaml uses 8799 instead. Fixed.

@@ -0,0 +1,425 @@
# This ConfigMap is used to ingest logs against old resources like
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This file is roughly a copy of fluentd-gcp-configmap.yaml. The only difference is whether to set up local_resource_id and enable talking to Metadata Agent. With those, logs are ingested against k8s_container and k8s_node. Without those, logs are ingested against gke_container and gce_instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what you've done here.

@@ -51,26 +57,18 @@ data:
pos_file /var/log/gcp-containers.log.pos
tag reform.*
read_from_head true
format none
format multi_format
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This entire format change is to revert the changes we made in https://github.com/kubernetes/kubernetes/pull/59128/files#diff-dec77c261fefaa453d67b0c26b4b07c2 to adapt to Fluentd v0.14 syntax. As we downgrade back to v0.12 in this PR, the syntax needs to be reverted as well.

@@ -83,7 +81,19 @@ data:
<match reform.**>
@type record_reformer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This whole record_reformer section sets up the local_resource_id for k8s_container resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

</match>

# Attach local_resource_id for 'k8s_node' monitored resource.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This entire record_reformer section sets up the local_resource_id for k8s_node resource. It's present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

# the necessary resource types when this label is set.
"logging.googleapis.com/k8s_compatibility": "true"
}
# Use Metadata Agent to get monitored resource.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

"logging.googleapis.com/k8s_compatibility": "true"
}
# Use Metadata Agent to get monitored resource.
enable_metadata_agent true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This enable_metadata_agent true configuration is present in fluentd-gcp-configmap.yaml only (not fluentd-gcp-configmap-old.yaml).

@@ -37,6 +37,10 @@ spec:
readOnly: true
- name: config-volume
mountPath: /etc/google-fluentd/config.d
args:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: STACKDRIVER_METADATA_AGENT_URL is set up as an environment variable regardless whether we are ingesting logs via old resources (in which case STACKDRIVER_METADATA_AGENT_URL is not needed) or new resources. Just because there seems to be no need to distinguish the case.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with setting this unconditionally, however, why aren't we setting this as a normal environment variable? Injecting it at this point is actually problematic as it harms the ability to shut down cleanly.

        - env:
          - name: STACKDRIVER_METADATA_AGENT_URL
            value: {{ stackdriver_metadata_agent_url }}

Is what I'm suggesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changed.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 4, 2018
@qingling128 qingling128 changed the title Update fluentd_gcp_version to 0.2-1.5.30-1-k8s (downgrade Fluentd to v0.12), remove k8s_compatibility labels and let Logging Agent talk to Metadata Agent. Add support to ingest log entries against new "k8s_container" and "k8s_node" resources. Apr 4, 2018
@qingling128
Copy link
Contributor Author

/retest

@MaciekPytel
Copy link
Contributor

@qingling128 Please update release note, so it's clear this is specific to Stackdriver and not some general change in k8s logging (current version is way too specific for the target audience of Kubernetes users).

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2018
@qingling128 qingling128 changed the title Add support to ingest log entries against new "k8s_container" and "k8s_node" resources. Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. Apr 6, 2018
@qingling128
Copy link
Contributor Author

@MaciekPytel - Changed the PR title and release note. Should I squash everything into one commit now?

@MaciekPytel
Copy link
Contributor

Oh yes, I completely missed the number of commits. Please squash them.

/hold
So it doesn't merge before squashing.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2018
@qingling128
Copy link
Contributor Author

@MaciekPytel - Squashed.

@kawych
Copy link
Contributor

kawych commented Apr 6, 2018

/lgtm
@x13n is out of office, so approving instead of him as he lgtmed before squashing commits.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, MaciekPytel, qingling128, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qingling128
Copy link
Contributor Author

Thanks @kawych!

@MaciekPytel - PTAL.

@MaciekPytel
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2018
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 6, 2018

@qingling128: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-100-performance 4ac1e9a8ee79b787a2513f82e031b29b7893c6f6 link /test pull-kubernetes-e2e-gce-100-performance
pull-kubernetes-kubemark-e2e-gce-big 4ac1e9a8ee79b787a2513f82e031b29b7893c6f6 link /test pull-kubernetes-kubemark-e2e-gce-big

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@qingling128
Copy link
Contributor Author

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@qingling128
Copy link
Contributor Author

Finally made it into the queue (there was some flakiness of the test) and merged. A cherrypick PR has been created as well.

@StevenACoffman
Copy link
Contributor

@qingling128 Thank you for your work on this. What is the issue tracking the fluentd memory leak so we can know when to try to upgrade again?

k8s-github-robot pushed a commit that referenced this pull request Apr 11, 2018
…62076-origin-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #62076: Add support to ingest log entries to Stackdriver against new

Cherry pick of #62076 on release-1.10.

#62076: Add support to ingest log entries to Stackdriver against new
@StevenACoffman
Copy link
Contributor

@monotek are you aware of any issue tracking the fluentd memory leak referred to here?

@monotek
Copy link
Member

monotek commented Apr 12, 2018

No.
Is there a bug report at fluentd about it?

@qingling128
Copy link
Contributor Author

@StevenACoffman @monotek - I've created an internal ticket to track it. Will report it with Fluentd as well and link it here. (Meant to do it earlier, but got distracted by the cherrypick issues)

@qingling128
Copy link
Contributor Author

Created fluent/fluentd#1941 to track the memory leak.

@StevenACoffman
Copy link
Contributor

StevenACoffman commented Apr 12, 2018

Hmmm... I am not seeing that with latest (only running for a few days) and there have been several leak fixes in versions in between:

@qingling128
Copy link
Contributor Author

@StevenACoffman - That sounds promising! We are also setting up some soak test for various Fluentd versions including the latest (v1.1). Some additional testing work is needed from our side to bump a major Fluentd version. We might not have resources for it this quarter, but it's definitely on our roadmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.