High Memory on Windows Server 2019 Core running version 2.29.0 #1182

Closed
zdunning13 opened this issue Mar 31, 2023 · 18 comments

@zdunning13

When running the Google Cloud Ops Agent version 2.29.0 or 2.27.0 we are seeing high memory utilization on Windows 2019 Core servers. The memory consumption is coming from fluent-bit.exe which is installed with the Ops Agent.

Memory utilization is stable when running on version 2.14.0.

Steps to reproduce the behavior:

  1. Build custom Windows Server 2019 Core image based on the latest stable Google Cloud provided image. Include Ops Agent 2.27.0 or 2.29.0 in the image with no custom config
  2. Create a GCE VM with the custom image & start it
  3. Start the Ops Agent at VM startup
  4. We see a spike in memory on the VM from fluent-bit.exe
  5. Restarting the Ops Agent drops memory utilization immediately.
  • VM distro / OS - Windows Server 2019 Core
  • Ops Agent version - 2.27.0 & 2.29.0
  • Ops Agent configuration - no custom config
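
For anyone trying to confirm the same symptom, here is a minimal sketch for reading the memory used by fluent-bit.exe from the Windows `tasklist` command. It assumes `tasklist` is run with `/FO CSV /NH` and that "Mem Usage" is the fifth column; the function names and the sample row are hypothetical and not taken from this issue.

```python
import csv
import io
import subprocess


def parse_tasklist_mem_kib(csv_text: str) -> int:
    """Sum the 'Mem Usage' column (in K) over all rows of tasklist CSV output."""
    total = 0
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 5:
            continue  # skip blank lines and "INFO: No tasks ..." messages
        # Keep only digits so locale-specific thousands separators are ignored
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits:
            total += int(digits)
    return total


def fluent_bit_mem_kib() -> int:
    """Query tasklist for fluent-bit.exe (Windows only, not called here)."""
    out = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq fluent-bit.exe", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tasklist_mem_kib(out)


# Hypothetical sample row for illustration only
SAMPLE = '"fluent-bit.exe","4321","Services","0","1,234,567 K"\n'
print(parse_tasklist_mem_kib(SAMPLE))
```

Calling `fluent_bit_mem_kib()` periodically on an affected VM, before and after restarting the Ops Agent, would show whether the drop described in step 5 is reproducible.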
@zdunning13 zdunning13 changed the title from "High Memory on 2.29.0" to "High Memory on Windows Server 2019 Core running version 2.29.0" Mar 31, 2023
@braydonk
Contributor

Hi @zdunning13 thanks for the report. I would recommend opening a support case for this, as going through the customer support channels is the best way we can gather the info we'd need to determine the issue. If you do open a support case, please respond back here with the case number.

@zdunning13
Author

Thanks @braydonk. The Google Case number is 44256464.

@braydonk braydonk added the customer-case-attached Issues that have a customer case number attached to them for Googlers to investigate label Mar 31, 2023
@braydonk braydonk self-assigned this Mar 31, 2023
@braydonk
Contributor

Thanks! This will be a big help for the investigation. I've started working on this, and will update the support case with more information as soon as I have anything to share.

@22trevon

22trevon commented Apr 3, 2023

Created a support case as well with this issue. The GCP engineering team is aware of this.
For me, this makes my VMs' memory max out, and no new RDP connections can be established. The VMs have to be rebooted and the Ops Agent disabled.

@braydonk
Contributor

braydonk commented Apr 3, 2023

Thanks @22trevon, could you please send the case number here? We're getting a bunch of reports, so we're gathering them together. Please ensure that you leave at least one VM as an example on your project so that you can run the diagnostic script and so we can take a look at Ops Agent metrics.

In the meantime, you can try downgrading to Ops Agent 2.26, which should be okay. I suspect this is a regression from our upgrade to Fluent Bit major version 2, but 2.26 is still on 1.9.8.

@PhilBrammer

PhilBrammer commented Apr 3, 2023

Beware, 2.26 has a Windows time zone bug, which is super important to fix, btw. Windows event logs ingest just fine, but they are being recorded in the future, which is a monitoring problem.

@braydonk
Contributor

braydonk commented Apr 3, 2023

Thanks @PhilBrammer, yes if you have a custom parser with timestamps in your Ops Agent config on Windows then this bug will still be present in Ops Agent versions before the Fluent Bit 2 upgrade.

The customer case I do have access to mentions that this is with no custom config in the Ops Agent, so if that's the case for anyone else the timezone bug shouldn't affect you.

@PhilBrammer

PhilBrammer commented Apr 3, 2023

No. A standard Ops Agent > 2.15 and < 2.27 will report Windows event logs in local time (but skewed by GMT).

@braydonk
Contributor

braydonk commented Apr 3, 2023

Right, my mistake. Any VM with a timezone other than UTC was affected by this. So if this applies, downgrading to 2.14 is the best option, which is also what the original comment on this issue suggested they were using previously.

@PhilBrammer

@braydonk and others that can offer help here: The Ops Agent known issues page needs to reflect this Windows time zone bug. Noting that here in case someone reading can make that change. New GCP customers pulling down 2.27 (latest) and being told to pin to 2.26 or older is going to cause issues for these customers if they are using Windows servers with non-UTC time zones.

@jefferbrecht
Member

The Windows time zone issue is already documented: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/troubleshoot-run-ingest#event_log_timestamps_are_wrong_on_windows

That issue has existed since the beginning (including 2.16 through 2.26), so the choice at the moment is to either use a newer version with the memory leak or an older version with the time zone issue (with the documented mitigations if needed). Once the fix for the leak is released then affected folks can upgrade to that version to get both fixes.

@PhilBrammer

I'm not sure what you mean by "since the beginning." 2.14 and lower report correctly, which might be worth calling out in that doc.

To be clear, pinning to 2.14 solves both the memory problem and the Windows TZ bug.

@jefferbrecht
Member

No version before 2.27 has ever parsed time zones correctly. The bug has been present in fluent-bit since our initial release, so pinning to 2.14 will not solve the time zone bug.

A separate bug causes Windows Event Logs to not parse the timestamp at all on 2.14, which makes it default to the exporter's default time-of-ingress, which avoids the parsing issue and has the correct time zone -- but the timestamps are wrong anyway since they're not parsed.

Any custom timestamp parsers using parse_json or parse_regex that include time zones will not work correctly on 2.14.

@PhilBrammer

PhilBrammer commented Apr 13, 2023

This is completely incorrect. We spin up Windows Server images from GCP (their images; not custom) and deploy 2.14 Ops Agent and do not have a time zone problem. Going above 2.14 introduces it.

I guess in the end it is semantics. With 2.14 or lower, you'll see your Windows logs in Stackdriver with the correct time. Above 2.14, you'll have to wait your UTC offset in hours to see them, or look ahead with your timestamp filter. So mechanics aside, user experience matters.

@jefferbrecht
Member

We might be talking about different things. I don't want to lead users into believing that 2.14.0 is completely free of time zone issues, so let's make sure we're talking about the same thing.

Time zone parsing has never worked correctly prior to 2.27.0. This is trivially demonstrable using a parse_json or parse_regex processor with %z as I already mentioned. For example, using parse_json with time_format: "%Y-%m-%dT%H:%M:%S %z", and an input message {"message":"foo","time":"2023-04-13T13:37:00 -0500"}, the resulting timestamp in Cloud Logging is wrong on 2.14.0.
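
As a reference point for the example above, here is a minimal sketch that computes the UTC instant a correct `%z` parse should produce for that input. It uses Python's `strptime`, not Fluent Bit's parser, so it only shows the expected result; the helper name `to_utc` is hypothetical.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S %z"  # time_format quoted in the comment
RAW = "2023-04-13T13:37:00 -0500"      # "time" field of the example message


def to_utc(raw: str, fmt: str = TIME_FORMAT) -> datetime:
    """Parse a timestamp with an explicit UTC offset and convert it to UTC."""
    parsed = datetime.strptime(raw, fmt)  # %z yields an offset-aware datetime
    return parsed.astimezone(timezone.utc)


print(to_utc(RAW).isoformat())  # 2023-04-13T18:37:00+00:00
```

Any other timestamp for this entry in Cloud Logging would indicate that the offset was not applied.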

image

You are probably referring to the built-in Event Logs receiver. Version 2.14.0 does not parse Event Log timestamps in the first place (because of a different bug), so the time zone issue is hidden from windows_event_log receivers on that version. But because timestamps are not parsed, the time is wrong anyway: it does not reflect TimeGenerated as it should.

image

The example above shows only a couple seconds of error, which reflects the latency between generation and the Ops Agent picking it up. Importantly, if the Ops Agent is offline for some time, that latency will grow and cause the timestamps to be increasingly wrong once they're eventually picked up.
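
To make the latency effect concrete, here is a small sketch of the error that results when the time of ingestion stands in for the event's own time. The event time and delays are hypothetical values chosen for illustration, and `ingest_time_error` is not an Ops Agent function.

```python
from datetime import datetime, timedelta, timezone


def ingest_time_error(time_generated: datetime, picked_up_at: datetime) -> timedelta:
    """Error introduced when the ingest time replaces the event's own time."""
    return picked_up_at - time_generated


# Hypothetical event time, chosen only for illustration
generated = datetime(2023, 4, 13, 12, 0, 0, tzinfo=timezone.utc)

# Agent running: the event is picked up a couple of seconds later
print(ingest_time_error(generated, generated + timedelta(seconds=2)))  # 0:00:02

# Agent offline for three hours: the same event is stamped three hours late
print(ingest_time_error(generated, generated + timedelta(hours=3)))  # 3:00:00
```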

From 2.15.0 onwards, the bug that prevented Event Log timestamps from being parsed at all was fixed, which meant that the time zone parsing bug became unhidden. They're two separate issues.

Agreed that having it be wrong by only a couple seconds (on <= 2.14.0) is a better user experience than being wrong by several hours (on > 2.14.0), but we need to consider that many users are operating outside just the built-in use-case. Users with custom timestamp parsers, or users who need to turn off the Ops Agent sometimes, are still significantly affected on 2.14.0, because 2.14.0 does not completely solve the TZ bug.

@PhilBrammer

PhilBrammer commented Apr 13, 2023

Latency aside (that's the difference between generated and log "timestamp" unless overwritten in the parser), 2.14 logs show up in the correct TZ.
Screenshot 2023-04-13 at 1 52 29 PM

The "timestamp" field being in UTC (Z) is not a problem. Logging will adjust that per the UI settings. And further, the calculation is correct.

However, above 2.14, the log timestamps carry the local wall-clock time but are labeled as UTC (Z). This causes "missing" logs, because when adjusted for US time zones, the logs appear in the past.

For example, this log was generated in Windows at 2:50 PM US Central Time (UTC-5) on Ops Agent 2.16. Note that "receiveTimestamp" is correct, but Stackdriver doesn't use that field. "timestamp" is 14:50Z, when it should be 19:50Z.
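
The arithmetic in this example can be sketched as follows. The calendar date is an assumption (the date of the comment), and "buggy" here just models a local wall-clock time being labeled as UTC; it is not the agent's actual code path.

```python
from datetime import datetime, timedelta, timezone

CENTRAL = timezone(timedelta(hours=-5))  # UTC-5, as stated in the comment

# 2:50 PM US Central; the date is assumed for illustration
local_event = datetime(2023, 4, 13, 14, 50, tzinfo=CENTRAL)

correct = local_event.astimezone(timezone.utc)    # offset applied
buggy = local_event.replace(tzinfo=timezone.utc)  # wall clock kept, labeled UTC

print(correct.strftime("%H:%MZ"))  # 19:50Z
print(buggy.strftime("%H:%MZ"))    # 14:50Z

# Negative skew: the entry lands five hours in the past
skew_hours = (buggy - correct).total_seconds() / 3600
print(skew_hours)  # -5.0
```

For a VM in a time zone ahead of UTC the sign flips, and entries land in the future instead.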

Screenshot 2023-04-13 at 2 51 08 PM

This test machine was built at 2:39 PM US Central Time, and you can see the logs showing up in the histogram in the past.
Screenshot 2023-04-13 at 2 55 16 PM

@PhilBrammer

PhilBrammer commented Apr 13, 2023

I get what you're saying about how mapping the event timestamp to the log timestamp solves this problem, but for now I'm just trying to help out other customers who can't run the latest Ops Agent because of the memory issue. They'll likely run into log time zone issues if they are using Windows defaults and are pinning an older Ops Agent > 2.14.

@braydonk
Contributor

The memory problem should be resolved in the newly available Ops Agent 2.30 release! 🎉

Thanks everyone for your patience as I tracked this down. The incident dashboard should reflect the new status soon.

Those interested in technical details can find them in the upstream issues:
fluent/fluent-bit#6748
monkey/monkey#390
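
For readers choosing a version to pin on Windows, here is a rough sketch that encodes the version ranges as described in this thread. It assumes every release from 2.27 up to but not including 2.30 has the memory leak (only 2.27.0 and 2.29.0 were reported here), and the issue labels and function name are hypothetical.

```python
from typing import List


def known_issues(version: str) -> List[str]:
    """Return the issues discussed in this thread that apply to a version."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    issues = []
    if major_minor < (2, 27):
        # Time zone parsing was described as broken before 2.27.0
        issues.append("tz-parsing")
    if major_minor <= (2, 14):
        # Event Log timestamps are not parsed; ingest time is used instead
        issues.append("event-log-timestamps-unparsed")
    if (2, 27) <= major_minor < (2, 30):
        # Assumed range for the fluent-bit memory leak
        issues.append("memory-leak")
    return issues


for v in ("2.14.0", "2.26.0", "2.29.0", "2.30.0"):
    print(v, known_issues(v))
```

This is a summary of the comments above rather than an official compatibility matrix; the documented known issues remain the authoritative source.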
