High Memory on Windows Server 2019 Core running version 2.29.0 #1182

zdunning13 · 2023-03-31T15:10:05Z

When running the Google Cloud Ops Agent version 2.29.0 or 2.27.0 we are seeing high memory utilization on Windows 2019 Core servers. The memory consumption is coming from fluent-bit.exe which is installed with the Ops Agent.

Memory utilization is stable when running on version 2.14.0.

Steps to reproduce the behavior:

1. Build a custom Windows Server 2019 Core image based on the latest stable Google Cloud provided image. Include Ops Agent 2.27.0 or 2.29.0 in the image with no custom config.
2. Create a GCE VM with the custom image and start it.
3. Start the Ops Agent at VM startup.
We see a spike in memory on the VM from fluent-bit.exe
Restarting the Ops Agent drops memory utilization immediately.
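
To put a number on the fluent-bit.exe memory described above, one option is to parse the CSV output of the built-in Windows command `tasklist /FI "IMAGENAME eq fluent-bit.exe" /FO CSV /NH`. The following is a minimal sketch, not an official diagnostic: the function name and the sample row are hypothetical, and it assumes the en-US "Mem Usage" formatting (for example `1,234,567 K`).

```python
import csv
import io


def fluent_bit_memory_kb(tasklist_csv: str) -> int:
    """Sum the "Mem Usage" column (in KB) for all fluent-bit.exe rows.

    Expects the output of:
        tasklist /FI "IMAGENAME eq fluent-bit.exe" /FO CSV /NH
    whose columns are: Image Name, PID, Session Name, Session#, Mem Usage.
    """
    total = 0
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Skip informational lines such as "INFO: No tasks are running ...".
        if len(row) < 5 or row[0].lower() != "fluent-bit.exe":
            continue
        # Keep only the digits of values like "1,234,567 K".
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits:
            total += int(digits)
    return total


# Made-up sample row for illustration; not captured from an affected VM.
SAMPLE = '"fluent-bit.exe","4321","Services","0","1,234,567 K"\n'
print(fluent_bit_memory_kb(SAMPLE))
```

Running this periodically before and after restarting the Ops Agent would show whether the value keeps growing and then drops, as reported here.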

VM distro / OS - Windows Server 2019 Core
Ops Agent version - 2.27.0 & 2.29.0
Ops Agent configuration - no custom config


braydonk · 2023-03-31T15:17:20Z

Hi @zdunning13 thanks for the report. I would recommend opening a support case for this, as going through the customer support channels is the best way we can gather the info we'd need to determine the issue. If you do open a support case, please respond back here with the case number.

zdunning13 · 2023-03-31T19:02:07Z

Thanks @braydonk. The Google Case number is 44256464.

braydonk · 2023-03-31T19:04:52Z

Thanks! This will be a big help for the investigation. I've started working on this, and will update the support case with more information as soon as I have anything to share.

22trevon · 2023-04-03T17:13:12Z

Created a support case as well with this issue. The GCP engineering team is aware of this.
For me, this maxes out my VMs' memory, and no new RDP connections can be established. The VMs have to be rebooted and the Ops Agent disabled.

braydonk · 2023-04-03T17:15:15Z

Thanks @22trevon, could you please send the case number here? We're getting a bunch of reports, so we're gathering them together. Please ensure that you leave at least one VM as an example in your project so that you can run the diagnostic script and we can take a look at Ops Agent metrics.

In the meantime, you can try downgrading to Ops Agent 2.26, which should be okay. I suspect this is a regression from our upgrade to Fluent Bit major version 2, but 2.26 is still on 1.9.8.

PhilBrammer · 2023-04-03T17:42:42Z

Beware, 2.26 has a Windows time zone bug, which is super important to fix, btw. Windows event logs ingest just fine, but they are being recorded in the future, which is a monitoring problem.

braydonk · 2023-04-03T19:15:38Z

Thanks @PhilBrammer, yes if you have a custom parser with timestamps in your Ops Agent config on Windows then this bug will still be present in Ops Agent versions before the Fluent Bit 2 upgrade.

The customer case I do have access to mentions that this is with no custom config in the Ops Agent, so if that's the case for anyone else the timezone bug shouldn't affect you.

PhilBrammer · 2023-04-03T19:24:53Z

No. A standard Ops Agent > 2.15 and < 2.27 will report Windows event logs in local time (but skewed by GMT).

braydonk · 2023-04-03T19:40:53Z

Right, my mistake. Any VM with a timezone other than UTC was affected by this. So if this applies, downgrading to 2.14 is the best option, which is also what the original comment on this issue suggested they were using previously.

PhilBrammer · 2023-04-13T01:26:50Z

@braydonk and others that can offer help here: The Ops Agent known issues page needs to reflect this Windows time zone bug. Noting that here in case someone reading can make that change. New GCP customers pulling down 2.27 (latest) and being told to pin to 2.26 or older is going to cause issues for these customers if they are using Windows servers with non-UTC time zones.

jefferbrecht · 2023-04-13T13:57:54Z

The Windows time zone issue is already documented: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/troubleshoot-run-ingest#event_log_timestamps_are_wrong_on_windows

That issue has existed since the beginning (including 2.16 through 2.26), so the choice at the moment is to either use a newer version with the memory leak or an older version with the time zone issue (with the documented mitigations if needed). Once the fix for the leak is released then affected folks can upgrade to that version to get both fixes.

PhilBrammer · 2023-04-13T17:15:38Z

I'm not sure what you mean by "since the beginning." 2.14 and lower report correctly, and might be worth calling out in that doc.

To be clear, pinning to 2.14 solves both the memory problem and the Windows TZ bug.

jefferbrecht · 2023-04-13T17:38:35Z

No version before 2.27 has ever parsed time zones correctly. The bug has been present in fluent-bit since our initial release, so pinning to 2.14 will not solve the time zone bug.

A separate bug causes Windows Event Logs to not parse the timestamp at all on 2.14, which makes it default to the exporter's default time-of-ingress, which avoids the parsing issue and has the correct time zone -- but the timestamps are wrong anyway since they're not parsed.

Any custom timestamp parsers using parse_json or parse_regex that include time zones will not work correctly on 2.14.

PhilBrammer · 2023-04-13T17:52:38Z

This is completely incorrect. We spin up Windows Server images from GCP (their images; not custom) and deploy 2.14 Ops Agent and do not have a time zone problem. Going above 2.14 introduces it.

I guess in the end it is semantics. With 2.14 or lower, you'll see your Windows logs in Stackdriver with the correct time. Above 2.14, you'll have to wait your UTC offset in hours to see them, or look ahead with your timestamp filter. So mechanics aside, user experience matters.

jefferbrecht · 2023-04-13T18:42:55Z

We might be talking about different things. I don't want to lead users into believing that 2.14.0 is completely free of time zone issues, so let's make sure we're talking about the same thing.

Time zone parsing has never worked correctly prior to 2.27.0. This is trivially demonstrable using a parse_json or parse_regex processor with %z as I already mentioned. For example, using parse_json with time_format: "%Y-%m-%dT%H:%M:%S %z", and an input message {"message":"foo","time":"2023-04-13T13:37:00 -0500"}, the resulting timestamp in Cloud Logging is wrong on 2.14.0.
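
For reference, the value a correct parse of that example should produce can be worked out with Python's standard `datetime.strptime`, which honors `%z`. This is only a sketch of the expected result; it does not reproduce Fluent Bit's or the Ops Agent's parser, and the variable names are illustrative.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S %z"    # same format string as in the example
RAW_TIME = "2023-04-13T13:37:00 -0500"  # "time" field of the example message

# strptime honors %z, so the parsed value carries the -05:00 offset.
parsed = datetime.strptime(RAW_TIME, TIME_FORMAT)

# Convert to UTC to get the instant the log entry should be stamped with.
expected_utc = parsed.astimezone(timezone.utc)

print(expected_utc.isoformat())  # 2023-04-13T18:37:00+00:00
```

Any timestamp other than 18:37:00 UTC for this entry means the offset was not applied correctly.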

You are probably referring to the built-in Event Logs receiver. Version 2.14.0 does not parse Event Log timestamps in the first place (because of a different bug), so the time zone issue is hidden from windows_event_log receivers on that version. But because timestamps are not parsed, the time is wrong anyway: it does not reflect TimeGenerated as it should.

The example above shows only a couple seconds of error, which reflects the latency between generation and the Ops Agent picking it up. Importantly, if the Ops Agent is offline for some time, that latency will grow and cause the timestamps to be increasingly wrong once they're eventually picked up.

From 2.15.0 onwards, the bug that prevented Event Log timestamps from being parsed at all was fixed, which meant that the time zone parsing bug became unhidden. They're two separate issues.

Agreed that having it be wrong by only a couple seconds (on <= 2.14.0) is a better user experience than being wrong by several hours (on > 2.14.0), but we need to consider that many users are operating outside just the built-in use-case. Users with custom timestamp parsers, or users who need to turn off the Ops Agent sometimes, are still significantly affected on 2.14.0, because 2.14.0 does not completely solve the TZ bug.

PhilBrammer · 2023-04-13T19:53:56Z

Latency aside (that's the difference between generated and log "timestamp" unless overwritten in the parser), 2.14 logs show up in the correct TZ.

The "timestamp" field being in UTC (Z) is not a problem. Logging will skew that as per UI settings. And further, the calculation is correct.

However, above 2.14, the log timestamps appear as local timezone, but in UTC (Z). So this causes "missing" logs (because when skewed for US time zones, the logs appear in the past).

For example, this log was generated in Windows at 2:50 PM US Central Time (UTC-5) on Ops Agent 2.16. Note that the "receiveTimestamp" is correct, but Stackdriver doesn't use that field. The "timestamp" is 14:50Z, which should be 19:50Z.

This test machine was built at 2:39 PM US Central Time, and you can see the logs showing up in the histogram in the past.
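
The arithmetic behind that example can be sketched as follows. This models only the symptom described (local wall-clock digits stamped as UTC), not the agent's actual code path, and the calendar date is an assumption taken from the date of this comment.

```python
from datetime import datetime, timedelta, timezone

# US Central Time at UTC-5, as stated in the example.
US_CENTRAL = timezone(timedelta(hours=-5))

# Event generated at 2:50 PM local time (date assumed for illustration).
generated_local = datetime(2023, 4, 13, 14, 50, tzinfo=US_CENTRAL)

# What the log timestamp should be: the same instant expressed in UTC.
correct = generated_local.astimezone(timezone.utc)

# The reported symptom: the local digits are kept but labeled as UTC.
reported = generated_local.replace(tzinfo=timezone.utc)

skew = correct - reported
print(correct.strftime("%H:%MZ"))   # 19:50Z
print(reported.strftime("%H:%MZ"))  # 14:50Z
print(skew)                         # 5:00:00
```

With a negative UTC offset the reported timestamp lands in the past by the size of the offset, which matches the histogram behavior described here; with a positive offset it would land in the future.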

PhilBrammer · 2023-04-13T19:58:45Z

I get what you're saying, that mapping the event timestamp to the log timestamp solves this problem, but as it is now, I'm just trying to help out other customers that can't run the latest Ops Agent because of the memory issue. They'll likely run into log time zone issues if they are using Windows defaults and are pinning to an older Ops Agent version above 2.14.

braydonk · 2023-04-14T00:26:09Z

The memory problem should be resolved in the newly available Ops Agent 2.30 release! 🎉

Thanks everyone for your patience as I tracked this down. The incident dashboard should reflect the new status soon.

Those interested in technical details can find them in the upstream issues:
fluent/fluent-bit#6748
monkey/monkey#390
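
For anyone choosing a version to pin, the version ranges discussed in this thread can be summarized in a small lookup. This is a rough sketch that follows the maintainers' description above; the function name is hypothetical, and treating every release from 2.27 through 2.29 as affected by the memory issue is an assumption (only 2.27.0 and 2.29.0 were reported here).

```python
def known_issues(version: str) -> list[str]:
    """Return the issues from this thread that apply to an Ops Agent version."""
    major, minor = (int(part) for part in version.split(".")[:2])
    v = (major, minor)
    issues = []
    if v <= (2, 14):
        # Per the thread: timestamps are not parsed, so ingestion time is used.
        issues.append("Windows Event Log timestamps are not parsed")
    if v <= (2, 26):
        # Per the thread: fixed in 2.27.0; hidden for Event Logs on <= 2.14.
        issues.append("time zone offsets in parsed timestamps are handled incorrectly")
    if (2, 27) <= v <= (2, 29):
        # Assumption: the whole 2.27 to 2.29 range is affected.
        issues.append("high fluent-bit.exe memory usage on Windows")
    return issues


for candidate in ("2.14.0", "2.26.0", "2.29.0", "2.30.0"):
    print(candidate, known_issues(candidate))
```

Under these assumptions, 2.30 is the first version for which the list comes back empty.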

zdunning13 changed the title ~~High Memory on 2.29.0~~ High Memory on Windows Server 2019 Core running version 2.29.0 Mar 31, 2023

braydonk added the customer-case-attached label (Issues that have a customer case number attached to them for Googlers to investigate) Mar 31, 2023

braydonk self-assigned this Mar 31, 2023

braydonk closed this as completed Apr 14, 2023
