High Memory on Windows Server 2019 Core running version 2.29.0 #1182
Comments
Hi @zdunning13 thanks for the report. I would recommend opening a support case for this, as going through the customer support channels is the best way we can gather the info we'd need to determine the issue. If you do open a support case, please respond back here with the case number. |
Thanks @braydonk. The Google Case number is |
Thanks! This will be a big help for the investigation. I've started working on this, and will update the support case with more information as soon as I have anything to share. |
Created a support case as well with this issue. The GCP engineering team is aware of this. |
Thanks @22trevon, could you please send the case number here? We're getting a bunch of reports so we're gathering them together. Please ensure that you leave at least one VM as an example on your project so that you can run the diagnostic script and so we can take a look at Ops Agent metrics. In the meantime, you can try downgrading to Ops Agent 2.26, which should be okay. I suspect this is a regression from our upgrade to Fluent Bit major version 2, whereas 2.26 is still on Fluent Bit 1.9.8. |
Thanks @PhilBrammer, yes if you have a custom parser with timestamps in your Ops Agent config on Windows then this bug will still be present in Ops Agent versions before the Fluent Bit 2 upgrade. The customer case I do have access to mentions that this is with no custom config in the Ops Agent, so if that's the case for anyone else the timezone bug shouldn't affect you. |
No. A standard Ops Agent > 2.15 and < 2.27 will report Windows event logs in local time (but skewed by GMT). |
Right, my mistake. Any VM with a timezone other than UTC was affected by this. So if this applies, downgrading to 2.14 is the best option, which is also what the original comment on this issue suggested they were using previously. |
@braydonk and others that can offer help here: The Ops Agent known issues page needs to reflect this Windows time zone bug. Noting that here in case someone reading can make that change. New GCP customers pulling down 2.27 (latest) and being told to pin to 2.26 or older is going to cause issues for these customers if they are using Windows servers with non-UTC time zones. |
The Windows time zone issue is already documented: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/troubleshoot-run-ingest#event_log_timestamps_are_wrong_on_windows That issue has existed since the beginning (including 2.16 through 2.26), so the choice at the moment is to either use a newer version with the memory leak or an older version with the time zone issue (with the documented mitigations if needed). Once the fix for the leak is released then affected folks can upgrade to that version to get both fixes. |
I'm not sure what you mean by "since the beginning." 2.14 and lower report correctly, and might be worth calling out in that doc. To be clear, pinning to 2.14 solves both the memory problem and the Windows TZ bug. |
No version before 2.27 has ever parsed time zones correctly. The bug has been present in fluent-bit since our initial release, so pinning to 2.14 will not solve the time zone bug. A separate bug causes Windows Event Logs to not parse the timestamp at all on 2.14, which makes it default to the exporter's default time-of-ingress, which avoids the parsing issue and has the correct time zone -- but the timestamps are wrong anyway since they're not parsed. Any custom timestamp parsers using |
This is completely incorrect. We spin up Windows Server images from GCP (their images; not custom) and deploy 2.14 Ops Agent and do not have a time zone problem. Going above 2.14 introduces it. I guess in the end it is semantics. With 2.14 or lower, you'll see your Windows logs in Stackdriver with the correct time. Above 2.14, you'll have to wait your UTC offset in hours to see them, or look ahead with your timestamp filter. So mechanics aside, user experience matters. |
We might be talking about different things. I don't want to lead users into believing that 2.14.0 is completely free of time zone issues, so let's make sure we're talking about the same thing.
Time zone parsing has never worked correctly prior to 2.27.0. This is trivially demonstrable using a
You are probably referring to the built-in Event Logs receiver. Version 2.14.0 does not parse Event Log timestamps in the first place (because of a different bug), so the time zone issue is hidden from
The example above shows only a couple seconds of error, which reflects the latency between generation and the Ops Agent picking it up. Importantly, if the Ops Agent is offline for some time, that latency will grow and cause the timestamps to be increasingly wrong once they're eventually picked up.
From 2.15.0 onwards, the bug that prevented Event Log timestamps from being parsed at all was fixed, which meant that the time zone parsing bug became unhidden. They're two separate issues.
Agreed that having it be wrong by only a couple seconds (on <= 2.14.0) is a better user experience than being wrong by several hours (on > 2.14.0), but we need to consider that many users are operating outside just the built-in use-case. Users with custom timestamp parsers, or users who need to turn off the Ops Agent sometimes, are still significantly affected on 2.14.0, because 2.14.0 does not completely solve the TZ bug. |
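To make the two error modes in the comment above concrete, here is a minimal Python sketch. It is not Fluent Bit's or the Ops Agent's actual code: the function names, the UTC-5 zone, and the sample times are all hypothetical, and the direction of the skew in the real agent may differ. It only illustrates that relabeling a local wall-clock time as UTC produces an error equal to the UTC offset, while falling back to ingest time produces an error equal to the pickup latency.

```python
from datetime import datetime, timedelta, timezone


def relabel_wall_clock_as_utc(event_time: datetime) -> datetime:
    """Keep the wall-clock digits but label them as UTC (hypothetical model of a time zone parsing bug)."""
    return event_time.replace(tzinfo=timezone.utc)


def ingest_time_error(event_time: datetime, ingest_time: datetime) -> timedelta:
    """Error when the event timestamp is not parsed and ingest time is used instead."""
    return ingest_time - event_time


# Hypothetical VM configured for UTC-5, and a hypothetical event time.
vm_zone = timezone(timedelta(hours=-5))
event = datetime(2023, 3, 1, 9, 0, 0, tzinfo=vm_zone)

# Case 1: timestamp is parsed, but its zone is mishandled.
reported = relabel_wall_clock_as_utc(event)
parse_skew = abs(reported - event)  # magnitude equals the UTC offset

# Case 2: timestamp is not parsed; the log is picked up 2 seconds later.
fallback_error = ingest_time_error(event, event + timedelta(seconds=2))
```

Under these assumptions `parse_skew` is five hours and `fallback_error` is two seconds. If the agent were offline for an hour, the second error would grow to an hour, which is the point made above about ingest-time timestamps.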
I get what you're saying about mapping event timestamp to log timestamp solves this problem, but as it is now, I'm just trying to help out other customers that can't run the latest ops agent because of the memory issue. They'll likely run into log time zone issues if they are using Windows defaults and are forcing an older ops agent > 2.14. |
The memory problem should be resolved in the newly available Ops Agent 2.30 release! 🎉 Thanks everyone for your patience as I tracked this down. The incident dashboard should reflect the new status soon. Those interested in technical details can find them in the upstream issues: |
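For readers trying to pick a version to pin, the claims made in this thread can be collected into a small lookup. This is a sketch only, written in Python for illustration: the `known_issues` helper and its issue labels are hypothetical, the version boundaries are taken from the comments above rather than from official release notes, and treating every release from 2.27 up to (but not including) 2.30 as affected by the leak is an assumption, since the thread only names 2.27.0 and 2.29.0 explicitly.

```python
def parse_version(text: str) -> tuple:
    """Turn '2.29.0' or '2.26' into a comparable (major, minor, patch) tuple."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def known_issues(version: str) -> list:
    """Issues this thread attributes to a given Ops Agent version on Windows (hypothetical helper)."""
    v = parse_version(version)
    issues = []
    # Leak reported on 2.27.0 and 2.29.0, resolved in 2.30 (range is an assumption).
    if (2, 27, 0) <= v < (2, 30, 0):
        issues.append("memory_leak")
    # "No version before 2.27 has ever parsed time zones correctly."
    if v < (2, 27, 0):
        issues.append("time_zone_parsing")
    # Before 2.15.0, Event Log timestamps are not parsed at all, which hides
    # the time zone bug for the built-in Event Logs receiver.
    if v < (2, 15, 0):
        issues.append("event_log_timestamps_unparsed")
    return issues
```

For example, `known_issues("2.26")` lists only the time zone parsing bug, and `known_issues("2.30.0")` is empty according to the thread's claims.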
When running the Google Cloud Ops Agent version 2.29.0 or 2.27.0 we are seeing high memory utilization on Windows 2019 Core servers. The memory consumption is coming from fluent-bit.exe, which is installed with the Ops Agent.
Memory utilization is stable when running on version 2.14.0.
Steps to reproduce the behavior:
fluent-bit.exe