Flow prometheus remote write on Windows fails to replay WAL after restart #3826
Comments
After some investigation, this looks to be caused by upstream Prometheus when it parses the checkpoint directory name. It takes the base of the directory path and splits that string on the period. This works on a Unix path but not on Windows, because path.Base() only treats `/` as a separator and so returns the full path to the folder/file instead of the last element. This can be worked around by not using periods in any folder names for the WAL directory.
Hi, thanks for reporting and investigating; your discovery is really useful. I'll open a PR upstream to fix this, and we'll see what we can do to bring the fix downstream.
This isn't something that can be done without a code change unfortunately, the component name (which always contains a |
This changes usage of path to be replaced with path/filepath, allowing for filepath.Base to properly return the base directory on systems where `/` is not the standard path separator. This resolves an issue on Windows where intermediate folders containing a `.` were incorrectly considered to be a part of the checkpoint name. Related to grafana/agent#3826. Signed-off-by: Robert Fratto <robertfratto@gmail.com>
This changes usage of path to be replaced with path/filepath, allowing for filepath.Base to properly return the base directory on systems where `/` is not the standard path separator. This resolves an issue on Windows where intermediate folders containing a `.` were incorrectly considered to be a part of the checkpoint name. Related to grafana/agent#3826. Signed-off-by: Robert Fratto <robertfratto@gmail.com> (cherry picked from commit 9e4e2a4)
Fixes grafana#3826 (cherry picked from commit b555681)
* flow: only use component health when component implements component.HealthComponent (#3740)
  * Revert "Default back to healthy when a component does not implement component.HealthComponent (#3558)"
    This reverts commit b2f8992. The commit introduced a bug where the "default health" message was always returned for healthy components, overriding the message informing the user when the last time the component evaluated was.
  * flow: only use component health if component exposes health
    This commit offers an alternative approach to #3558, where the overall health of a component only considers the component implementation of health if that component implements the component.HealthComponent interface. Without this commit, the zero value of component.Health was always returned, considering the health of every component as being "unknown."
  (cherry picked from commit ff2ca95)
* cluster: honor deadline when opening TCP connection (#3764)
  Although synchronous writes to peers have a timeout, this timeout only affected the HTTP/2 request and not the net.Dial call. This commit attempts to honor the deadline when establishing a connection to a peer, otherwise it falls back to a default timeout of 30s.
  (cherry picked from commit f4c0e06)
* Fix windows container pointing at the wrong location for config (#3775)
  * fix windows container image
    Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  * Fix an issue with the windows grafana/agent windows docker image entrypoint not targeting the right location for the config.
    Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  ---------
  Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  (cherry picked from commit a7319fc)
* node_exporter: fix usage of diskstat flags (#3760)
  Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
  (cherry picked from commit 6e4d3d6)
* agent-flow: fix S3 object fetch when path has parent directories (#3800)
  * agent-flow: fix S3 object fetch when path has parent directories
    Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>
  * add unit test case for S3 path parsing functionality and bug fix entry in changelog
    Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>
  ---------
  Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>
  Co-authored-by: mattdurham <mattdurham@ppog.org>
  (cherry picked from commit 0dda0ba)
* misc: cherry-pick prometheus/prometheus#12349 (#3853)
  Fixes #3826
  (cherry picked from commit b555681)
* add logging if we fail to update a controller in operator (#3854)
  (cherry picked from commit 07cc2f4)
* Update log level for a phlare.scrape log (#3813)
  * Update log level for a phlare.scrape log
    Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  * changelog
    Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  * Update CHANGELOG.md
    Co-authored-by: Robert Fratto <robertfratto@gmail.com>
  ---------
  Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
  Co-authored-by: Robert Fratto <robertfratto@gmail.com>
  (cherry picked from commit 9845ddd)
* Make sure labels are cloned from loki.source.kubernetes (#3861)
  * Make sure labels are cloned from loki.source.kubernetes
    Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
  * Add changelog entry
    Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
  ---------
  Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
  (cherry picked from commit 8fb5be1)
* update refs for v0.33.2
---------
Co-authored-by: Erik Baranowski <39704712+erikbaranowski@users.noreply.github.com>
Co-authored-by: Paschalis Tsilias <tpaschalis@users.noreply.github.com>
Co-authored-by: Jasti Sri Radhe Shyam <15701495+jastisriradheshyam@users.noreply.github.com>
Co-authored-by: Craig Peterson <192540+captncraig@users.noreply.github.com>
What's wrong?
After restarting the Flow service or rebooting the Windows server, the agent attempts to replay the WAL but gets stuck in a loop:
After that, it will error every minute with this:
And no metrics will be written to Mimir.
Stopping the agent, removing the WAL folder, and starting it up again resolves it, but this is hard to maintain across a large number of servers when doing updates.
Steps to reproduce
System information
Windows Server 2019 1809
Software version
Grafana Agent Flow v0.33.1
Configuration