Flow prometheus remote write on Windows fails to replay WAL after restart #3826

Closed
dawei-nh opened this issue May 10, 2023 · 2 comments · Fixed by #3853
Labels
bug Something isn't working frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.

@dawei-nh

What's wrong?

After restarting the Flow service or rebooting the Windows server, the agent attempts to replay the WAL but gets stuck in a loop:

ts=2023-05-10T00:22:46.5112732Z component=prometheus.remote_write.mimir subcomponent=rw level=info remote_name=d9bd36 url=https://my_mimir_url.com/api/v1/push msg="Replaying WAL" queue=d9bd36
ts=2023-05-10T00:22:46.5112732Z component=prometheus.remote_write.mimir subcomponent=rw level=error remote_name=d9bd36 url=https://my_mimir_url.com/api/v1/push msg="error tailing WAL" err="readCheckpoint: checkpointNum: invalid checkpoint dir string: C:\\ProgramData\\Grafana Agent Flow\\data\\prometheus.remote_write.mimir\\wal\\checkpoint.00000064"

After that, it logs this error every minute:

ts=2023-05-10T00:23:46.5993796Z component=prometheus.remote_write.mimir subcomponent=rw level=error remote_name=d9bd36 url=https://my_mimir_url.com/api/v1/push msg="error tailing WAL" err="readCheckpoint: checkpointNum: invalid checkpoint dir string: C:\\ProgramData\\Grafana Agent Flow\\data\\prometheus.remote_write.mimir\\wal\\checkpoint.00000064"

And no metrics will be written to Mimir.

Stopping the agent, removing the WAL folder, and starting the agent again resolves it, but this is hard to maintain across a large number of servers when doing updates.

Steps to reproduce

  1. Configure Grafana Agent Flow on Windows with a prometheus.remote_write component
  2. Let the WAL accumulate
  3. Restart the Grafana Agent Flow service

System information

Windows Server 2019 1809

Software version

Grafana Agent Flow v0.33.1

Configuration

logging {
  level = "info"
}

prometheus.exporter.windows "this" {
  enabled_collectors = ["adfs","cpu","cs","logical_disk","net","os","process","system"]
}

prometheus.scrape "windows" {
  targets    = prometheus.exporter.windows.this.targets
  forward_to = [prometheus.relabel.this.receiver]

  scrape_interval = "15s"
}

prometheus.relabel "this" {
  forward_to = [prometheus.remote_write.mimir.receiver]

  rule {
    action       = "replace"
    target_label = "instance"
    replacement  = constants.hostname
  }
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://my-mimir-url.com/api/v1/push"

    basic_auth {
      username = "foo"
      password = "bar"
    }
  }
}


Logs

No response
@dawei-nh dawei-nh changed the title Flow on Windows fails to replay WAL after restart Flow prometheus remote write on Windows fails to replay WAL after restart May 10, 2023
@dawei-nh
Author

After some investigation, this appears to be caused by upstream Prometheus code that splits the checkpoint directory name:

https://github.com/prometheus/prometheus/blob/bd98fc8c45524609590ef757a5f32ee7f0f0d1c7/tsdb/wlog/watcher.go#L691-L697

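It splits the directory string on the period. This works for a Unix path but not on Windows: path.Base() only treats / as a separator, so it does not split a Windows path and returns the full path to the folder/file instead. Any period in an intermediate folder name then ends up in the string being split.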

This can be worked around by not using periods in any folder names for the WAL directory.

@rfratto rfratto added bug Something isn't working type/signals labels May 10, 2023
@rfratto
Member

rfratto commented May 10, 2023

Hi, thanks for reporting and investigating; your discovery is really useful. I'll open a PR upstream to fix this, and we'll see what we can do to bring it downstream to fix the issue.

> This can be worked around by not using periods in any folder names for the WAL directory.

Unfortunately, this can't be done without a code change: the component name (which always contains a .) is hard-coded into the WAL directory path.

rfratto added a commit to rfratto/prometheus that referenced this issue May 10, 2023
This changes usage of path to be replaced with path/filepath, allowing
for filepath.Base to properly return the base directory on systems where
`/` is not the standard path separator.

This resolves an issue on Windows where intermediate folders containing
a `.` were incorrectly considered to be a part of the checkpoint name.

Related to grafana/agent#3826.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
@rfratto rfratto self-assigned this May 10, 2023
@rfratto rfratto moved this from Todo to In Progress in Grafana Agent (Public) May 10, 2023
@rfratto rfratto added this to the v0.34.0 milestone May 10, 2023
rfratto added a commit to grafana/prometheus that referenced this issue May 11, 2023
(cherry picked from commit 9e4e2a4)
rfratto added a commit to rfratto/agent that referenced this issue May 11, 2023
rfratto added a commit that referenced this issue May 11, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Grafana Agent (Public) May 11, 2023
rfratto added a commit that referenced this issue May 11, 2023
* flow: only use component health when component implements component.HealthComponent (#3740)

* Revert "Default back to healthy when a component does not implement component.HealthComponent (#3558)"

This reverts commit b2f8992. The commit
introduced a bug where the "default health" message was always returned
for healthy components, overriding the message informing the user when
the last time the component evaluated was.

* flow: only use component health if component exposes health

This commit offers an alternative approach to #3558, where the overall
health of a component only considers the component implementation of
health if that component implements the component.HealthComponent
interface.

Without this commit, the zero value of component.Health was always
returned, considering the health of every component as being "unknown."

(cherry picked from commit ff2ca95)

* cluster: honor deadline when opening TCP connection (#3764)

Although synchronous writes to peers have a timeout, this timeout only
affected the HTTP/2 request and not the net.Dial call. This commit
attempts to honor the deadline when establishing a connection to a peer,
otherwise it falls back to a default timeout of 30s.

(cherry picked from commit f4c0e06)

* Fix windows container pointing at the wrong location for config (#3775)

* fix windows container image

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>

* Fix an issue with the grafana/agent Windows docker image entrypoint
  not targeting the right location for the config.

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>

---------

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
(cherry picked from commit a7319fc)

* node_exporter: fix usage of diskstat flags (#3760)

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
(cherry picked from commit 6e4d3d6)

* agent-flow: fix S3 object fetch when path has parent directories (#3800)

* agent-flow: fix S3 object fetch when path has parent directories

Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>

* add unit test case for S3 path parsing functionality and bug fix entry in changelog

Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>

---------

Signed-off-by: Jasti Sri Radhe Shyam <samabhasatejsrs@outlook.com>
Co-authored-by: mattdurham <mattdurham@ppog.org>
(cherry picked from commit 0dda0ba)

* misc: cherry-pick prometheus/prometheus#12349 (#3853)

Fixes #3826

(cherry picked from commit b555681)

* add logging if we fail to update a controller in operator (#3854)

(cherry picked from commit 07cc2f4)

* Update log level for a phlare.scrape log (#3813)

* Update log level for a phlare.scrape log

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>

* changelog

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Robert Fratto <robertfratto@gmail.com>

---------

Signed-off-by: erikbaranowski <39704712+erikbaranowski@users.noreply.github.com>
Co-authored-by: Robert Fratto <robertfratto@gmail.com>
(cherry picked from commit 9845ddd)

* Make sure labels are cloned from loki.source.kubernetes (#3861)

* Make sure labels are cloned from loki.source.kubernetes

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

* Add changelog entry

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

---------

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
(cherry picked from commit 8fb5be1)

* update refs for v0.33.2

---------

Co-authored-by: Erik Baranowski <39704712+erikbaranowski@users.noreply.github.com>
Co-authored-by: Paschalis Tsilias <tpaschalis@users.noreply.github.com>
Co-authored-by: Jasti Sri Radhe Shyam <15701495+jastisriradheshyam@users.noreply.github.com>
Co-authored-by: Craig Peterson <192540+captncraig@users.noreply.github.com>
tpaschalis pushed a commit to tpaschalis/agent that referenced this issue May 16, 2023
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024