
Labels vanishing after 30 minutes of inactivity #453

Closed
candlerb opened this issue Apr 6, 2019 · 17 comments · Fixed by #521
Labels
component/loki type/bug Something is not working as expected

Comments

@candlerb
Contributor

candlerb commented Apr 6, 2019

Describe the bug
Test loki instance seemed to lose its labels after a while (30+ minutes of inactivity)

To Reproduce
Steps initially used to reproduce the behavior:

  1. Started loki (c994823) in a screen session
  2. Started promtail (c994823) to tail '/var/logs' using default config, in another screen session
  3. Show labels and log entries
  4. Go away for a while
  5. Show labels and log entries

EDIT: A better way to reproduce this is shown below. The behaviour above depends on an Ubuntu container running cron.hourly at 17 minutes past the hour and otherwise not logging anything. From 47 minutes past the hour until 17 minutes past the next hour, the labels were missing.

Expected behavior

At startup I was able to read labels, and query logs via the API:

curl 'localhost:3100/api/prom/label'
curl 'localhost:3100/api/prom/label/job/values'
curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
... saw a bunch of logs

But after going away and coming back again, no labels were shown:

# curl 'localhost:3100/api/prom/label'
{}
# curl 'localhost:3100/api/prom/label/job/values'
{}

And yet the chunks are still there:

# ls /tmp/loki/chunks/
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMTZiODE2OjE2OWYzMTZiODYyOjdkNzMyNGQy
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzMzk2ZWExOjE2OWYzMzk2ZWExOjE1MGVkZDA2
 ZmFrZS83YTI1N2M5ZWI2YTYyMDkwOjE2OWYzNzA1Y2YzOjE2OWYzNzA1Y2YzOmY3YzNhNjk3
 ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMTZiODE5OjE2OWYzMWE2MjU0OjE0ZDliNGE4
 ZmFrZS83ZTAyYmFmNDA1NmYyN2Q0OjE2OWYzMjExYmEyOjE2OWYzMjI5ZmU0OjNkMzM2M2I3
 ZmFrZS84ZTk0MDgzMzdiZDZlYzA0OjE2OWYzMTZiODFhOjE2OWYzMTZiODJmOjIyYmFhYWI2
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjM3MDVjZjM6MTY5ZjM3MDVjZjM6OGNiNmIwMjY='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMxNmI4MTc6MTY5ZjMxNmI4MjI6YWM3NWFkZmI='
'ZmFrZS85N2RiMjQ0MDM5ZWFiOGY6MTY5ZjMzOTZlYTE6MTY5ZjMzOTZlYTE6YTc3MjQ0MDA='
 ZmFrZS85NDY0MWIwNjQ0MGRkNGE0OjE2OWYzMTZiODFiOjE2OWYzMTZiOGVmOjI0ODBjYWVh
 ZmFrZS85ZjY0NjJiM2YxZmM5ZmZlOjE2OWYzMTZiODE5OjE2OWYzMTZiODFmOjE3NjA0ODFk

After a few minutes, some logs reappeared:

# curl 'localhost:3100/api/prom/label'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1084  100  1084    0     0   132k      0 --:--:-- --:--:-- --:--:--  132k
{
  "streams": [
    {
      "labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T17:02:01.225904931Z",
          "line": "Apr  6 17:02:01 loki CRON[13227]: (root) CMD (   test -x /etc/cron.daily/popularity-contest && /etc/cron.daily/popularity-contest --crond)"
        },
        {
          "ts": "2019-04-06T16:17:01.17174724Z",
          "line": "Apr  6 16:17:01 loki CRON[13204]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)"
        }
      ]
    },
    {
      "labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T17:02:01.22562594Z",
          "line": "Apr  6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T17:02:01.22561571Z",
          "line": "Apr  6 17:02:01 loki CRON[13226]: pam_unix(cron:session): session opened for user root by (uid=0)"
        },
        {
          "ts": "2019-04-06T16:17:01.171625148Z",
          "line": "Apr  6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T16:17:01.171614933Z",
          "line": "Apr  6 16:17:01 loki CRON[13203]: pam_unix(cron:session): session opened for user root by (uid=0)"
        }
      ]
    }
  ]
}

It's as if the API doesn't always return all available labels and log records. However, the API documentation doesn't mention this.

In fact, the API documentation doesn't give any defaults for the query parameters. I would expect 'end' to default to the current time; it's unclear whether there's a default for 'limit', or whether 'start' defaults to some fixed offset before the end time.
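
To sidestep the ambiguity the parameters can be passed explicitly; for example (the limit and the Unix-epoch-nanosecond timestamps below are purely illustrative):

# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d&limit=1000&start=1554552000000000000&end=1554573600000000000' | jq .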

Environment:

  • Ubuntu 18.04 lxd container running on Ubuntu 18.04 host
  • Loki built from source with go 1.12.1

Additional
During all this time, no messages were logged to the screen where loki was running, except for the messages which were output when it started up.

# cd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml
level=info ts=2019-04-06T14:44:43.408338242Z caller=loki.go:122 msg=initialising module=server
level=info ts=2019-04-06T14:44:43.408965936Z caller=gokit.go:36 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2019-04-06T14:44:43.409629113Z caller=loki.go:122 msg=initialising module=overrides
level=info ts=2019-04-06T14:44:43.409891026Z caller=override.go:33 msg="per-tenant overides disabled"
level=info ts=2019-04-06T14:44:43.410067021Z caller=loki.go:122 msg=initialising module=store
level=info ts=2019-04-06T14:44:43.410367709Z caller=loki.go:122 msg=initialising module=ingester
level=info ts=2019-04-06T14:44:43.410842862Z caller=loki.go:122 msg=initialising module=ring
level=info ts=2019-04-06T14:44:43.411310724Z caller=loki.go:122 msg=initialising module=distributor
level=info ts=2019-04-06T14:44:43.411437991Z caller=loki.go:122 msg=initialising module=querier
level=info ts=2019-04-06T14:44:43.413546771Z caller=loki.go:122 msg=initialising module=all
level=info ts=2019-04-06T14:44:43.413740112Z caller=main.go:56 msg="Starting Loki" version="(version=, branch=, revision=)"
level=info ts=2019-04-06T14:44:43.416685505Z caller=lifecycler.go:366 msg="entry not found in ring, adding with no tokens"
level=error ts=2019-04-06T14:44:43.416927349Z caller=consul_client.go:182 msg="error getting path" key=ring err=null
level=info ts=2019-04-06T14:44:43.417484986Z caller=lifecycler.go:296 msg="auto-joining cluster after timeout"
@candlerb
Contributor Author

candlerb commented Apr 6, 2019

OK, it does seem to be something to do with the start time; I am able to get all the logs I saw previously by using:

# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d&start=1554556352000000000' | jq .

where that start time is date +%s - 14400, multiplied up to nanoseconds.
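
In shell, that computation looks something like this (a sketch; the 14400-second lookback is arbitrary):

# start=$(( $(date +%s) - 14400 ))000000000
# curl "localhost:3100/api/prom/query?query=%7bjob=\"varlogs\"%7d&start=${start}" | jq .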

(Aside: it's a bit annoying that the JSON output gives "ts" in ISO datetime format, while the query string uses Unix epoch nanoseconds. It would be nice to stick with one or the other.)

But I still don't understand why the labels vanished. If I query even an empty time range, with start=end, I now get labels:

# curl 'localhost:3100/api/prom/label?start=1554567417000000000&end=1554567417000000000'
{"values":["__filename__","job"]}
# curl 'localhost:3100/api/prom/label?start=0&end=1'
{"values":["__filename__","job"]}

But I definitely wasn't getting any earlier, when writing up this ticket.

@candlerb
Contributor Author

candlerb commented Apr 6, 2019

I was able to reproduce, just leaving the system alone.

# while [ 1 ]; do curl 'localhost:3100/api/prom/label'; sleep 300; done
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{}
{}
{}
{}
{}
{}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}
{"values":["__filename__","job"]}

This suggests that labels are only shown for log entries from the past half hour? Is this documented anywhere?

# date
Sat Apr  6 21:32:29 UTC 2019
# curl 'localhost:3100/api/prom/query?query=%7bjob="varlogs"%7d' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   606  100   606    0     0  60600      0 --:--:-- --:--:-- --:--:-- 60600
{
  "streams": [
    {
      "labels": "{__filename__=\"/var/log/auth.log\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T21:17:01.407876399Z",
          "line": "Apr  6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session closed for user root"
        },
        {
          "ts": "2019-04-06T21:17:01.407867827Z",
          "line": "Apr  6 21:17:01 loki CRON[13472]: pam_unix(cron:session): session opened for user root by (uid=0)"
        }
      ]
    },
    {
      "labels": "{__filename__=\"/var/log/syslog\", job=\"varlogs\"}",
      "entries": [
        {
          "ts": "2019-04-06T21:17:01.401544901Z",
          "line": "Apr  6 21:17:01 loki CRON[13473]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)"
        }
      ]
    }
  ]
}

@candlerb
Contributor Author

candlerb commented Apr 7, 2019

The label API endpoint doesn't appear to honour start and end. Testing during one of the times when the labels are missing:

# curl "localhost:3100/api/prom/label"
{}
# echo "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
localhost:3100/api/prom/label?start=1554621389000000000&end=1554624989000000000
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-3600))000000000&end=$(($(date +%s)))000000000"
{}
# curl "localhost:3100/api/prom/label?start=$(($(date +%s)-14400))000000000&end=$(($(date +%s)))000000000"
{}

The last query covers the time up to 4 hours ago, but still no labels are seen.

I also notice that logcli labels doesn't appear to include a start or end time in the request it sends, whereas logcli query does.

It's now pretty clear that the labels are visible for 30 minutes after HH:17, which is the time that cron.hourly runs.

# grep hourly /etc/crontab
17 *	* * *	root    cd / && run-parts --report /etc/cron.hourly

@daixiang0
Contributor

Currently logcli cannot query with a timestamp.
Did you restart Loki? There is no WAL support yet (no write-ahead log is implemented), so data held in memory is lost across a restart.

@candlerb
Contributor Author

candlerb commented Apr 8, 2019

Currently logcli cannot query with a timestamp.

I'm using curl, not logcli, and I'm following the API documentation. It says that the /api/prom/label endpoint is "For retrieving the names of the labels one can query on". It does not say that it only returns labels which have been active in the last 30 minutes. Hence I was surprised that labels vanished from the list and later reappeared.

This could be a question of fixing the documentation (if this is intentional behaviour), or it could be a question of fixing the implementation.

One thing is for sure: I would be surprised to go into Grafana and query for available labels, and not find any. (This is what actually happened when I first tried Grafana+Loki; I switched to using the API directly to try to narrow down why logs were randomly disappearing).

Did you restart Loki? There is no WAL support yet (no write-ahead log is implemented), so data held in memory is lost across a restart.

No - Loki was left running continuously for the whole test.

@daixiang0
Contributor

https://github.com/grafana/loki/blob/master/pkg/querier/http.go#L20
It is a default setting.
@tomwilkie hi, are there any thoughts about this setting?

@candlerb
Contributor Author

candlerb commented Apr 8, 2019

I'm not sure that's the same.

  • defaulSince (sic) is one hour before the current time, but as you can see from my sleep 300 loop, labels disappear after 30 minutes.
  • I tried making calls to /api/prom/label with explicit start and end times, but it didn't make any difference. Therefore I think start/end are only used for log queries, not label requests.

@candlerb
Contributor Author

candlerb commented Apr 8, 2019

I did find this 30-minute constant:

./pkg/ingester/ingester.go:	f.DurationVar(&cfg.MaxChunkIdle, "ingester.chunks-idle-period", 30*time.Minute, "")

Could it be related? When chunks are flushed, are the labels forgotten?
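
If that is the cause, a possible stop-gap (untested here, and it keeps chunks in memory for longer) would be to raise that flag when starting loki:

# ./loki -config.file=./cmd/loki/loki-local-config.yaml -ingester.chunks-idle-period=2h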

@candlerb changed the title from "Labels vanishing after a while" to "Labels vanishing after 30 minutes of inactivity" on Apr 8, 2019
@candlerb
Contributor Author

candlerb commented Apr 8, 2019

I will turn this into a reproducer which doesn't require promtail and doesn't depend on Ubuntu's cron.hourly behaviour.

  1. Install loki
  2. Ensure /tmp/loki doesn't exist
  3. Start loki in a separate terminal or screen session:
    cd ~/go/src/github.com/grafana/loki; ./loki -config.file=./cmd/loki/loki-local-config.yaml
  4. Add a log entry, check labels, wait >30 minutes, check labels again. Like this:
# curl -H Content-Type:application/json -d '{"streams":[{"labels":"{foo=\"bar\"}","entries":[{ "ts": "2018-12-18T08:28:06.801064-04:00", "line": "baz" }]}]}' localhost:3100/api/prom/push
# curl localhost:3100/api/prom/label
{"values":["foo"]}
# sleep 1860; curl localhost:3100/api/prom/label
{}
# curl 'localhost:3100/api/prom/label?start=1545049686000000000&end=1545222486000000000'
{}

(Aside: for consistency I would have expected {"values":[]} here?)

But the main thing is that the labels have vanished, even when you give an explicit time range. The range specified above runs from 24 hours before to 24 hours after the given ts value:

irb(main):003:0> require 'time'
=> true
irb(main):004:0> t = Time.parse("2018-12-18T08:28:06.801064-04:00")
=> 2018-12-18 08:28:06 -0400
irb(main):005:0> t.to_i
=> 1545136086
irb(main):006:0> t.to_i - 86400
=> 1545049686
irb(main):007:0> t.to_i + 86400
=> 1545222486
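
The same bounds can be computed in shell, e.g. with GNU date (equivalent to the irb session above):

# t=$(date -d '2018-12-18T08:28:06-04:00' +%s)
# echo "start=$(( t - 86400 ))000000000&end=$(( t + 86400 ))000000000"
start=1545049686000000000&end=1545222486000000000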

@garo
Contributor

garo commented Apr 9, 2019

Based on what I learned from the source code, /api/prom/label only queries labels from the ingester instances, which means you only get labels that are currently "active" (or have been very recently); see https://github.com/grafana/loki/blob/master/pkg/querier/querier.go#L130.

When you query actual log data, the query is sent to both the ingesters and the index backend. For this you need to specify the "start" option, as the label indices are spread along the time axis. I just created PR #461 which documents this behaviour, but not the first part (active labels).

Does this explain what you are getting? Whether this should be improved is another question.
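
One way to see that split side by side, once a stream has been idle for more than 30 minutes, is to hit both endpoints with the same explicit range (a sketch reusing the curl pattern from earlier in this thread; the 24-hour range is arbitrary). The label endpoint (ingesters only) can come back empty while the query endpoint (ingesters plus store) still returns the logs:

# range="start=$(( $(date +%s) - 86400 ))000000000&end=$(date +%s)000000000"
# curl "localhost:3100/api/prom/label?${range}"
# curl "localhost:3100/api/prom/query?query=%7bjob=\"varlogs\"%7d&${range}" | jq .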

@candlerb
Contributor Author

candlerb commented Apr 9, 2019

Does this explain what you are getting?

Sounds like it could. The 30-minute ingester idle period is not documented anywhere.

This behaviour isn't good. When I'm in the Grafana UI, I cannot see any logs (including historical logs) until I've selected at least one label; but if that particular label hasn't been active for more than 30 minutes, it isn't shown. This makes Loki pretty useless.

@davkal
Contributor

davkal commented Apr 10, 2019

This issue is being worked on over the next couple of days. Thanks for the thorough reporting.

@devAcad

devAcad commented Apr 17, 2019

We are hitting the same issue. Is there an ETA for when the fix will be available?

@cyriltovena
Contributor

The fix is in review!

Thank you.

@devAcad

devAcad commented May 28, 2019

Are there any further updates on this?

@rasple

rasple commented Nov 25, 2019

Is there a fix yet? I'm having the same problem with the latest Docker image.

@slim-bean
Collaborator

This particular issue was fixed by #521, which made labels also be queried from the store.

If you are missing labels in Explore, expand the timeframe of your query and reload the page. (The reload is only required to work around a temporary change made in Explore to accommodate slow label query performance; that is fixed by the v11 schema update, which will be made the default as soon as we figure out how best to adapt it in the Helm chart.)

@chaudum added the "type/bug: Something is not working as expected" label on Jun 14, 2023