
boltdb-shipper + retention_deletes: "object not found in storage" when the query goes over the retention-border #3058

Closed
pcbl opened this issue Dec 8, 2020 · 11 comments

Comments


pcbl commented Dec 8, 2020

Describe the bug
I am using Loki 2.0 with a configuration that uses boltdb-shipper, the filesystem object store, and a retention period of one week (168h). Here is the complete configuration:

auth_enabled: false

server:
  log_level: 'warn'

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb-shipper
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: C:\ProgramData\POC\loki\index
    cache_location: C:\ProgramData\POC\loki\index\cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: C:\ProgramData\POC\loki\chunks
    
compactor:
  working_directory: C:\ProgramData\POC\loki\compactor
  shared_store: filesystem    

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: true
  retention_period: 168h

I am facing a situation where, once a week has passed and the retention period is due, the table manager starts removing old data as expected. However, the removal process seems to leave Loki in an inconsistent state: as soon as I perform searches one day beyond the "retention border", I get an object not found in storage error. Scenarios:

  • Search within the past 7 days (now-7d), i.e. within the retention border: OK!
  • Search on an old period entirely before the retention border, say from the 11th to the 10th past day: OK!
  • Search on the day directly outside the retention border (the 8th past day): object not found in storage
  • Search on a period that spans the retention border, e.g. from the 1st past day to the 11th past day (crossing the 8th day): object not found in storage

Could this be connected with this issue?
#2816

I am also wondering whether pre-release builds of Loki are available anywhere, so I could try a build that contains this PR: #2855


pcbl commented Dec 8, 2020

One additional piece of information: my chunks index folder (C:\ProgramData\POC\loki\chunks\index) has the following structure:

├───index_18591
│       compactor-1606353752.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606353300.gz
│
├───index_18592
│       compactor-1606440150.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606439700.gz
│
├───index_18593
│       compactor-1606526549.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606526100.gz
│
├───index_18594
│       compactor-1606612947.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606612500.gz
│
├───index_18595
│       compactor-1606699345.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606698900.gz
│
├───index_18596
│       compactor-1606784374.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606784400.gz
│
├───index_18597
│       compactor-1606870774.gz
│       EC2AMAZ-79T5VRA-1606231180783341000-1606870800.gz
│
└───index_18598
        compactor-1607418299.gz

I have noticed that the index_18591 folder is the one that seems "corrupted": as soon as I delete it, I no longer get any object not found in storage errors. It looks to me like the retention process, while removing old logs, leaves references in the index to chunks that no longer exist, which then triggers the error...

I am at a bit of a dead end, as I am not sure whether it is possible to work around this issue.


stale bot commented Jan 9, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.


pcbl commented Jan 11, 2021

The issue still happens, so please do not close it.


pcbl commented Jan 20, 2021

Just as an update: I have now tested with Loki 2.1 and the issue still happens.


ajs124 commented Feb 5, 2021

We're observing what seems to be the same issue.


dbluxo commented Mar 25, 2021

We use S3 as the storage backend and have configured different S3 bucket lifecycle rules per sub-folder for different tenants, i.e. for some tenants we delete the chunks after 7 days, for others only after 31 days. As soon as a tenant makes a request whose time range goes beyond its lifecycle window, it also gets an object not found in storage error back in Grafana.
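
For reference, the per-tenant expiry described above can be expressed with prefix-scoped S3 lifecycle rules, roughly along the lines of this CloudFormation-style sketch (the bucket name, tenant prefixes, and rule IDs are placeholders, not our actual values):

Resources:
  LokiChunksBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: loki-chunks              # placeholder bucket name
      LifecycleConfiguration:
        Rules:
          # Short-retention tenant: expire chunk objects after 7 days
          - Id: ExpireTenantAAfter7Days
            Prefix: tenant-a/              # placeholder tenant sub-folder
            Status: Enabled
            ExpirationInDays: 7
          # Long-retention tenant: expire chunk objects after 31 days
          - Id: ExpireTenantBAfter31Days
            Prefix: tenant-b/              # placeholder tenant sub-folder
            Status: Enabled
            ExpirationInDays: 31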


owen-d commented May 6, 2021

Hey, you should set your max_look_back_period to be no longer than your retention period (which is set to retention_period: 168h in your example). This will ensure you don't try to look up chunks which have been deleted.

chunk_store_config:
  max_look_back_period: 0    # <---- change this to:
  max_look_back_period: 168h

Please see https://grafana.com/docs/loki/latest/configuration/#chunk_store_config for more details.
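
For the single-tenant filesystem setup in the original report, the relevant sections would then look roughly like this (a sketch only; the key point is that max_look_back_period never exceeds retention_period):

chunk_store_config:
  # Never look back further than the data the table manager actually retains.
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h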

owen-d closed this as completed May 6, 2021

dbluxo commented May 6, 2021

@owen-d Unfortunately, it is not quite that simple. As I tried to describe, we have set different retention times for different tenants in S3, so a single max_look_back_period does not work for us. In my opinion, there should simply be no error; Grafana/Loki should display whatever logs are still retrievable.


xeor commented Jan 2, 2022

Any news here? I get this error after a cluster reinstall. I expect some data to be missing, but I still want what is available. Is there any way to retrieve what is there and ignore this error?

craftyc0der commented

I restarted my Loki pod on a new server and lost access to all historical data stored on S3. Seems strange.
