Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otelcol-contrib file_storage does not recover gracefully upon potentially corrupted database #35899

Open
briandavis-viz opened this issue Oct 21, 2024 · 3 comments
Labels
bug Something isn't working extension/storage/filestorage needs triage New item requiring triage

Comments

@briandavis-viz
Copy link

Component(s)

extension/storage/filestorage

What happened?

Description

When there are issues with underlying a linux guest storage, mostly in scenarios where the hypervisor loses connectivity or experiences extremely high latency with the NAS/SAN, the database can be in a corrupted state, which requires completely deleting the filesstorage queue, to allow otel to function (or even start running) again.

Steps to Reproduce

While a VM is running, try to remove it's storage beneath it, potentially in the middle of otel writing to the bolt filestorage.

This may also be caused by machines being improperly shut down (like power loss.)

Expected Result

Some data will be lost, but the filestorage extension will recover gracefully and be able to resume running.

Actual Result

The otelcol service attempts to start, but fails to do so when loading the filestorage extension. This requires manual intervention to delete the specific filestorage file that was corrupted, and then to start the service.

Collector version

0.106.1

Environment information

Environment

OS: Ubuntu 22.04
Compiler(if manually compiled): N/A, using https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.106.1/otelcol-contrib_0.106.1_linux_amd64.deb

OpenTelemetry Collector configuration

# This is a significantly reduced config (removed additional log receivers, filters, etc.) but all the required components are here, and all the items related to the file_storage extension, queues, etc. is untouched.
# ---------------------------------------------------- #
# ---------------  EXTENSIONS ------------------------ #
# ---------------------------------------------------- #
extensions:
  file_storage/fingerprint:
    directory: /var/lib/otelcol/fingerprint
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /data/otelcol/compaction
      check_interval: 5m
      rebound_trigger_threshold_mib: 50
      rebound_needed_threshold_mib: 500

  file_storage/persistent_queue_storage:
    directory: /data/otelcol/persistent_queue_storage
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /data/otelcol/compaction
      check_interval: 5m
      rebound_trigger_threshold_mib: 50
      rebound_needed_threshold_mib: 500


# ---------------------------------------------------- #
# -------------  RECEIVERS --------------------------- #
# ---------------------------------------------------- #
receivers:
  filelog/syslog:
    include: "/var/log/syslog"
    start_at: beginning
    include_file_path: true
    max_log_size: 100MiB
    storage: file_storage/fingerprint

  prometheus:
    trim_metric_suffixes: true
    config:
      scrape_configs:
      - job_name: 'prometheus-node-exporter'
        scrape_interval: 1m
        static_configs:
        - targets: ['127.0.0.1:9100']


# ---------------------------------------------------- #
# ------------  PROCESSORS --------------------------- #
# ---------------------------------------------------- #
processors:
  batch:
    send_batch_size: 10000
    timeout: 5s


# ---------------------------------------------------- #
# -------------  EXPORTERS --------------------------- #
# ---------------------------------------------------- #
exporters:
  logging: null

# from all normal logs
  otlphttp/general:
    endpoint: https://foo.bar.com:443
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue_storage
      queue_size: 1000000000
      num_consumers: 2
    compression: gzip
    tls:
      insecure: false


# ---------------------------------------------------- #
# ---------------  SERVICE --------------------------- #
# ---------------------------------------------------- #
service:
  extensions:
    - "file_storage/fingerprint"
    - "file_storage/persistent_queue_storage"

  telemetry:
    logs:
      encoding: "json"
      disable_caller: true

  pipelines:
    logs/syslog:
      receivers:
        - "filelog/syslog"
      processors:
        - "batch"
      exporters:
        - "otlphttp/general"

    metrics:
      receivers:
        - "prometheus"
      processors:
        - "batch"
      exporters:
        - "otlphttp/general"

Log output

Oct 21 17:46:01 gw-0210027-02 systemd[1]: Started OpenTelemetry Collector Contrib.
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: time="2024-10-21T17:46:01+01:00" level=warning msg="DBUS_SESSION_BUS_ADDRESS envvar looks to be not set, this can lead to runaway dbus-daemon processes. T>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5634177,"msg":"Setting up own telemetry..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.563445,"msg":"OpenCensus bridge is disabled for Collector telemetry and will be removed in a future version, use --feature>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5635238,"msg":"Serving metrics","address":":9103","metrics level":"Normal"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654778,"msg":"Starting otelcol-contrib...","Version":"0.106.1","NumCPU":6}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654962,"msg":"Starting extensions..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655017,"msg":"Extension is starting...","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655134,"msg":"Extension started.","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.565517,"msg":"Extension is starting...","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655217,"msg":"Extension started.","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: panic: assertion failed: Page expected to be: 242, but self identifies as 8316213599780548146
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: goroutine 1 [running]:
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt._assert(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:1387
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*page).fastCheck(0x7f795bd4c000, 0xf2)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/page.go:57 +0x1d9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).page(0xfd3f188?, 0x7412180?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:534 +0x7b
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x2, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:546 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x1, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:555 +0xc8
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPage(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:542
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc00187db80, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:83 +0x114
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f795bca6020?, 0xc00225b360?, 0xc00347fa78?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:110 +0x90
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00353e220)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/bucket.go:411 +0x96
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc002450558, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:108 +0x255
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).freepages(0xc002253448)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:1205 +0x225
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:417 +0xc5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).doSlow(0x3fc8bc0?, 0xc002253610?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         sync/once.go:74 +0xc2
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).Do(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         sync/once.go:65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc002253448?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:413 +0x45
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.Open({0xc00225b2c0, 0x45}, 0x180, 0xc00353e650)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:295 +0x430
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.newClient(0xc001fbab00, {0xc00225b2c0?, 0xc00353e7c0?}, 0x2540be400, 0xc003315600, 0xc0?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage@v0.106.1/client.go:53 +0xaf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.(*localFileStorage).GetClient(0xc001bfd980, {0x0?, 0x0?}, 0x0?, {{{0xc003349010?, 0x0?}}, {0xc003349019?, 0x0?}}, {0x8d9233f, 0x4})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage@v0.106.1/extension.go:61 +0x3c5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.toStorageClient({0x9f97628, 0x10095f80}, {{{0xc003243da0, 0xc}}, {0xc003243dad, 0x18}}, {0x9f40c68?, 0xc0034ae070?}, {{{0xc003349010, 0x8}}, ...}, ...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/persistent_queue.go:499 +0x106
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Start(0xa02c220, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/persistent_queue.go:107 +0xa9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start(0x9ee6610, {0x9f97628, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/consumers.go:31 +0x4c
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start(0xc0033680b0, {0x9f97628?, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/exporterhelper/queue_sender.go:103 +0x65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start(0xc00337a160, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/exporterhelper/common.go:322 +0xbf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll(0xc00336a180, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070}, {0x9f71d50, 0xc001ecd110})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/service@v0.106.1/internal/graph/graph.go:425 +0x1cd
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service.(*Service).Start(0xc003442000, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/service@v0.106.1/service.go:217 +0x309
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/collector.go:218 +0x849
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/collector.go:272 +0x55
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.NewCommand.func1(0xc00244fb08, {0x8d9a719?, 0x7?, 0x8d9255f?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/command.go:37 +0xa7
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).execute(0xc00244fb08, {0xc0001d4010, 0x1, 0x1})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:985 +0xaca
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).ExecuteC(0xc00244fb08)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).Execute(0x916ac28?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:1041 +0x13
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.runInteractive({0x916ac28, {{0x8dc9f6c, 0xf}, {0x8e71b88, 0x1f}, {0x8d9a5a6, 0x7}}, 0x0, {{{0x0, 0x0, ...}, ...}}, ...})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:53 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.run(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:10
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.main()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:46 +0x205
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Failed with result 'exit-code'.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Scheduled restart job, restart counter is at 5.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: Stopped OpenTelemetry Collector Contrib.

Additional context

No response

@briandavis-viz briandavis-viz added bug Something isn't working needs triage New item requiring triage labels Oct 21, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@briandavis-viz
Copy link
Author

@djaglowski looks like you are the codeowner, is this something you can help with?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working extension/storage/filestorage needs triage New item requiring triage
Projects
None yet
Development

No branches or pull requests

1 participant