Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] index-bboltdb: Nil Pointer at /go.etcd.io/bbolt/freelist.go:265 #7445

Open
liguozhong opened this issue Oct 18, 2022 · 4 comments
Open

Comments

@liguozhong
Copy link
Contributor

liguozhong commented Oct 18, 2022

Describe the bug
A clear and concise description of what the bug is.
from slack loki channel @xzyl

https://grafana.slack.com/archives/CEPJRLQNL/p1666062333990409

Hello, when I query from Grafana explore, loki-read pod got some errors:

level=error ts=2022-10-18T02:26:04.511028763Z caller=index_set.go:265 table-name=loki_index_19282 msg="sync failed, retrying it" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T02:26:04.511080618Z caller=index_set.go:104 table-name=loki_index_19282 msg="failed to initialize table loki_index_19282, cleaning it up" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T02:26:04.511107385Z caller=table.go:294 table-name=loki_index_19282 msg="failed to init user index set " err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T02:26:04.511149301Z caller=table.go:312 table-name=loki_index_19282 org_id=fake msg="index set  has some problem, cleaning it up" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
and no result returned.

I deploy loki with helm ( chart loki-simple-scalable-1.8.11, loki 2.6.1), and a S3-like storage.
How should I fix this? Thx in advance.

level=info ts=2022-10-18T03:51:57.892311095Z caller=util.go:85 table-name=loki_index_19282 file-name=loki-write-1-1666013405437251700-1666057500.gz msg="downloaded file" total_time=157.582818ms
panic: runtime error: invalid memory address or nil pointer dereference
goroutine 5130 [running]:
github.com/grafana/loki/pkg/storage/stores/shipper/util.logPanic({0x216af80, 0xc00076d710})
	/src/loki/pkg/storage/stores/shipper/util/util.go:95 +0x4d
github.com/grafana/loki/pkg/storage/stores/shipper/util.safeOpenBoltDbFile.func1()
	/src/loki/pkg/storage/stores/shipper/util/util.go:54 +0x51
panic({0x216af80, 0xc00076d710})
	/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
	/src/loki/vendor/go.etcd.io/bbolt/freelist.go:265 +0x26
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
	/src/loki/vendor/go.etcd.io/bbolt/db.go:323 +0xb8
sync.(*Once).doSlow(0xc003529a88?, 0x10?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc0035298c0?)
	/src/loki/vendor/go.etcd.io/bbolt/db.go:316 +0x47
go.etcd.io/bbolt.Open({0xc00381c3c0, 0x5b}, 0x3ed3cf0?, 0xc001c7bf48)
	/src/loki/vendor/go.etcd.io/bbolt/db.go:293 +0x48b
github.com/grafana/loki/pkg/storage/chunk/client/local.OpenBoltdbFile(...)
	/src/loki/pkg/storage/chunk/client/local/boltdb_index_client.go:404
github.com/grafana/loki/pkg/storage/stores/shipper/util.safeOpenBoltDbFile({0xc00381c3c0, 0x5b}, 0xc00381c420)
	/src/loki/pkg/storage/stores/shipper/util/util.go:62 +0x10a
created by github.com/grafana/loki/pkg/storage/stores/shipper/util.SafeOpenBoltdbFile
	/src/loki/pkg/storage/stores/shipper/util/util.go:42 +0x98
level=error ts=2022-10-18T03:51:57.941516304Z caller=index_set.go:266 table-name=loki_index_19282 msg="sync failed, retrying it" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=info ts=2022-10-18T03:51:57.978115283Z caller=util.go:85 table-name=loki_index_19282 file-name=loki-write-1-1666013405437251700-1666050300.gz msg="downloaded file" total_time=14.854231ms
		db.freelist.read(db.page(db.meta().freelist))

nil pointer here.

p:= db.page(db.meta().freelist)
p==nil
p.flags panic

// loadFreelist reads the freelist if it is synced, or reconstructs it
// by scanning the DB if it is not synced. It assumes there are no
// concurrent accesses being made to the freelist.
func (db *DB) loadFreelist() {
	db.freelistLoad.Do(func() {
		db.freelist = newFreelist(db.FreelistType)
		if !db.hasSyncedFreelist() {
			// Reconstruct free list by scanning the DB.
			db.freelist.readIDs(db.freepages())
		} else {
			// Read free list from freelist page.
			db.freelist.read(db.page(db.meta().freelist))
		}
		db.stats.FreePageN = db.freelist.free_count()
	})
}

// read initializes the freelist from a freelist page.
func (f *freelist) read(p *page) {
	if (p.flags & freelistPageFlag) == 0 {
		panic(fmt.Sprintf("invalid freelist page: %d, page type is %s", p.id, p.typ()))
	}
}

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (SHA or version)
  2. Started Promtail (SHA or version) to tail '...'
  3. Query: {} term

Expected behavior

Environment:

  • Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
  • Deployment tool: [e.g., helm, jsonnet]

Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.

@liguozhong liguozhong changed the title [bbolt] Nil Pointer [bbolt] Nil Pointer :/go.etcd.io/bbolt/freelist.go:265 Oct 18, 2022
@liguozhong liguozhong changed the title [bbolt] Nil Pointer :/go.etcd.io/bbolt/freelist.go:265 [bug] index-bboltdb: Nil Pointer at /go.etcd.io/bbolt/freelist.go:265 Oct 18, 2022
@liguozhong
Copy link
Contributor Author

liguozhong commented Oct 18, 2022

another boltdb panic issue:
#2970.

this issue just delete the problem file,

so this bug:Is this due to
1: the wrong way we use boltdb ,
2: or is it due to a bug in the boltdb code itself,
3: or an operating system error,
4: or a disk error?

@liguozhong
Copy link
Contributor Author

ref pr:

boltdbshipper: fix crash of "timeout" panic when boltdbshipper do init
#3369
Loki: Revert reduce list operations and crash after timeout:
#3396

@liguozhong
Copy link
Contributor Author

Immediately after panic log, I see the logs related to cleaning up. It seems that the boltdb file that cannot be opened will be automatically deleted. But why the file won't open is still a mystery.

level=error ts=2022-10-18T03:51:58.121617218Z caller=index_set.go:266 table-name=loki_index_19282 msg="sync failed, retrying it" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T03:51:58.121639936Z caller=index_set.go:101 table-name=loki_index_19282 msg="failed to initialize table loki_index_19282, cleaning it up" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T03:51:58.121659348Z caller=table.go:294 table-name=loki_index_19282 msg="failed to init user index set " err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-10-18T03:51:58.12167318Z caller=table.go:312 table-name=loki_index_19282 org_id=fake msg="index set  has some problem, cleaning it up" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"

@liguozhong
Copy link
Contributor Author

// Init downloads all the db files for the table from object storage.
func (t *indexSet) Init(forQuerying bool) (err error) {
	// Using background context to avoid cancellation of download when request times out.
	// We would anyways need the files for serving next requests.
	ctx, cancelFunc := context.WithTimeout(context.Background(), downloadTimeout)
	t.cancelFunc = cancelFunc

	logger := spanlogger.FromContextWithFallback(ctx, t.logger)

	defer func() {
		if err != nil {
			level.Error(t.logger).Log("msg", fmt.Sprintf("failed to initialize table %s, cleaning it up", t.tableName), "err", err)
			t.err = err

			// cleaning up files due to error to avoid returning invalid results.
			for fileName := range t.index {
				if err := t.cleanupDB(fileName); err != nil {
					level.Error(t.logger).Log("msg", "failed to cleanup partially downloaded file", "filename", fileName, "err", err)
				}
			}
		}
		t.cancelFunc()
		t.indexMtx.markReady()
	}()

	dirEntries, err := os.ReadDir(t.cacheLocation)
	if err != nil {
		return err
	}

	// open all the locally present files first to avoid downloading them again during sync operation below.
	for _, entry := range dirEntries {
		if entry.IsDir() {
			continue
		}

		fullPath := filepath.Join(t.cacheLocation, entry.Name())
		// if we fail to open an index file, lets skip it and let sync operation re-download the file from storage.
		idx, err := t.openIndexFileFunc(fullPath)
		if err != nil {
			level.Error(t.logger).Log("msg", fmt.Sprintf("failed to open existing index file %s, removing the file and continuing without it to let the sync operation catch up", fullPath), "err", err)
			// Sometimes files get corrupted when the process gets killed in the middle of a download operation which can cause problems in reading the file.
			// Implementation of openIndexFileFunc should take care of gracefully handling corrupted files.
			// Let us just remove the file and let the sync operation re-download it.
			if err := os.Remove(fullPath); err != nil {
				level.Error(t.logger).Log("msg", fmt.Sprintf("failed to remove index file %s which failed to open", fullPath))
			}
			continue
		}

		t.index[entry.Name()] = idx
	}

	level.Debug(logger).Log("msg", fmt.Sprintf("opened %d local files, now starting sync operation", len(t.index)))

	// sync the table to get new files and remove the deleted ones from storage.
	err = t.syncWithRetry(ctx, false, forQuerying)
	if err != nil {
		return
	}

	level.Debug(logger).Log("msg", "finished syncing files")

	return
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant