This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Implement Pilosa IndexDriver #174

Closed
ajnavarro opened this issue Apr 26, 2018 · 8 comments · Fixed by #205

ajnavarro commented Apr 26, 2018

Create the pilosa IndexDriver implementation:

// IndexDriver manages the coordination between the indexes and their
// representation in disk.
type IndexDriver interface {
	// ID returns the unique name of the driver.
	ID() string
	// Create a new index.
	Create(path, table, db, id string, exprs []Expression) (Index, error)
	// Load the index at the given path.
	Load(path string) (Index, error)
	// Save the given index at the given path.
	Save(ctx context.Context, path string, index Index, iter IndexKeyValueIter) error
	// Delete the index with the given path.
	Delete(path string, index Index) error
}

And implement the Create and Save methods.
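A skeleton of such a driver might look like the following. This is a minimal sketch, not the real implementation: the Expression, Index, and IndexKeyValueIter types here are stand-in placeholders for the real ones in the sql package, and the driver name and struct fields are assumptions.

```go
package main

import (
	"context"
	"fmt"
)

// Minimal placeholders for the sql package types referenced by the
// interface above -- assumptions for the sake of a self-contained sketch.
type Expression interface{ String() string }
type Index interface{ ID() string }
type IndexKeyValueIter interface{}

// pilosaDriver is a hypothetical skeleton of the Pilosa IndexDriver.
type pilosaDriver struct {
	root string // base directory where index config files would be stored
}

// ID returns the unique name of the driver.
func (d *pilosaDriver) ID() string { return "pilosa" }

// Create would write the index metadata; the bitmaps are built later by Save.
func (d *pilosaDriver) Create(path, table, db, id string, exprs []Expression) (Index, error) {
	return nil, fmt.Errorf("not implemented")
}

// Load would read the index back from the given path.
func (d *pilosaDriver) Load(path string) (Index, error) {
	return nil, fmt.Errorf("not implemented")
}

// Save would iterate over the key/value pairs and populate Pilosa.
func (d *pilosaDriver) Save(ctx context.Context, path string, index Index, iter IndexKeyValueIter) error {
	return fmt.Errorf("not implemented")
}

// Delete would remove the on-disk index representation.
func (d *pilosaDriver) Delete(path string, index Index) error {
	return fmt.Errorf("not implemented")
}

func main() {
	d := &pilosaDriver{root: "/tmp/indexes"}
	fmt.Println(d.ID()) // pilosa
}
```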

@ajnavarro ajnavarro added this to the Index-1 milestone Apr 26, 2018
@kuba-- kuba-- added the enhancement New feature or request label May 7, 2018
@kuba-- kuba-- self-assigned this May 7, 2018
@ajnavarro ajnavarro changed the title IndexSaver pilosa index implementation IndexDriver pilosa index implementation May 7, 2018
@ajnavarro ajnavarro changed the title IndexDriver pilosa index implementation Implement IndexDriver Create and Save for pilosa index May 7, 2018
@kuba--

This comment has been minimized.

ajnavarro commented:

Thinking a bit more about this, the Load method should be Load(db, table string) ([]Index, error) instead of the previous signature, because before calling Load there is no way to know which index IDs we should load. WDYT @kuba--?


kuba-- commented May 11, 2018

@ajnavarro - If we want to load all indexes, then your proposal makes sense.
We could also add List(db, table string) (ids []string) and keep id in Load, so we can load only particular indexes, or load them one by one, instead of pulling everything into memory at once.
It really depends on whether we want to load everything at the beginning.
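The List-plus-Load-by-id alternative could look roughly like this toy, in-memory version (the registry map and return types are assumptions standing in for on-disk index configs and the real sql package types):

```go
package main

import (
	"fmt"
	"sort"
)

// toyDriver demonstrates the proposed List + Load-by-id API with an
// in-memory registry standing in for on-disk index configs.
type toyDriver struct {
	// indexes["db.table"] -> ids of indexes registered for that table
	indexes map[string][]string
}

// List returns the ids of all indexes registered for db.table,
// so callers can decide which ones to load.
func (d *toyDriver) List(db, table string) []string {
	ids := append([]string(nil), d.indexes[db+"."+table]...)
	sort.Strings(ids)
	return ids
}

// Load loads one particular index by id, so we can bring indexes into
// memory one by one instead of all at once.
func (d *toyDriver) Load(db, table, id string) (string, error) {
	for _, known := range d.indexes[db+"."+table] {
		if known == id {
			return db + "." + table + "." + id, nil
		}
	}
	return "", fmt.Errorf("index %q not found", id)
}

func main() {
	d := &toyDriver{indexes: map[string][]string{
		"mydb.tree_entries": {"lang_idx", "hash_idx"},
	}}
	fmt.Println(d.List("mydb", "tree_entries")) // [hash_idx lang_idx]
	idx, _ := d.Load("mydb", "tree_entries", "lang_idx")
	fmt.Println(idx) // mydb.tree_entries.lang_idx
}
```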


kuba-- commented May 15, 2018

Pilosa mapping - details

Pilosa indexes may be split into multiple frames, and every frame manages its own bitmap.
For our index implementation we create a frame for each indexed expression. E.g. for:
CREATE INDEX lang_idx ON tree_entries (name, hash)
we create one index tree_entries.lang_idx with two frames, name and hash.
Moreover, Pilosa index and frame names have the following restrictions:

  • maximum length is 64 characters
  • both must match the regexp ^[a-z][a-z0-9_-]*$

In Pilosa we'll store many indexes for many tables (and maybe for many databases), so if we want globally unique names we have to be careful. A frame name comes from an expression, so special characters may occur.
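A candidate name can be checked against the two restrictions above with the stdlib regexp package (the rule itself is taken from the list above; the function name is ours):

```go
package main

import (
	"fmt"
	"regexp"
)

// nameRE encodes the Pilosa index/frame name rule quoted above.
var nameRE = regexp.MustCompile(`^[a-z][a-z0-9_-]*$`)

// validName reports whether s is a legal Pilosa index/frame name:
// at most 64 characters and matching ^[a-z][a-z0-9_-]*$.
func validName(s string) bool {
	return len(s) <= 64 && nameRE.MatchString(s)
}

func main() {
	fmt.Println(validName("tree_entries")) // true
	fmt.Println(validName("Name"))         // false: uppercase
	fmt.Println(validName("hash(name)"))   // false: parentheses
}
```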

The first type of mapping is needed for keys/names.

  • Index name -> SHA1(db + "." + table + "." + id)
  • Frame name -> SHA1(expression.String())
    A SHA1 hex digest is a fixed-size (40-character) hexadecimal string. This kind of mapping does not use any external/third-party storage.
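The name mapping in the two bullets above can be sketched with the stdlib alone (the "." separator and field order follow the bullets; the function names are ours):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// indexName maps (db, table, id) to a fixed-size name via
// SHA1(db + "." + table + "." + id), as described above.
func indexName(db, table, id string) string {
	sum := sha1.Sum([]byte(db + "." + table + "." + id))
	return hex.EncodeToString(sum[:]) // always 40 hex characters
}

// frameName maps an expression string the same way.
func frameName(expr string) string {
	sum := sha1.Sum([]byte(expr))
	return hex.EncodeToString(sum[:])
}

func main() {
	name := indexName("mydb", "tree_entries", "lang_idx")
	fmt.Println(len(name))                                           // 40
	fmt.Println(name == indexName("mydb", "tree_entries", "lang_idx")) // true: deterministic
	fmt.Println(len(frameName("hash")))                              // 40
}
```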

The second type of mapping is needed for values. If we want to set a bit in Pilosa we have to specify row and column IDs.

As you can see in the Pilosa bitmaps, an index may have one or more frames, and all column IDs are shared by all frames. But every frame has its own enumeration of rows.
In our example we have one index and two frames (name, hash). Every distinct value will be mapped to the rowID in appropriate frame.

  • Expression + Value -> rowID
    The Expression determines a frame; the Value determines the rowID. If the value already exists (was already mapped), we reuse the same rowID; otherwise the next available number is assigned.
    We store this type of mapping in a local, file-based KV storage.

Example Mapping

In the example we iterate through values.
The first tuple comes with value A for column name and value 012 for column hash. Because no row is assigned to value A yet and it's the first tuple, it's mapped to bit (0, 0) in FRAME NAME. The same goes for value 012 in FRAME HASH.
The next tuple is (B, 345). There is no row assigned to value B in FRAME NAME and it's the second tuple, so we set bit (1, 1).
Now A is mapped to row 0 and B to row 1 in FRAME NAME, and 012 is mapped to row 0 and 345 to row 1 in FRAME HASH.
The third tuple is (A, 678). A was already assigned to row 0 in FRAME NAME, but hash 678 is a new value, so it was assigned the next available row (2) in FRAME HASH.
In other words, for the third column (columnID == 2), we set bit 0 (because of A) in FRAME NAME and bit 2 in FRAME HASH.
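The row-assignment rule in the walkthrough above can be reproduced with a per-frame map (an in-memory stand-in for the KV storage; the real implementation persists this mapping):

```go
package main

import "fmt"

// rowID returns the row for value in the given frame's mapping,
// assigning the next available number on first sight -- the rule
// described in the walkthrough above.
func rowID(frame map[string]uint64, value string) uint64 {
	if id, ok := frame[value]; ok {
		return id // value already mapped: reuse the same row
	}
	id := uint64(len(frame)) // next available row number
	frame[value] = id
	return id
}

func main() {
	name := map[string]uint64{}
	hash := map[string]uint64{}
	// The three tuples from the example above.
	tuples := [][2]string{{"A", "012"}, {"B", "345"}, {"A", "678"}}
	for col, t := range tuples {
		fmt.Printf("col %d: name bit (%d,%d), hash bit (%d,%d)\n",
			col, rowID(name, t[0]), col, rowID(hash, t[1]), col)
	}
}
```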

The first choice for KV storage was BadgerDB (github.com/dgraph-io/badger), but during implementation it turned out that BoltDB (github.com/boltdb/bolt) works better in our case. Both are file-based KV databases with many similarities, but:

  • In BadgerDB all keys live in a global namespace; BoltDB has buckets (and sub-buckets), so you can split keys across buckets. Because the same value can occur for different expressions, in BadgerDB we would have to concatenate Expression with Value and map that to a Row ID. With BadgerDB you have no isolation and have to figure out a good separator between Expression and Value, because listing all Values for a given Expression means querying storage by prefix. Buckets give us this kind of isolation: the Expression is the bucket name and the Value is a key inside that bucket, with no string concatenation needed.
  • BoltDB has bucket stats, so to get a Row ID or find the next available id we open a bucket, try to get the value, and if it doesn't exist take the total number of keys in that bucket. BadgerDB has no such stats and cannot answer the question "how many keys have this prefix?". So we would need to iterate over all keys with a certain prefix, compare keys, and if we don't find anything equal, assign the loop counter as the next Row ID.
  • BoltDB has a bucket cursor, which in our case simplifies the implementation of iterators. Badger has a similar concept of iterators, but it basically requires seeking with a prefix and checking whether the next key still has a valid prefix.
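The separator problem from the first bullet can be seen in a tiny example: with a flat keyspace (BadgerDB-style), concatenated keys from different (Expression, Value) pairs can collide whenever the separator can also appear in the data, while a bucket per expression (BoltDB-style) keeps them apart by construction. The "|" separator here is purely illustrative:

```go
package main

import "fmt"

func main() {
	sep := "|"

	// Flat keyspace: key = Expression + sep + Value.
	// Two distinct (Expression, Value) pairs produce the same key
	// because the separator also appears in the data.
	k1 := "name" + sep + "a|b" // Expression "name",   Value "a|b"
	k2 := "name|a" + sep + "b" // Expression "name|a", Value "b"
	fmt.Println(k1 == k2) // true: collision

	// Bucket per expression: keys only compete inside their own bucket,
	// so no separator is needed at all.
	buckets := map[string]map[string]bool{
		"name":   {"a|b": true},
		"name|a": {"b": true},
	}
	fmt.Println(len(buckets)) // 2: the pairs stay distinct
}
```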

How mapping can be used in the Pilosa index driver

  • Create - no direct mapping; we only create a config file.
  • Load - no direct mapping; we walk through the directory and read config files.
  • Save - map the names (index and frames) and create (if it does not exist) a Pilosa index with frames; after that, iterate through all values, map them (inside a bucket) to Row IDs, and set the bits.
  • Delete - delete the index directory (with the config file and mapping db), map the index name, and try to delete the index from the Pilosa server.
  • Create and Load return the sql.Index interface, which requires an implementation of sql.IndexLookup. For lookup we use a BoltDB bucket cursor to iterate over all keys.

ajnavarro commented:

Index name -> SHA1(db + "." + table + "." + id)

The index name should be just SHA1(db + "." + table), right?


kuba-- commented May 16, 2018

What if we have many indexes for the same table? That's why I appended the id as well.


ajnavarro commented May 17, 2018

Maybe I'm wrong, but if we don't put all of a table's indexes in the same Pilosa index, we won't be able to join them in the future using Intersection, Union, and Difference, right?


kuba-- commented May 17, 2018

All logical operations like Intersection, Union, and Difference take frames' bitmaps as arguments. So we can always create a new index as a union of multiple bitmaps, e.g.:

idx := schema.Index("old-name")
frm1 := idx.Frame("frame1")
frm2 := idx.Frame("frame2")

unionIdx := schema.Index("union-idx")
unionIdx.Union(frm1.Bitmap(0), frm2.Bitmap(0))

Does it make sense for us?

@kuba-- kuba-- added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels May 30, 2018
@kuba-- kuba-- changed the title Implement IndexDriver Create and Save for pilosa index Implement Pilosa IndexDriver May 30, 2018