This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Implement Pilosa IndexDriver #174

Closed
ajnavarro opened this issue Apr 26, 2018 · 8 comments · Fixed by #205

ajnavarro commented Apr 26, 2018

Create the pilosa IndexDriver implementation:

// IndexDriver manages the coordination between the indexes and their
// representation in disk.
type IndexDriver interface {
	// ID returns the unique name of the driver.
	ID() string
	// Create a new index.
	Create(path, table, db, id string, exprs []Expression) (Index, error)
	// Load the index at the given path.
	Load(path string) (Index, error)
	// Save the given index at the given path.
	Save(ctx context.Context, path string, index Index, iter IndexKeyValueIter) error
	// Delete the index with the given path.
	Delete(path string, index Index) error
}

And implement the Create and Save methods.
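A skeleton of such a driver might look like the following. This is a minimal sketch, not the real implementation: the Expression, Index, and IndexKeyValueIter types here are stand-in placeholders for the real ones in the sql package, and the driver name and struct fields are assumptions.

```go
package main

import (
	"context"
	"fmt"
)

// Minimal placeholders for the sql package types referenced by the
// interface above -- assumptions for the sake of a self-contained sketch.
type Expression interface{ String() string }
type Index interface{ ID() string }
type IndexKeyValueIter interface{}

// pilosaDriver is a hypothetical skeleton of the Pilosa IndexDriver.
type pilosaDriver struct {
	root string // base directory where index config files would be stored
}

// ID returns the unique name of the driver.
func (d *pilosaDriver) ID() string { return "pilosa" }

// Create would write the index metadata; the bitmaps are built later by Save.
func (d *pilosaDriver) Create(path, table, db, id string, exprs []Expression) (Index, error) {
	return nil, fmt.Errorf("not implemented")
}

// Load would read the index back from the given path.
func (d *pilosaDriver) Load(path string) (Index, error) {
	return nil, fmt.Errorf("not implemented")
}

// Save would iterate over the key/value pairs and populate Pilosa.
func (d *pilosaDriver) Save(ctx context.Context, path string, index Index, iter IndexKeyValueIter) error {
	return fmt.Errorf("not implemented")
}

// Delete would remove the on-disk index representation.
func (d *pilosaDriver) Delete(path string, index Index) error {
	return fmt.Errorf("not implemented")
}

func main() {
	d := &pilosaDriver{root: "/tmp/indexes"}
	fmt.Println(d.ID()) // pilosa
}
```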

@ajnavarro ajnavarro added this to the Index-1 milestone Apr 26, 2018
@kuba-- kuba-- added the enhancement New feature or request label May 7, 2018
@kuba-- kuba-- self-assigned this May 7, 2018
@ajnavarro ajnavarro changed the title IndexSaver pilosa index implementation IndexDriver pilosa index implementation May 7, 2018
@ajnavarro ajnavarro changed the title IndexDriver pilosa index implementation Implement IndexDriver Create and Save for pilosa index May 7, 2018
@kuba--

This comment has been minimized.

ajnavarro commented:

Thinking a bit more about this, the Load method should be Load(db, table string) ([]Index, error) instead of the previous signature, because before calling Load there is no way to know which index IDs we should load. WDYT @kuba--?


kuba-- commented May 11, 2018

@ajnavarro - If we want to load all indexes, then your proposal makes sense.
We could also add List(db, table string) (ids []string) and keep id in Load, so we can load only particular indexes, or load them one by one, instead of pulling everything into memory at once.
It really depends on whether we want to load everything at the beginning.
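The List-plus-Load-by-id alternative could look roughly like this toy, in-memory version (the registry map and return types are assumptions standing in for on-disk index configs and the real sql package types):

```go
package main

import (
	"fmt"
	"sort"
)

// toyDriver demonstrates the proposed List + Load-by-id API with an
// in-memory registry standing in for on-disk index configs.
type toyDriver struct {
	// indexes["db.table"] -> ids of indexes registered for that table
	indexes map[string][]string
}

// List returns the ids of all indexes registered for db.table,
// so callers can decide which ones to load.
func (d *toyDriver) List(db, table string) []string {
	ids := append([]string(nil), d.indexes[db+"."+table]...)
	sort.Strings(ids)
	return ids
}

// Load loads one particular index by id, so we can bring indexes into
// memory one by one instead of all at once.
func (d *toyDriver) Load(db, table, id string) (string, error) {
	for _, known := range d.indexes[db+"."+table] {
		if known == id {
			return db + "." + table + "." + id, nil
		}
	}
	return "", fmt.Errorf("index %q not found", id)
}

func main() {
	d := &toyDriver{indexes: map[string][]string{
		"mydb.tree_entries": {"lang_idx", "hash_idx"},
	}}
	fmt.Println(d.List("mydb", "tree_entries")) // [hash_idx lang_idx]
	idx, _ := d.Load("mydb", "tree_entries", "lang_idx")
	fmt.Println(idx) // mydb.tree_entries.lang_idx
}
```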


kuba-- commented May 15, 2018

Pilosa mapping - details

Pilosa indexes may be split into multiple frames, and every frame manages its own bitmap.
For our index implementation we create a frame for each indexed expression. E.g. for:
CREATE INDEX lang_idx ON tree_entries (name, hash)
we create one index tree_entries.lang_idx with two frames, name and hash.
Moreover, Pilosa index and frame names have the following restrictions:

  • maximum length is 64 characters
  • both must match the regexp ^[a-z][a-z0-9_-]*$

In Pilosa we'll store many indexes for many tables (and maybe for many databases), so if we want globally unique names we have to be careful. A frame name comes from an expression, so special characters may occur.
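A candidate name can be checked against the two restrictions above with the stdlib regexp package (the rule itself is taken from the list above; the function name is ours):

```go
package main

import (
	"fmt"
	"regexp"
)

// nameRE encodes the Pilosa index/frame name rule quoted above.
var nameRE = regexp.MustCompile(`^[a-z][a-z0-9_-]*$`)

// validName reports whether s is a legal Pilosa index/frame name:
// at most 64 characters and matching ^[a-z][a-z0-9_-]*$.
func validName(s string) bool {
	return len(s) <= 64 && nameRE.MatchString(s)
}

func main() {
	fmt.Println(validName("tree_entries")) // true
	fmt.Println(validName("Name"))         // false: uppercase
	fmt.Println(validName("hash(name)"))   // false: parentheses
}
```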

The first type of mapping is needed for keys/names.

  • Index name -> SHA1(db + "." + table + "." + id)
  • Frame name -> SHA1(expression.String())
    A SHA1 hex digest is a fixed-size (40-character) hexadecimal string. This kind of mapping does not use any external/third-party storage.
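The name mapping in the two bullets above can be sketched with the stdlib alone (the "." separator and field order follow the bullets; the function names are ours):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// indexName maps (db, table, id) to a fixed-size name via
// SHA1(db + "." + table + "." + id), as described above.
func indexName(db, table, id string) string {
	sum := sha1.Sum([]byte(db + "." + table + "." + id))
	return hex.EncodeToString(sum[:]) // always 40 hex characters
}

// frameName maps an expression string the same way.
func frameName(expr string) string {
	sum := sha1.Sum([]byte(expr))
	return hex.EncodeToString(sum[:])
}

func main() {
	name := indexName("mydb", "tree_entries", "lang_idx")
	fmt.Println(len(name))                                           // 40
	fmt.Println(name == indexName("mydb", "tree_entries", "lang_idx")) // true: deterministic
	fmt.Println(len(frameName("hash")))                              // 40
}
```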

The second type of mapping is needed for values. If we want to set a bit in Pilosa we have to specify row and column IDs.

As you can see in the Pilosa bitmaps, an index may have one or more frames, and all column IDs are shared by all frames. But every frame has its own enumeration of rows.
In our example we have one index and two frames (name, hash). Every distinct value will be mapped to the rowID in appropriate frame.

  • Expression + Value -> rowID
    The Expression determines a frame; the Value determines the rowID. If the value already exists (was already mapped), we reuse the same rowID; otherwise the next available number is assigned.
    We store this type of mapping in a local, file-based KV storage.

Example Mapping

In the example we iterate through values.
The first tuple comes with value A for column name and value 012 for column hash. Because no row is assigned to value A yet and it's the first tuple, it's mapped to bit (0, 0) in FRAME NAME. The same goes for value 012 in FRAME HASH.
The next tuple is (B, 345). There is no row assigned to value B in FRAME NAME and it's the second tuple, so we set bit (1, 1).
Now A is mapped to row 0 and B to row 1 in FRAME NAME, and 012 is mapped to row 0 and 345 to row 1 in FRAME HASH.
The third tuple is (A, 678). A was already assigned to row 0 in FRAME NAME, but hash 678 is a new value, so it was assigned the next available row (2) in FRAME HASH.
In other words, for the third column (columnID == 2), we set bit 0 (because of A) in FRAME NAME and bit 2 in FRAME HASH.
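The row-assignment rule in the walkthrough above can be reproduced with a per-frame map (an in-memory stand-in for the KV storage; the real implementation persists this mapping):

```go
package main

import "fmt"

// rowID returns the row for value in the given frame's mapping,
// assigning the next available number on first sight -- the rule
// described in the walkthrough above.
func rowID(frame map[string]uint64, value string) uint64 {
	if id, ok := frame[value]; ok {
		return id // value already mapped: reuse the same row
	}
	id := uint64(len(frame)) // next available row number
	frame[value] = id
	return id
}

func main() {
	name := map[string]uint64{}
	hash := map[string]uint64{}
	// The three tuples from the example above.
	tuples := [][2]string{{"A", "012"}, {"B", "345"}, {"A", "678"}}
	for col, t := range tuples {
		fmt.Printf("col %d: name bit (%d,%d), hash bit (%d,%d)\n",
			col, rowID(name, t[0]), col, rowID(hash, t[1]), col)
	}
}
```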

The first choice for KV storage was BadgerDB (github.com/dgraph-io/badger), but during implementation it turned out that BoltDB (github.com/boltdb/bolt) works better in our case. Both are file-based KV databases with many similarities, but:

  • In BadgerDB all keys live in a global namespace; BoltDB has buckets (and sub-buckets), so you can split keys across buckets. Because the same value can occur for different expressions, in BadgerDB we would have to concatenate Expression with Value and map that to a Row ID. With BadgerDB you have no isolation and have to figure out a good separator between Expression and Value, because listing all Values for a given Expression means querying storage by prefix. Buckets give us this kind of isolation: the Expression is the bucket name and the Value is a key inside that bucket, with no string concatenation needed.
  • BoltDB has bucket stats, so to get a Row ID or find the next available id we open a bucket, try to get the value, and if it doesn't exist take the total number of keys in that bucket. BadgerDB has no such stats and cannot answer the question "how many keys have this prefix?". So we would need to iterate over all keys with a certain prefix, compare keys, and if we don't find anything equal, assign the loop counter as the next Row ID.
  • BoltDB has a bucket cursor, which in our case simplifies the implementation of iterators. Badger has a similar concept of iterators, but it basically requires seeking with a prefix and checking whether the next key still has a valid prefix.
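The separator problem from the first bullet can be seen in a tiny example: with a flat keyspace (BadgerDB-style), concatenated keys from different (Expression, Value) pairs can collide whenever the separator can also appear in the data, while a bucket per expression (BoltDB-style) keeps them apart by construction. The "|" separator here is purely illustrative:

```go
package main

import "fmt"

func main() {
	sep := "|"

	// Flat keyspace: key = Expression + sep + Value.
	// Two distinct (Expression, Value) pairs produce the same key
	// because the separator also appears in the data.
	k1 := "name" + sep + "a|b" // Expression "name",   Value "a|b"
	k2 := "name|a" + sep + "b" // Expression "name|a", Value "b"
	fmt.Println(k1 == k2) // true: collision

	// Bucket per expression: keys only compete inside their own bucket,
	// so no separator is needed at all.
	buckets := map[string]map[string]bool{
		"name":   {"a|b": true},
		"name|a": {"b": true},
	}
	fmt.Println(len(buckets)) // 2: the pairs stay distinct
}
```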

How mapping can be used in the Pilosa index driver

  • Create - no direct mapping; we only create a config file.
  • Load - no direct mapping; we walk through the directory and read config files.
  • Save - map the names (index and frames) and create (if it does not exist) a Pilosa index with frames; after that, iterate through all values, map them (inside a bucket) to Row IDs, and set the bits.
  • Delete - delete the index directory (with the config file and mapping db), map the index name, and try to delete the index from the Pilosa server.
  • Create and Load return the sql.Index interface, which requires an implementation of sql.IndexLookup. For lookup we use a BoltDB bucket cursor to iterate over all keys.

ajnavarro commented:

Index name -> SHA1(db + "." + table + "." + id)

The index name should be just SHA1(db + "." + table), right?


kuba-- commented May 16, 2018

What if we have many indexes for the same table? That's why I appended the id as well.


ajnavarro commented May 17, 2018

Maybe I'm wrong, but if we don't put all of a table's indexes in the same Pilosa index, we won't be able to join them in the future using Intersection, Union, and Difference, right?


kuba-- commented May 17, 2018

All logical operations like Intersection, Union, and Difference take frames' bitmaps as arguments. So we can always create a new index as a union of multiple bitmaps, e.g.:

idx := schema.Index("old-name")
frm1 := idx.Frame("frame1")
frm2 := idx.Frame("frame2")

unionIdx := schema.Index("union-idx")
unionIdx.Union(frm1.Bitmap(0), frm2.Bitmap(0))

Does it make sense for us?

@kuba-- kuba-- added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels May 30, 2018
@kuba-- kuba-- changed the title Implement IndexDriver Create and Save for pilosa index Implement Pilosa IndexDriver May 30, 2018