Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

future desirable: automatically add indexes to every column #10486

Closed
glycerine opened this issue Nov 6, 2016 · 10 comments
Closed

future desirable: automatically add indexes to every column #10486

glycerine opened this issue Nov 6, 2016 · 10 comments

Comments

@glycerine
Copy link
Contributor

glycerine commented Nov 6, 2016

// posting this my personal account... please delete/disregard #10485 which I mistaken posted from my work account.

A killer feature... which I predict would smash all resistance to wide adoption...

What if the cockroachdb user didn't have to predict in advance which columns will need indexes?

What if, instead, every column was just automatically indexed.

Current SQL databases require a long wait time to create a new index on an existing column that lacks one. As one's data gets big, this wait time gets very long, approaching un-usability. Why? because you do have to re-read basically all of your data.

Cockroachdb could add an -auto-indexing a flag, so that those who didn't want the extra storage overhead wouldn't pay for it. But for those who would benefit from the zero-wait time for fast searching against an existing column, this could be massively appealing.

Doesn't even seem too difficult to implement.

@maddyblue
Copy link
Contributor

maddyblue commented Nov 6, 2016

The problem with this is that it doesn't mean that all queries will use an index, because some queries filter/sort/join on multiple columns. Just having an index on a single column will indeed improve performance for some queries, but not all. So this could be dangerous because some users may not understand multi-column indexes and thus would expect that this feature would produce higher performance than reality.

The Google Datastore on App Engine is a good study of this. It, by default, indexes all columns (although you can turn that off per-column), and has some intelligence in their SDK that determines when a multi-column query is being executed and adds the appropriate index to be built. Doing something like this would make a bit more sense as it would achieve the expected performance gains. However it is much more difficult in CockroachDB than the Datastore because CockroachDB supports joins and more arbitrary queries than the Datastore.

@bdarnell
Copy link
Contributor

bdarnell commented Nov 6, 2016

Another problem is that each index has a cost beyond its storage - each insert needs to update all the indexes, which makes transactions touch more ranges.

Adding a column without an index is cheap, but adding a column with an index is expensive. So if every column gets an index, you've just made column additions as expensive as the index additions you're trying to avoid.

I think that support for analyzing query workloads (especially in development environments) and suggesting indexes to add would be helpful, but preemptively indexing everything is going to be too expensive.

@glycerine
Copy link
Contributor Author

glycerine commented Nov 6, 2016

preemptively indexing everything is going to be too expensive.

hmm... as a counter thought, is this not what full-text search engines, e.g. elasticsearch, are doing already? If they can do it, I don't see why it is a-priori too expensive... and sumo-logic claims this as a major selling point.

@glycerine
Copy link
Contributor Author

multi-column indexes

Fair point, but perfection is beating out the good in this case. "Better a diamond with a flaw than a pebble without."

@bdarnell
Copy link
Contributor

bdarnell commented Nov 6, 2016

is this not what full-text search engines, e.g. elasticsearch, are doing already?

Sort of, but full-text indexes are laid out differently than the indexes in CockroachDB and they're designed and optimized for this case from the ground up.

In any case, if you want a single-column index on every column, you can have this today in any database by simply adding the appropriate INDEX clause to all of your CREATE TABLE statements. You can try it and see how much it costs and how much of your indexing needs it satisfies. If it turns out to be useful in practice, we can consider ways of making this more automatic. In my experience, though, the majority of secondary indexes I've used have involved multiple columns.

@glycerine
Copy link
Contributor Author

glycerine commented Nov 6, 2016

full-text indexes are laid out differently than the indexes

I thought they were both monolithic sorted maps, sharded. just curious, what do you see as the differences?

adding the appropriate INDEX clause to all of your CREATE TABLE statements

Was really hoping to not have to pre-declare the schema either.

...but point well made. True too I can just make a sql table with two blob columns, a k and v table, with an index on k, and do the full-text logic in my own application. I thought there might be interest in the targeting the hybrid full-text applications space. Tongue in cheek, it's not like anybody has any use for full text search engines these days...

@bdarnell
Copy link
Contributor

bdarnell commented Nov 7, 2016

I thought they were both monolithic sorted maps, sharded. just curious, what do you see as the differences?

The difference is in how they are sharded. A typical full-text search system is sharded by document, so all the index data for a single document is in one place, and you have to query all shards to do a search but it's relatively cheap to add more terms to the query. In CockroachDB, indexing every column would scatter the data for a row across many index shards, and adding terms to the query makes it much more expensive as the data has to be pulled together from many places before it can be filtered.

Was really hoping to not have to pre-declare the schema either.

Ah, that's a different (and more interesting) case. We'd like to support json columns in the future (#2969), and it would make sense to offer new indexing strategies to go along with that.

We also want to provide full-text indexing of columns, but that's very different from just adding one of our current indexes to every column.

@glycerine
Copy link
Contributor Author

glycerine commented Nov 7, 2016

We also want to provide full-text indexing of columns

That is great to hear!

the index data for a single document is in one place

Ah. I would suppose this is so the process of taking the intersection of words in a phrase search is a local and not a distributed operation.

Seems reasonable, although having to query all shards for every search does seem super excessive.

More problematically, I don't see how this would map at all onto cockroach sharding where 64MB chunks can move anywhere at all. Is the cockroachdb sharding at odds with the traditional "sharding-by-doc" full-text search approach?

@bdarnell
Copy link
Contributor

bdarnell commented Nov 8, 2016

Well, you can still have full-text indexing where 64MB chunks of documents get moved around throughout the cluster, but yeah, full-text indexing is probably going to look fairly different from the indexing we currently do.

@glycerine
Copy link
Contributor Author

informative discussion, but no realistic feature request here. Closing it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants