future desirable: automatically add indexes to every column #10486
The problem with this is that it doesn't mean all queries will use an index, because some queries filter/sort/join on multiple columns. Having an index on a single column will indeed improve performance for some queries, but not all. So this could be dangerous: some users may not understand multi-column indexes and would expect this feature to produce higher performance than it actually delivers. The Google Datastore on App Engine is a good case study. By default it indexes all columns (although you can turn that off per-column), and its SDK has some intelligence that determines when a multi-column query is being executed and adds the appropriate index to be built. Doing something like that would make more sense, as it would achieve the expected performance gains. However, it is much more difficult in CockroachDB than in the Datastore, because CockroachDB supports joins and more arbitrary queries than the Datastore.
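The single-column-vs-composite point can be sketched with any SQL planner. The toy below uses Python's bundled sqlite3 purely as a stand-in (this is not CockroachDB, and the table and index names are invented for illustration): with only single-column indexes, a two-column equality filter can be served by at most one of them, while a composite index matches the whole filter.

```python
import sqlite3

# Toy schema; sqlite3 stands in for a SQL planner here -- this is not
# CockroachDB, and all table/index names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT, age INTEGER)")
cur.execute("CREATE INDEX idx_city ON users (city)")
cur.execute("CREATE INDEX idx_age ON users (age)")

# With only single-column indexes, a filter on both columns can use at
# most one index; the other predicate is checked row by row afterwards.
plan_single = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE city = 'NYC' AND age = 30"
).fetchall()

# A composite index covers the whole filter directly.
cur.execute("CREATE INDEX idx_city_age ON users (city, age)")
plan_composite = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE city = 'NYC' AND age = 30"
).fetchall()

print(plan_single)
print(plan_composite)
```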
Another problem is that each index has a cost beyond its storage - each insert needs to update all the indexes, which makes transactions touch more ranges. Adding a column without an index is cheap, but adding a column with an index is expensive. So if every column gets an index, you've just made column additions as expensive as the index additions you're trying to avoid. I think that support for analyzing query workloads (especially in development environments) and suggesting indexes to add would be helpful, but preemptively indexing everything is going to be too expensive.
Hmm... as a counter-thought, is this not what full-text search engines, e.g. Elasticsearch, are doing already? If they can do it, I don't see why it is a priori too expensive... and Sumo Logic claims this as a major selling point.
Fair point, but perfection is beating out the good in this case. "Better a diamond with a flaw than a pebble without."
Sort of, but full-text indexes are laid out differently than the indexes in CockroachDB, and they're designed and optimized for this case from the ground up. In any case, if you want a single-column index on every column, you can have this today in any database by simply adding the appropriate INDEX clause to all of your CREATE TABLE statements. You can try it and see how much it costs and how much of your indexing needs it satisfies. If it turns out to be useful in practice, we can consider ways of making this more automatic. In my experience, though, the majority of secondary indexes I've used have involved multiple columns.
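The "try it today" suggestion above could even be automated: generate one single-column index per column from the table's schema. A minimal sketch, again using sqlite3 as a stand-in for any SQL database (table, column, and index names are invented):

```python
import sqlite3

# Emulate "index every column": create one single-column index per
# column, driven by the table's schema. sqlite3 is only a stand-in
# here; names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (a INTEGER, b TEXT, c REAL)")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk).
cols = [row[1] for row in cur.execute("PRAGMA table_info(t)")]
for col in cols:
    cur.execute(f"CREATE INDEX idx_t_{col} ON t ({col})")

indexes = sorted(row[1] for row in cur.execute("PRAGMA index_list(t)"))
print(indexes)
```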
I thought they were both monolithic sorted maps, sharded. Just curious: what do you see as the differences?
Was really hoping not to have to pre-declare the schema either... but point well made. True, too, that I can just make a SQL table with two blob columns, k and v, with an index on k, and do the full-text logic in my own application. I thought there might be interest in targeting the hybrid full-text applications space. Tongue in cheek: it's not like anybody has any use for full-text search engines these days...
The difference is in how they are sharded. A typical full-text search system is sharded by document, so all the index data for a single document is in one place, and you have to query all shards to do a search but it's relatively cheap to add more terms to the query. In CockroachDB, indexing every column would scatter the data for a row across many index shards, and adding terms to the query makes it much more expensive as the data has to be pulled together from many places before it can be filtered.
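The sharding contrast described above can be made concrete with a toy model: a document-sharded layout fans out to every shard but intersects terms locally, while a term-sharded layout (roughly what per-column KV indexes would look like) fetches each term's postings from a different place before intersecting. This is a hedged Python sketch under invented names and data, not a claim about either system's actual wire protocol:

```python
from functools import reduce

# Toy corpus: doc id -> set of terms (invented data for illustration).
DOCS = {
    1: {"cockroach", "database", "sql"},
    2: {"search", "index", "database"},
    3: {"sql", "index", "cockroach"},
}

# --- Document-sharded layout: each shard fully indexes its own docs ---
def build_doc_shards(docs, n_shards=3):
    shards = [dict() for _ in range(n_shards)]
    for doc_id, terms in docs.items():
        shard = shards[doc_id % n_shards]
        for t in terms:
            shard.setdefault(t, set()).add(doc_id)
    return shards

def query_doc_sharded(shards, terms):
    touches, result = 0, set()
    for shard in shards:  # must fan out to every shard...
        touches += 1
        postings = [shard.get(t, set()) for t in terms]
        result |= reduce(set.intersection, postings)  # ...but intersect locally
    return result, touches

# --- Term-sharded layout: each term's postings live on its own "shard" ---
def build_term_index(docs):
    index = {}
    for doc_id, terms in docs.items():
        for t in terms:
            index.setdefault(t, set()).add(doc_id)
    return index

def query_term_sharded(index, terms):
    touches, postings = 0, []
    for t in terms:  # one remote fetch per term...
        touches += 1
        postings.append(index.get(t, set()))
    # ...and the data must be pulled together before it can be filtered.
    return reduce(set.intersection, postings), touches
```

In the toy, the document-sharded query touches every shard no matter how few terms the query has, while the term-sharded query's touches grow with the number of terms, echoing the cost trade-off described above.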
Ah, that's a different (and more interesting) case. We'd like to support json columns in the future (#2969), and it would make sense to offer new indexing strategies to go along with that. We also want to provide full-text indexing of columns, but that's very different from just adding one of our current indexes to every column.
That is great to hear!
Ah. I would suppose this is so the process of taking the intersection of words in a phrase search is a local and not a distributed operation. Seems reasonable, although having to query all shards for every search does seem super excessive. More problematically, I don't see how this would map at all onto cockroach sharding, where 64MB chunks can move anywhere at all. Is the cockroachdb sharding at odds with the traditional "sharding-by-doc" full-text search approach?
Well, you can still have full-text indexing where 64MB chunks of documents get moved around throughout the cluster, but yeah, full-text indexing is probably going to look fairly different from the indexing we currently do. |
Informative discussion, but no realistic feature request here. Closing it out.
// Posting this from my personal account... please delete/disregard #10485, which I mistakenly posted from my work account.
A killer feature... which I predict would smash all resistance to wide adoption...
What if the cockroachdb user didn't have to predict in advance which columns will need indexes?
What if, instead, every column was just automatically indexed.
Current SQL databases require a long wait to create a new index on an existing column that lacks one. As your data gets big, this wait gets very long, approaching unusability. Why? Because you have to re-read basically all of your data.
CockroachDB could add an auto-indexing flag, so that those who didn't want the extra storage overhead wouldn't pay for it. But for those who would benefit from zero wait time for fast searching against an existing column, this could be massively appealing.
Doesn't even seem too difficult to implement.