SHOW VITESS_SHARDS intermittently returns an empty list of shards #5038

Closed
tirsen opened this issue Jul 30, 2019 · 6 comments · Fixed by #5189

Comments

@tirsen
Collaborator

tirsen commented Jul 30, 2019

SHOW VITESS_SHARDS intermittently returns an empty list of shards

I think the issue can be traced to these lines of code at go/vt/vtgate/executor.go:800:

			_, _, shards, err := e.resolver.resolver.GetKeyspaceShards(ctx, keyspace, destTabletType)
			if err != nil {
				// There might be a misconfigured keyspace or no shards in the keyspace.
				// Skip any errors and move on.
				continue
			}

As you can see, errors are ignored, so if there is, for example, an intermittent connectivity problem with the topology server, we return an empty list rather than signaling an error.

I think the correct solution is to catch and ignore only specific, expected errors and pass everything else on.
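
For illustration, here is a minimal, self-contained sketch of that behaviour (shardsForKeyspace, errNoShards and showShards are hypothetical stand-ins, not the actual Vitess code):

package main

import (
	"errors"
	"fmt"
)

// errNoShards stands in for the "misconfigured keyspace or no shards"
// condition that is safe to skip.
var errNoShards = errors.New("no shards in keyspace")

// shardsForKeyspace is a stand-in for GetKeyspaceShards.
func shardsForKeyspace(keyspace string) ([]string, error) {
	switch keyspace {
	case "commerce":
		return []string{"0"}, nil
	case "empty":
		return nil, errNoShards
	default:
		return nil, errors.New("topo server unreachable")
	}
}

// showShards skips only the benign "no shards" case and propagates any other
// error (e.g. an intermittent topo connectivity problem) to the caller
// instead of silently returning an empty result.
func showShards(keyspaces []string) ([]string, error) {
	var result []string
	for _, ks := range keyspaces {
		shards, err := shardsForKeyspace(ks)
		if err != nil {
			if errors.Is(err, errNoShards) {
				continue
			}
			return nil, err
		}
		for _, shard := range shards {
			result = append(result, ks+"/"+shard)
		}
	}
	return result, nil
}

func main() {
	shards, err := showShards([]string{"commerce", "empty"})
	fmt.Println(shards, err)
}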

@tirsen
Collaborator Author

tirsen commented Jul 31, 2019

I think the main issue is that this hits Zookeeper, and we run SHOW VITESS_SHARDS as part of every single request for one of our apps. This means we send pretty heavy qps to Zookeeper, and sometimes it just can't keep up.

We should ideally be caching this instead.

@tirsen
Collaborator Author

tirsen commented Aug 1, 2019

Actually, following the codepath further, I can see that this does not hit Zookeeper directly for every request. GetKeyspaceShards uses ResilientServer.GetSrvKeyspace, which uses a local cache and a Zookeeper watcher to keep it up to date.

I think that's great, but it doesn't explain why SHOW VITESS_SHARDS sometimes returns zero shards. Will keep digging.

@aquarapid
Contributor

I can easily reproduce this locally against a standalone vtgate and the example DB:

$ cat test_issue_5038.py 
import MySQLdb

# Connect to vtgate (the local example cluster listens on 15306).
db = MySQLdb.connect("127.0.0.1", "root", db="commerce", port=15306)
cursor = db.cursor()

for i in xrange(0, 10000):
  cursor.execute("SHOW VITESS_SHARDS;")
  data = cursor.fetchone()
  try:
    # The example keyspace is unsharded, so we always expect "commerce/0".
    if data[0] != "commerce/0":
      print data
  except TypeError:
    # fetchone() returned None, i.e. an empty shard list.
    print data, i

db.close()

Running this gives random failures, typically around 2 per 10000 selects on my machine, though the exact count varies, e.g.:

$ python test_issue_5038.py 
None 5671
None 8554

I will dig further.

@aquarapid
Contributor

This is definitely related to the ResilientServer cache refresh. If I bump the vtgate srv_topo_cache_ttl to something high (say 60s), I can run the testcase without any failures.

@aquarapid
Contributor

So, there's a race in the default configuration, even when ZK isn't being slow, because we have the TTL and refresh period set to the same value (1 second). One solution is to make sure the refresh period is lower than the TTL (e.g. TTL 2s, leave refresh as default of 1s). We could consider changing the defaults to this. Another fix would be to eliminate the race by resetting the entry.lastQueryTime in resilient_server.go to the entry.insertionTime when the topology has been refreshed.
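
To make the race concrete, here is a toy model of a TTL'd cache entry (not the real resilient_server.go; the names are just borrowed from the discussion above). With the refresh period equal to the TTL, a read can land just after the entry expires but before the next refresh completes; re-arming the TTL on every successful refresh keeps the entry valid:

package main

import (
	"fmt"
	"time"
)

// Toy model of a cached SrvKeyspace entry.
type entry struct {
	value         []string
	insertionTime time.Time
}

const ttl = 1 * time.Second // models the srv_topo_cache_ttl default

// get serves the value only while it is within its TTL. A read that arrives
// just after expiry, but before the background refresh has landed, sees
// nothing, which surfaces as an empty SHOW VITESS_SHARDS result.
func (e *entry) get() ([]string, bool) {
	if time.Since(e.insertionTime) >= ttl {
		return nil, false
	}
	return e.value, true
}

// refresh models the background topo re-request. Bumping insertionTime on
// every successful refresh keeps the entry continuously valid for as long
// as refreshes keep succeeding.
func (e *entry) refresh(value []string) {
	e.value = value
	e.insertionTime = time.Now()
}

func main() {
	e := &entry{value: []string{"commerce/0"}, insertionTime: time.Now()}

	time.Sleep(ttl + 10*time.Millisecond) // a read lands just past the TTL
	if _, ok := e.get(); !ok {
		fmt.Println("cache miss: would be reported as zero shards")
	}

	e.refresh([]string{"commerce/0"}) // the refresh re-arms the TTL
	fmt.Println(e.get())
}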

@tirsen
Collaborator Author

tirsen commented Sep 11, 2019

Having a fix for this would be really nice. We're using a workaround, but it's quite unsatisfactory and will likely cause issues around shard splits.

aquarapid added a commit to planetscale/vitess that referenced this issue Feb 26, 2020
…uested, instead of zeroing it out.

Signed-off-by: Jacques Grove <aquarapid@gmail.com>