Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic in vtgate when MakeRowTrusted gets bad result #4660

Closed

Conversation

setassociative
Copy link

@setassociative setassociative commented Feb 22, 2019

Context

When a schema change is made to a tablet to add a new field VTTablet executions of select * may return querypb.QueryResult objects that have an outdated Fields attribute. This contradicts the docs which promise ...As returned by Execute, len(fields) is always equal to len(row) (for each row in rows).

The MakeRowTrusted implementation relies on that promise that len(fields) === len(row.Lengths) so when we hit the case described above the vtgate panics trying to construct a query result.

The stack trace that led us to this conclusion (I'd love alternate readings) follows:

go/src/runtime/panic.go:505
go/src/runtime/panic.go:28
src/vitess.io/vitess/go/sqltypes/result.go:194
src/vitess.io/vitess/go/sqltypes/proto3.go:79
src/vitess.io/vitess/go/sqltypes/proto3.go:108
src/vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:113
src/vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:170
src/vitess.io/vitess/go/vt/vtgate/gateway/discoverygateway.go:322
src/vitess.io/vitess/go/vt/vtgate/gateway/discoverygateway.go:133
src/vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:168
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:210
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:796
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:796
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:811
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:187
src/vitess.io/vitess/go/vt/vtgate/vcursor_impl.go:162
src/vitess.io/vitess/go/vt/vtgate/engine/route.go:228
src/vitess.io/vitess/go/vt/vtgate/engine/route.go:188
src/vitess.io/vitess/go/vt/vtgate/executor.go:301
src/vitess.io/vitess/go/vt/vtgate/executor.go:175
src/vitess.io/vitess/go/vt/vtgate/executor.go:126
src/vitess.io/vitess/go/vt/vtgate/vtgate.go:288
src/vitess.io/vitess/go/vt/vtgate/plugin_mysql_server.go:154
src/vitess.io/vitess/go/mysql/conn.go:730
src/vitess.io/vitess/go/mysql/server.go:439
go/src/runtime/asm_amd64.s:2361

Change

This PR doesn't impact the root cause (tracked as #4661) but it does protect the vtgate from panicking while handling this case. This is beneficial for several reasons:

  1. Until such a time as the VTTablet is using an updated schema we'll continue to successfully serve requests using the "old" (pre-column-add) schema. This is essential for applications anyway because schema changes aren't atomically applied so they already already support this behavior;
  2. I believe that adding columns to a table is a more frequent change than removing columns and this means that the common-case is better supported without introducing any new failure modes;
  3. If a similar bug is exposed during column removal we'll hit the same panic; that's not ideal but it seems more safe than silently ignoring that (likely) we're encoding a value into an incorrectly typed column;
  4. In the case that columns are added and removed such that the net number of fields does not change this behaves the same as the previous case.

A Caveat: That said, a new failure mode is introduced if a column is added in the middle of a table: it will now be returned instead of having the vtgate panic and shares the same issues as item 4 above.

Testing

Unit+integration tests pass (per test.sh driver in vagrant setup), added unit tests to cover the case that was previously panicking.

Richard Bailey added 2 commits February 22, 2019 09:04
Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
…lumns

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
@@ -185,13 +185,14 @@ func ResultsEqual(r1, r2 []Result) bool {
// Every place this function is called, a comment is needed that explains
// why it's justified.
func MakeRowTrusted(fields []*querypb.Field, row *querypb.Row) []Value {
sqlRow := make([]Value, len(row.Lengths))
sqlRow := make([]Value, len(fields))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For further safety, we should use the min of both lengths, and then iterate up to that point. It's quite possible that a row is shorter than the fields received.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was talking about this with @setassociative, I suggested that in that case it would mean that you are removing a column. This is not backwards compatible, so we shouldn't worry about this.

However, after more thought, I think what you are suggesting makes sense. It could be possible that you change all your app to not use a column to be deprecated and then run into this bug because of a select *.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on which shard you hit first. Let's say you started adding a column. If you hit the shard that has the new column first, and then go to an old shard, the rows will be shorter. If you go the other way round, the fields will be shorter.

This actually gives rise to another problem: If the rows are shorter, then the client may crash. So, the more correct fix will be to extend the result with null values to match the field length.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes. I did not think about the client crash.

Adding a null value to match the fields, sounds good.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid a large behavioral change here (so not using min-length or NULL padding) because I'm very uncomfortable about increasing the number of scenarios where we return bad data (as opposed to just trading off where we fail).

Leaning only on the field length means that we will still panic in any case where we get caught in an intermediary state of removing a field but most of the times returning data in that state (I think?) would be distinctly bad anyway. As an example take the following table states:

Before DDL

column name type
id INT NOT NULL
name VARCHAR NOT NULL
address VARCHAR DEFAULT NULL
email VARCHAR NOT NULL
age INT DEFAULT NULL

After DDL

column name type
id INT NOT NULL
name VARCHAR NOT NULL
email VARCHAR NOT NULL
age INT DEFAULT NULL

If we use the min(field, values) length to construct our result we will be passing back emails as addresses and NULL ages as (non-nullable) emails etc.

Padding with NULL values seems problematic for similar reasons.

In any case, I'm happy to swap over to min-length and/or NULL padding but wanted us to be explicitly okay with those failure modes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for correctness as well.

Copy link
Author

@setassociative setassociative Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I dropped the thread on this one -- I wasn't sure that the scatter aggregation code didn't already handle this case so I wanted to verify before filing that as a separate issue. It sounds like we don't think it does though so I went ahead and filed #4669.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider ban SELECT * altogether?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree that we should ban it as it would definitely discourage adoption.

I think being up front about the fact that during schema changes we might have situations where select * returns a transient error is acceptable, but banning it would be a step in the wrong direction.

Vitess or no vitess, schema changes and select * make for an unpredictable situation, but I wouldn't want to discourage adoption by making people rewrite their whole app.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking to @setassociative IRL. I think that with the expectation that the system will heal itself and eventually converge (which seems like it's what will happen if we do the pragmatic change + returning error while you are in this state), there is no need to be too strict with SELECT *.

@demmer
Copy link
Member

demmer commented Feb 24, 2019

It seems to me that instead of manufacturing null values or otherwise trying to return a semi-valid result in these cases, we should instead detect the row length mismatch at vtgate and simply return ER_BAD_FIELD_ERROR with an appropriate error message.

To Richard's point, if the rows don't share the same field signature, then the results are effectively indeterminate, so it seems better to explicitly tell the clients that something weird is happening than try to paper over the problem.

@setassociative
Copy link
Author

Per commentary on PR I'm going to close in favor of actual root cause fixes in #4661 + #4669

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants