Fix panic in vtgate when MakeRowTrusted gets bad result #4660

setassociative · 2019-02-22T21:54:36Z

Context

When a schema change is made to a tablet to add a new field VTTablet executions of select * may return querypb.QueryResult objects that have an outdated Fields attribute. This contradicts the docs which promise ...As returned by Execute, len(fields) is always equal to len(row) (for each row in rows).

The MakeRowTrusted implementation relies on that promise that len(fields) === len(row.Lengths) so when we hit the case described above the vtgate panics trying to construct a query result.

The stack trace that led us to this conclusion (I'd love alternate readings) follows:

go/src/runtime/panic.go:505
go/src/runtime/panic.go:28
src/vitess.io/vitess/go/sqltypes/result.go:194
src/vitess.io/vitess/go/sqltypes/proto3.go:79
src/vitess.io/vitess/go/sqltypes/proto3.go:108
src/vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:113
src/vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:170
src/vitess.io/vitess/go/vt/vtgate/gateway/discoverygateway.go:322
src/vitess.io/vitess/go/vt/vtgate/gateway/discoverygateway.go:133
src/vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:168
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:210
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:796
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:796
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:811
src/vitess.io/vitess/go/vt/vtgate/scatter_conn.go:187
src/vitess.io/vitess/go/vt/vtgate/vcursor_impl.go:162
src/vitess.io/vitess/go/vt/vtgate/engine/route.go:228
src/vitess.io/vitess/go/vt/vtgate/engine/route.go:188
src/vitess.io/vitess/go/vt/vtgate/executor.go:301
src/vitess.io/vitess/go/vt/vtgate/executor.go:175
src/vitess.io/vitess/go/vt/vtgate/executor.go:126
src/vitess.io/vitess/go/vt/vtgate/vtgate.go:288
src/vitess.io/vitess/go/vt/vtgate/plugin_mysql_server.go:154
src/vitess.io/vitess/go/mysql/conn.go:730
src/vitess.io/vitess/go/mysql/server.go:439
go/src/runtime/asm_amd64.s:2361

Change

This PR doesn't impact the root cause (tracked as #4661) but it does protect the vtgate from panicking while handling this case. This is beneficial for several reasons:

Until such a time as the VTTablet is using an updated schema we'll continue to successfully serve requests using the "old" (pre-column-add) schema. This is essential for applications anyway because schema changes aren't atomically applied so they already already support this behavior;
I believe that adding columns to a table is a more frequent change than removing columns and this means that the common-case is better supported without introducing any new failure modes;
If a similar bug is exposed during column removal we'll hit the same panic; that's not ideal but it seems more safe than silently ignoring that (likely) we're encoding a value into an incorrectly typed column;
In the case that columns are added and removed such that the net number of fields does not change this behaves the same as the previous case.

A Caveat: That said, a new failure mode is introduced if a column is added in the middle of a table: it will now be returned instead of having the vtgate panic and shares the same issues as item 4 above.

Testing

Unit+integration tests pass (per test.sh driver in vagrant setup), added unit tests to cover the case that was previously panicking.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>

…lumns Signed-off-by: Richard Bailey <rbailey@slack-corp.com>

sougou · 2019-02-22T23:33:28Z

go/sqltypes/result.go

@@ -185,13 +185,14 @@ func ResultsEqual(r1, r2 []Result) bool {
 // Every place this function is called, a comment is needed that explains
 // why it's justified.
 func MakeRowTrusted(fields []*querypb.Field, row *querypb.Row) []Value {
-	sqlRow := make([]Value, len(row.Lengths))
+	sqlRow := make([]Value, len(fields))


For further safety, we should use the min of both lengths, and then iterate up to that point. It's quite possible that a row is shorter than the fields received.

When I was talking about this with @setassociative, I suggested that in that case it would mean that you are removing a column. This is not backwards compatible, so we shouldn't worry about this.

However, after more thought, I think what you are suggesting makes sense. It could be possible that you change all your app to not use a column to be deprecated and then run into this bug because of a select *.

It depends on which shard you hit first. Let's say you started adding a column. If you hit the shard that has the new column first, and then go to an old shard, the rows will be shorter. If you go the other way round, the fields will be shorter.

This actually gives rise to another problem: If the rows are shorter, then the client may crash. So, the more correct fix will be to extend the result with null values to match the field length.

Oh yes. I did not think about the client crash.

Adding a null value to match the fields, sounds good.

I was trying to avoid a large behavioral change here (so not using min-length or NULL padding) because I'm very uncomfortable about increasing the number of scenarios where we return bad data (as opposed to just trading off where we fail).

Leaning only on the field length means that we will still panic in any case where we get caught in an intermediary state of removing a field but most of the times returning data in that state (I think?) would be distinctly bad anyway. As an example take the following table states:

Before DDL

column name type

id INT NOT NULL

name VARCHAR NOT NULL

address VARCHAR DEFAULT NULL

email VARCHAR NOT NULL

age INT DEFAULT NULL

After DDL

column name type

id INT NOT NULL

name VARCHAR NOT NULL

email VARCHAR NOT NULL

age INT DEFAULT NULL

If we use the min(field, values) length to construct our result we will be passing back emails as addresses and NULL ages as (non-nullable) emails etc.

Padding with NULL values seems problematic for similar reasons.

In any case, I'm happy to swap over to min-length and/or NULL padding but wanted us to be explicitly okay with those failure modes.

I vote for correctness as well.

Sorry, I dropped the thread on this one -- I wasn't sure that the scatter aggregation code didn't already handle this case so I wanted to verify before filing that as a separate issue. It sounds like we don't think it does though so I went ahead and filed #4669.

Should we consider ban SELECT * altogether?

I strongly disagree that we should ban it as it would definitely discourage adoption.

I think being up front about the fact that during schema changes we might have situations where select * returns a transient error is acceptable, but banning it would be a step in the wrong direction.

Vitess or no vitess, schema changes and select * make for an unpredictable situation, but I wouldn't want to discourage adoption by making people rewrite their whole app.

I was talking to @setassociative IRL. I think that with the expectation that the system will heal itself and eventually converge (which seems like it's what will happen if we do the pragmatic change + returning error while you are in this state), there is no need to be too strict with SELECT *.

demmer · 2019-02-24T17:23:43Z

It seems to me that instead of manufacturing null values or otherwise trying to return a semi-valid result in these cases, we should instead detect the row length mismatch at vtgate and simply return ER_BAD_FIELD_ERROR with an appropriate error message.

To Richard's point, if the rows don't share the same field signature, then the results are effectively indeterminate, so it seems better to explicitly tell the clients that something weird is happening than try to paper over the problem.

setassociative · 2019-02-26T17:05:07Z

Per commentary on PR I'm going to close in favor of actual root cause fixes in #4661 + #4669

Richard Bailey added 2 commits February 22, 2019 09:04

Update MakeRowTrusted to lean on known fields

7ddd6db

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>

don't mix reliance on row.Lengths and fields

0b4fa63

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>

setassociative requested a review from sougou as a code owner February 22, 2019 21:54

Added unit test to cover MakeRowTrusted in the case where fields < co…

3e09d5c

…lumns Signed-off-by: Richard Bailey <rbailey@slack-corp.com>

setassociative force-pushed the sa/makerowtrusted-patch branch from 9e54ced to 3e09d5c Compare February 22, 2019 21:56

setassociative mentioned this pull request Feb 22, 2019

VTTablet query.Result contains inconsistent field/values #4661

Closed

setassociative closed this Feb 22, 2019

setassociative reopened this Feb 22, 2019

sougou suggested changes Feb 22, 2019

View reviewed changes

setassociative mentioned this pull request Feb 25, 2019

When handling a scatter query vtgate should ensure consistent schema #4669

Open

setassociative closed this Feb 26, 2019

deepthi mentioned this pull request Jan 21, 2021

fix query panic when fields and row length is not matching #7325

Closed

8 tasks

tanjinx mentioned this pull request Jul 29, 2022

Slack vitess 12.0.5 cherrypicks slackhq/vitess#1

Merged

3 tasks

tanjinx mentioned this pull request Nov 29, 2022

patch upstream pr 4660 (Fix panic in vtgate when MakeRowTrusted gets bad result) slackhq/vitess#21

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix panic in vtgate when MakeRowTrusted gets bad result #4660

Fix panic in vtgate when MakeRowTrusted gets bad result #4660

setassociative commented Feb 22, 2019 •

edited

Loading

sougou Feb 22, 2019

rafael Feb 23, 2019

sougou Feb 23, 2019

rafael Feb 23, 2019

setassociative Feb 24, 2019

deepthi Feb 25, 2019

setassociative Feb 25, 2019 •

edited

Loading

rafael Feb 25, 2019

demmer Feb 25, 2019

rafael Feb 25, 2019

demmer commented Feb 24, 2019

setassociative commented Feb 26, 2019

column name	type
id	INT NOT NULL
name	VARCHAR NOT NULL
address	VARCHAR DEFAULT NULL
email	VARCHAR NOT NULL
age	INT DEFAULT NULL

Fix panic in vtgate when MakeRowTrusted gets bad result #4660

Fix panic in vtgate when MakeRowTrusted gets bad result #4660

Conversation

setassociative commented Feb 22, 2019 • edited Loading

Context

Change

Testing

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

setassociative Feb 25, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

demmer commented Feb 24, 2019

setassociative commented Feb 26, 2019

setassociative commented Feb 22, 2019 •

edited

Loading

setassociative Feb 25, 2019 •

edited

Loading