-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix panic in vtgate when MakeRowTrusted gets bad result #4660
Conversation
Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
…lumns Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
9e54ced
to
3e09d5c
Compare
@@ -185,13 +185,14 @@ func ResultsEqual(r1, r2 []Result) bool { | |||
// Every place this function is called, a comment is needed that explains | |||
// why it's justified. | |||
func MakeRowTrusted(fields []*querypb.Field, row *querypb.Row) []Value { | |||
sqlRow := make([]Value, len(row.Lengths)) | |||
sqlRow := make([]Value, len(fields)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For further safety, we should use the min of both lengths, and then iterate up to that point. It's quite possible that a row is shorter than the fields received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was talking about this with @setassociative, I suggested that in that case it would mean that you are removing a column. This is not backwards compatible, so we shouldn't worry about this.
However, after more thought, I think what you are suggesting makes sense. It could be possible that you change all your app to not use a column to be deprecated and then run into this bug because of a select *
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on which shard you hit first. Let's say you started adding a column. If you hit the shard that has the new column first, and then go to an old shard, the rows will be shorter. If you go the other way round, the fields will be shorter.
This actually gives rise to another problem: If the rows are shorter, then the client may crash. So, the more correct fix will be to extend the result with null values to match the field length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes. I did not think about the client crash.
Adding a null value to match the fields, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to avoid a large behavioral change here (so not using min-length or NULL padding) because I'm very uncomfortable about increasing the number of scenarios where we return bad data (as opposed to just trading off where we fail).
Leaning only on the field length means that we will still panic in any case where we get caught in an intermediary state of removing a field but most of the times returning data in that state (I think?) would be distinctly bad anyway. As an example take the following table states:
Before DDL
column name | type |
---|---|
id | INT NOT NULL |
name | VARCHAR NOT NULL |
address | VARCHAR DEFAULT NULL |
VARCHAR NOT NULL | |
age | INT DEFAULT NULL |
After DDL
column name | type |
---|---|
id | INT NOT NULL |
name | VARCHAR NOT NULL |
VARCHAR NOT NULL | |
age | INT DEFAULT NULL |
If we use the min(field, values) length to construct our result we will be passing back emails as addresses and NULL ages as (non-nullable) emails etc.
Padding with NULL values seems problematic for similar reasons.
In any case, I'm happy to swap over to min-length and/or NULL padding but wanted us to be explicitly okay with those failure modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for correctness as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I dropped the thread on this one -- I wasn't sure that the scatter aggregation code didn't already handle this case so I wanted to verify before filing that as a separate issue. It sounds like we don't think it does though so I went ahead and filed #4669.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider ban SELECT *
altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly disagree that we should ban it as it would definitely discourage adoption.
I think being up front about the fact that during schema changes we might have situations where select * returns a transient error is acceptable, but banning it would be a step in the wrong direction.
Vitess or no vitess, schema changes and select *
make for an unpredictable situation, but I wouldn't want to discourage adoption by making people rewrite their whole app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was talking to @setassociative IRL. I think that with the expectation that the system will heal itself and eventually converge (which seems like it's what will happen if we do the pragmatic change + returning error while you are in this state), there is no need to be too strict with SELECT *
.
It seems to me that instead of manufacturing null values or otherwise trying to return a semi-valid result in these cases, we should instead detect the row length mismatch at vtgate and simply return To Richard's point, if the rows don't share the same field signature, then the results are effectively indeterminate, so it seems better to explicitly tell the clients that something weird is happening than try to paper over the problem. |
Context
When a schema change is made to a tablet to add a new field VTTablet executions of
select *
may returnquerypb.QueryResult
objects that have an outdatedFields
attribute. This contradicts the docs which promise...As returned by Execute, len(fields) is always equal to len(row) (for each row in rows).
The
MakeRowTrusted
implementation relies on that promise that len(fields) === len(row.Lengths) so when we hit the case described above the vtgate panics trying to construct a query result.The stack trace that led us to this conclusion (I'd love alternate readings) follows:
Change
This PR doesn't impact the root cause (tracked as #4661) but it does protect the vtgate from panicking while handling this case. This is beneficial for several reasons:
A Caveat: That said, a new failure mode is introduced if a column is added in the middle of a table: it will now be returned instead of having the vtgate panic and shares the same issues as item 4 above.
Testing
Unit+integration tests pass (per
test.sh
driver in vagrant setup), added unit tests to cover the case that was previously panicking.