[verifier] Add 'run-determinism-analysis-on-test' property to VerifierConfig #24492
I'm confused about what the problem is and how this fixes it. If you have something that deterministically returns some result in control and non-deterministically returns another result in test, then you've introduced a bug. Wouldn't this hide those issues?
@rschlussel This, however, will not happen often and is a calculated risk we can take with Presto C++ to reduce the compute wasted/occupied in the production environment by verification. The initial problem is that when we specify the test cluster as helper, ALL queries with different results are marked as non-deterministic, because we run the duplicates on test and always compare with control's checksum. Did this answer your concern/question?
Discussed offline with @rschlussel. We decided to have a two-stage determinism analysis in our case:
If the flag is not specified, we act as before: run a single determinism analysis on helper with the control checksum.
@rschlussel Updated the PR.
    return determinismAnalyzer.analyze(control, matchResult.getControlChecksum(), getControlAction());
}
// Default behavior - we run determinism analysis, which uses helper action with control checksum.
return determinismAnalyzer.analyze(control, matchResult.getControlChecksum(), null);
I'm not sure it makes sense to ever use the "helper" action here. If you want to check whether something is deterministic, you need to run it in the original environment. We should change the behavior and always pass the control action. @singcha will this break anything for presto-on-spark verification?
@rschlussel
I understand what you are saying and was quite surprised to see that we use helper action to do determinism analysis.
However, this change is to allow less compute on the control side rather than to fix the determinism analysis.
I would like to keep this change in its scope and avoid changing anything outside of it.
Can we follow this PR up with moving determinism analysis to control by default?
presto-verifier/src/main/java/com/facebook/presto/verifier/framework/DataVerification.java
{
    // In case the action is not specified, use the default one we were constructed with.
I think it would make sense to remove the queryAction field from the class and then always pass in the queryAction here (never let it be null).
@rschlussel
We will still need HelperAction to run checksum and other types of helper queries.
I wonder if we can refactor this in the followup PR?
This PR would really help us unload prod clusters sooner and not break any existing use cases.
Can you do it in a second commit in this PR? That way a revert is easier if something breaks with the refactoring, but if nothing breaks, we leave things in a better state.
@rschlussel
Actually, we might want to leave the `queryAction` field, but rename it to `helperAction`, because we want to use it to run checksums regardless of whether control or test is used for determinism analysis.
But we enforce `overrideActionForQuery` (renamed to `queryAction`) to be non-null and use it for issuing the query itself, as you are proposing.
Sounds good?
I'll try adding most of this in the second commit.
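The shape of that proposal could look roughly like the sketch below. This is only an illustration under assumptions, not the code from this PR: `AnalyzerSketch`, `checksum`, and `rerun` are hypothetical names, and a plain `Function<String, Long>` stands in for the verifier's action types.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch: the helper action stays a field (used for checksum-style
// helper queries), while the action that issues the query itself is always
// passed in by the caller and must not be null.
final class AnalyzerSketch
{
    private final Function<String, Long> helperAction;

    AnalyzerSketch(Function<String, Long> helperAction)
    {
        this.helperAction = Objects.requireNonNull(helperAction, "helperAction is null");
    }

    // Helper queries (e.g. checksums) always go through the helper action.
    long checksum(String query)
    {
        return helperAction.apply(query);
    }

    // The query itself runs on whichever action the caller chose (control or test).
    long rerun(String query, Function<String, Long> queryAction)
    {
        Objects.requireNonNull(queryAction, "queryAction is null");
        return queryAction.apply(query);
    }
}
```

With this split there is no "fall back to the constructor's action when null" branch left to reason about; the caller decides explicitly where the query runs.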
presto-verifier/src/main/java/com/facebook/presto/verifier/framework/VerifierConfig.java
Also, it would be good to add a test.
@rschlussel
I would take a similar approach to TestLimitQueryDeterminismAnalyzer: mock out the test and control PrestoActions. Then, for some fake queries, if the test result is non-deterministic (e.g. the mock action increments results by 1 each time you call it) and control is deterministic, the result is non-deterministic. If both are non-deterministic, the result is also non-deterministic.
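A minimal sketch of that testing idea, under assumptions: `DeterminismSketch` and its methods are hypothetical, a `Supplier<Long>` stands in for a mocked PrestoAction returning a checksum, and combining the two stages with a logical AND is one reading of the cases listed above (the case where only control is non-deterministic is not spelled out in this thread).

```java
import java.util.function.Supplier;

final class DeterminismSketch
{
    private DeterminismSketch() {}

    // Hypothetical mock action: returns a different "checksum" on every call.
    static Supplier<Long> incrementingAction()
    {
        long[] next = {0};
        return () -> next[0]++;
    }

    // Hypothetical mock action: always returns the same "checksum".
    static Supplier<Long> constantAction(long checksum)
    {
        return () -> checksum;
    }

    // Takes the first run as the baseline, then re-runs; any deviation means non-deterministic.
    static boolean isDeterministic(Supplier<Long> action, int reruns)
    {
        long baseline = action.get();
        for (int i = 0; i < reruns; i++) {
            if (action.get() != baseline) {
                return false;
            }
        }
        return true;
    }

    // Assumption: the overall result is deterministic only if both stages are.
    static boolean isDeterministicOnBoth(Supplier<Long> test, Supplier<Long> control, int reruns)
    {
        return isDeterministic(test, reruns) && isDeterministic(control, reruns);
    }
}
```

A real test would mock the verifier's own action classes instead of `Supplier`, but the incrementing mock is the part that makes non-determinism reproducible in a unit test.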
Aren't verifier queries capped within their own resource groups? That means that even if we send more queries to the verifier resource group, they will be queued and won't hurt production queries, as those have a separate queue. From our production issues, the queueing problems are not directly related to verification, but to a few other issues like:
The problem we are trying to solve is the following. Say in production we have 5 Java clusters and 1 C++ cluster. To release to 1 cluster, we can send all verification control queries distributed across the 5 Java clusters. This makes verification fast and well distributed among 5 control clusters. Now, with more C++ clusters in production, we have scenarios like 3 Java clusters and 3 C++ clusters in production. To make a C++ release, we don't have the parallelism of Java control clusters that we had before, as it was reduced from 5 to 3. This is a natural progression, and it means we need to proportionately reduce control queries on Java production and increase them on C++ production, and thus start using C++ as the control environment. This will also balance the load better, because it will run (verifier + production) queries on both flavors, instead of only on Java (today) while C++ does not get used as control.
@amitkdutta During recent queueing SEVs I observed a not insignificant number of verifier queries running at the time and thought that they contributed to the clogging. While this change won't protect us completely from queueing, it can reduce pressure on the production environment.
Ping @rschlussel
Description
Add 'run-determinism-analysis-on-test' property to VerifierConfig.
Then use it in `DataVerification` to pass either `test` or `control` checksum to `DeterminismAnalyzer`.
Also make `DeterminismAnalyzer` use `control` by default to run analysis, while previously it was using `helper`.
Motivation and Context
This is needed in our effort to move more queries to the `test` cluster to avoid overloading production (`control`) clusters.
We tried to pass the `test` cluster as `helper` to run checksum queries on `test`, not `control`.
However, doing just that breaks determinism analysis: we always use the control checksum in determinism analysis, so a query that returns a different result on the test cluster differs from that checksum on every rerun, is considered non-deterministic, and ends up being skipped rather than flagged as a failure.
`DeterminismAnalyzer` is constructed with the helper action and thus, by default, runs determinism queries on `control`, unless a `helper` cluster is specified. When we specify the `test` cluster as `helper`, `DeterminismAnalyzer` runs determinism queries on `test`.
This change allows us to specify a new option and use the test checksum instead, comparing multiple runs with it and seeing if the query is truly non-deterministic.
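The failure mode described above can be reproduced with a toy model. The sketch below is illustrative only and rests on assumptions: `BaselineChoiceSketch`, `Outcome`, and `analyze` are hypothetical names, checksums are plain `long` values, and a `LongSupplier` stands in for re-running the query on the test cluster.

```java
import java.util.function.LongSupplier;

final class BaselineChoiceSketch
{
    private BaselineChoiceSketch() {}

    enum Outcome
    {
        DETERMINISTIC,
        NON_DETERMINISTIC
    }

    // Re-runs the query and compares every rerun against the chosen baseline checksum.
    static Outcome analyze(long baselineChecksum, LongSupplier rerun, int reruns)
    {
        for (int i = 0; i < reruns; i++) {
            if (rerun.getAsLong() != baselineChecksum) {
                return Outcome.NON_DETERMINISTIC;
            }
        }
        return Outcome.DETERMINISTIC;
    }
}
```

Suppose control yields checksum 100 and test stably yields 101 (a real mismatch). With reruns on test, `analyze(100L, () -> 101L, 3)` reports `NON_DETERMINISTIC`, so the query would be skipped; `analyze(101L, () -> 101L, 3)` reports `DETERMINISTIC`, so the original control/test mismatch is kept as a failure.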
Test Plan
Added new unit test.
Also tested on a query that returns different result in Presto and Presto C++:
Running new version, but not using helper cluster or the new property (current setup):
1 deterministic and failed. Correct outcome: column mismatch detected.
All queries ran on prod (control), except main and setup test query.
Running new version, setting helper cluster to point to test environment, but not using the new property:
1 non-deterministic (ND) and got skipped. Incorrect outcome: column mismatch skipped!
All queries ran on test, except main and setup control query.
Running new version, setting helper cluster to point to test environment and using the new property:
1 deterministic and failed. Correct outcome: column mismatch detected.
All queries ran on test, except main and setup control query.