Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[verifier] Add 'run-determinism-analysis-on-test' property to VerifierConfig #24492

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

spershin
Copy link
Contributor

@spershin spershin commented Feb 4, 2025

Description

Add 'run-determinism-analysis-on-test' property to VerifierConfig.
Then use it in DataVerification to pass either test or control checksum to DeterminismAnalyzer.
Also make DeterminismAnalyzer use control by default to run analysis, while previously it was using helper.

Motivation and Context

This is needed in our effort to move more queries to test cluster to avoid overloading production (control) clusters.

We tried to pass test cluster as helper to run checksum queries on test, not control.

However, doing just that breaks determinism analysis, because we always use control checksum in determinism analysis and running a query that returns a different result on the test cluster will cause such query to always return different result and be considered as non-deterministic and in the end being skipped rather than flagged with failure.

DeterminismAnalyser is constructed with helper action and thus, by default, runs determinism queries on control, unless helper cluster is specified. When we specify test cluster as helper, then DeterminismAnalyser runs determinism queries on test.

This change allows us to specify a new option and use test checksum instead, comparing multiple runs with it and seeing if the query is truly non-deterministic.

Test Plan

Added new unit test.

Also tested on a query that returns different result in Presto and Presto C++:

  • Running new version, but not using helper cluster or the new property (current setup):
    1 deterministic and failed. Correct outcome: column mismatch detected.
    All queries ran on prod (control), except main and setup test query.

  • Running new version, setting helper cluster to point to test environment, but not using the new property:
    1 ND and got skipped. Incorrect outcome: column mismatch skipped!
    All queries ran on test, except main and setup control query.

  • Running new version, setting helper cluster to point to test environment and using the new property:
    1 deterministic and failed. Correct outcome: column mismatch detected.
    All queries ran on test, except main and setup control query.

If release note is NOT required, use:

== NO RELEASE NOTE ==

@spershin spershin requested a review from a team as a code owner February 4, 2025 19:37
@spershin spershin requested a review from presto-oss February 4, 2025 19:37
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Feb 4, 2025
@rschlussel
Copy link
Contributor

i'm confused about what the problem is and how this fixes it. if you have something that deterministically returns some result in control and non-deterministically returns another result in test, then you've introduced a bug. Wouldn't this hide those issues?

@spershin
Copy link
Contributor Author

spershin commented Feb 4, 2025

i'm confused about what the problem is and how this fixes it. if you have something that deterministically returns some result in control and non-deterministically returns another result in test, then you've introduced a bug. Wouldn't this hide those issues?

@rschlussel
You are right, running determinism analysis on test cluster bears a certain risk: in case control is correct and test is randomly incorrect the query will receive the non-deterministic status when running multiple times on the test and will be skipped and won't trigger any alerts/investigations.

This, however, will not happen often and is a calculated risk we can take with Presto C++ to reduce the compute wasted/occupied in production environment by verification.
Especially when we are getting closer to start verification of Presto C++ against a previous version of Presto C++ (like Presto does).

The initial problem is when specifying test cluster as helper we have ALL queries with different results marked as non-deterministic, because running duplicates on test and always comparing with control's checksum.

Did this answer your concern/question?

@spershin
Copy link
Contributor Author

spershin commented Feb 4, 2025

Discussed offline with @rschlussel

Decided to have two stage determinism analysis in our case:
When this flag is specified, we

  1. Run determinism analysis on helper (test) with test checksum.
  2. If query deemed non-deterministic, we run the 2nd determinism analysis on control with control checksum.

If the flag is not specified we act as before - run single determinism analysis on helper with control checksum.

@spershin spershin force-pushed the MakeDeterminismAnalyzerNotUseControlChecksum branch from f913c56 to 95c7b88 Compare February 5, 2025 00:19
@spershin
Copy link
Contributor Author

spershin commented Feb 5, 2025

@rschlussel Updated the PR.

return determinismAnalyzer.analyze(control, matchResult.getControlChecksum(), getControlAction());
}
// Default behavior - we run determinism analysis, which uses helper action with control checksum.
return determinismAnalyzer.analyze(control, matchResult.getControlChecksum(), null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it makes sense to ever use the "helper" action here. if you want to check if something is deterministic, you need to run it in the original environment. We should change the behavior and always pass the control action. @singcha will this break anything for presto-on-spark verification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel
I understand what you are saying and was quite surprised to see that we use helper action to do determinism analysis.
However, this change is to allow less compute on the control side rather than to fix the determinism analysis.
I would like to keep this change in its scope and avoid changing anything outside of it.
Can we follow this PR up with moving determinism analysis to control by default?

{
// In case the action is not specified, use the default one we were constructed with.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would make sense to remove the queryAction field from the class and then always pass in the queryAction here (never let it be null).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel
We will still need HelperAction to run checksum and other types of helper queries.
I wonder if we can refactor this in the followup PR?
This PR would really help us unload prod clusters sooner and not break any existing use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do it in a second commit in this PR? That way revert is easier if something breaks with the refactoring, but if nothing breaks, we leave things in a better state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel
Actually, we might want to leave the queryAction field, but rename it to the helperAction, because we want to use it to run checksums regardless of using control or test for determinism analysis.
But we enforce overrideActionForQuery (rename it to queryAction) to be non-null and use it for issuing the query itself, like you are proposing.

Sounds good?
I'll try adding most of this in the second commit.

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, would be good to add a test.

@spershin
Copy link
Contributor Author

spershin commented Feb 5, 2025

also, would be good to add a test.

@rschlussel
Agree on that, but I would not know where to start here.
The TestDeterminismAnalyzer is very rudimentary.
There is TestLimitQueryDeterminismAnalyzer, but no way of figuring out where we ran our queries - test or control side.

@rschlussel
Copy link
Contributor

also, would be good to add a test.

@rschlussel Agree on that, but I would not know where to start here. The TestDeterminismAnalyzer is very rudimentary. There is TestLimitQueryDeterminismAnalyzer, but no way of figuring out where we ran our queries - test or control side.

I would take a similar approach to TestLimitQueryDeterminismAnalyzer. Mock out the Test and control PrestoActions. Then for some fake queries, if test result is non-deterministic (e.g. the mock action increments results by 1 each time you call it) and control is deterministic, result is non-deterministic. If both are non-deterministic, then the result is non-deterministic.

@amitkdutta
Copy link
Contributor

@spershin

This is needed in our effort to move helper queries to test clusters to avoid overloading production (control) clusters.

Isn't verifier queries are capped within its own resource groups? That means, even if we send more queries in verifier resource group, it will be queued and won't hurt production queries as they have separate queue.

From our production issues, the queueing problems are not directly releated to verification, but a few other issues like:

  • Not tagging eligiblity from different clients.
  • Critical missing items (e.g. json functions) that can make production queires go towards java environment.

The problem we are trying to solve is following: Say in production we have 5 java clusters and 1 C++ cluster. To release in 1 cluster, we can send all verification control queries distributed in 5 java clusters. This makes verification fast and well distributed among 5 control clusters. Now with more C++ clusters in production, we have scenarios like 3 java clusters and 3 C++ clusters in production. To make a C++ release, we don't have the parrlalesim of java control clusters like before as it reduced from 5 to 3. This is a natural progression, and it means we need to proportiately reduce control queires from java production and increase in C++ production thus start using C++ as control enviroment. This will also do the balance better, because it will run (verifier + production) queires in both flavors, instead of only in Java (today) and C++ does not get used as control.

@spershin
Copy link
Contributor Author

spershin commented Feb 5, 2025

@amitkdutta
Even though the resource group is different for verification queries, the resource groups don't have good isolation: they compete for resources (CPU and memory) and also take slots from the top-most resource group.
From what I understand child resource groups often (if not always) have more slots in total than the top-most group.

During recent queueing SEVs I have observed a not insignificant number of verifier queries running at the time and though that they contributed to the clogging.

While this change won't protect us completely from queueing, it can reduce pressure on production environment.

@spershin spershin force-pushed the MakeDeterminismAnalyzerNotUseControlChecksum branch from 95c7b88 to e85d6bf Compare February 6, 2025 06:25
@spershin spershin requested a review from rschlussel February 6, 2025 06:25
@spershin spershin changed the title [verifier] Add 'use-test-checksum-in-determinism-analyzer' property to VerifierConfig [verifier] Add 'run-determinism-analysis-on-test' property to VerifierConfig Feb 6, 2025
@spershin spershin force-pushed the MakeDeterminismAnalyzerNotUseControlChecksum branch from e85d6bf to 0b61b8b Compare February 6, 2025 06:35
@spershin spershin force-pushed the MakeDeterminismAnalyzerNotUseControlChecksum branch from 0b61b8b to 402108b Compare February 6, 2025 18:20
@spershin
Copy link
Contributor Author

spershin commented Feb 7, 2025

Ping @rschlussel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:Meta PR from Meta
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants