[Enhancement] Report load profile if brpc reaches timeout #55494

Open. Wants to merge 3 commits into base: main.

Contributor

@banmoy banmoy commented Jan 27, 2025

Why I'm doing:

A common failure during loading is a brpc timeout, such as [E1008]Reached timeout 150000ms@10.128.8.78:8060. It is often caused by some stage on the storage side taking too long. Currently there is not enough information to analyze where the time went when this happens. This PR therefore adds an observability mechanism: when a "brpc reached timeout" exception is detected, a profile containing time consumption statistics from the storage side is generated proactively to help pinpoint the cause.

What I'm doing:

The brpc request is sent from the Coordinator BE (OlapTableSink) to the Executor BE (LoadChannel) on the storage side. This PR adds a mechanism in OlapTableSink: when it detects a brpc timeout, it sends a diagnose rpc to the LoadChannel to request the channel's current profile, and that profile is then reported to the FE.
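
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: DiagnoseResult, is_brpc_reached_timeout, diagnose_on_timeout, and the two callbacks are hypothetical names, and the real implementation uses brpc stubs and the PLoadDiagnoseRequest / PLoadDiagnoseResult protobuf messages instead of plain functions.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for the diagnose response; not the PR's protobuf type.
struct DiagnoseResult {
    std::string profile;  // serialized time consumption statistics
};

// Returns true if the error text looks like the brpc "[E1008]Reached timeout" failure.
inline bool is_brpc_reached_timeout(const std::string& error_text) {
    return error_text.find("[E1008]Reached timeout") != std::string::npos;
}

// send_diagnose: hypothetical callback that issues the diagnose rpc with the given
// timeout and returns the LoadChannel profile, or nullopt if the diagnose rpc fails.
// report_to_fe: hypothetical callback that forwards the profile to the FE.
// Returns true only if a profile was obtained and reported.
inline bool diagnose_on_timeout(
        const std::string& error_text, int diagnose_timeout_ms,
        const std::function<std::optional<DiagnoseResult>(int)>& send_diagnose,
        const std::function<void(const DiagnoseResult&)>& report_to_fe) {
    assert(diagnose_timeout_ms > 0);
    if (!is_brpc_reached_timeout(error_text)) {
        return false;  // other failures do not trigger a diagnose rpc in this sketch
    }
    std::optional<DiagnoseResult> result = send_diagnose(diagnose_timeout_ms);
    if (!result.has_value()) {
        return false;  // the diagnose rpc itself failed or timed out
    }
    report_to_fe(*result);
    return true;
}
```

The diagnose rpc gets its own short timeout so that a slow or stuck LoadChannel cannot block the sink a second time.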

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@banmoy banmoy requested review from a team as code owners January 27, 2025 08:21
@wanpengfei-git wanpengfei-git requested a review from a team January 27, 2025 08:22
@mergify mergify bot assigned banmoy Jan 27, 2025
@banmoy banmoy changed the title [Feature] Report load profile if brpc reaches timeout [Enhancement] Report load profile if brpc reaches timeout Jan 27, 2025
@banmoy banmoy force-pushed the reach_timeout_profile branch 2 times, most recently from 4bbc693 to f7b13c9 Compare January 27, 2025 12:27
@banmoy banmoy force-pushed the reach_timeout_profile branch 5 times, most recently from d5dab22 to cbfba93 Compare February 12, 2025 04:45
@github-actions github-actions bot added the 3.4 label Feb 12, 2025
const starrocks::PLoadDiagnoseRequest* request,
starrocks::PLoadDiagnoseResult* response,
google::protobuf::Closure* done) {
VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id: " << request->txn_id();
Contributor

Suggested change
- VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id: " << request->txn_id();
+ VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id=" << request->txn_id();

Contributor Author

fixed

@@ -2224,7 +2225,7 @@ public void handleDMLStmtWithProfile(ExecPlan execPlan, DmlStmt stmt) throws Exc
throw t;
} finally {
boolean isAsync = false;
- if (context.isProfileEnabled()) {
+ if (context.isProfileEnabled() || LoadErrorUtils.enableProfileAfterError(coord)) {
Contributor

Does stream load need to check LoadErrorUtils.enableProfileAfterError?

Contributor Author

Currently, stream load always reports the profile after a failure, whatever the reason, so there is no need to check LoadErrorUtils.enableProfileAfterError for stream load.

// The timeout of the diagnosis rpc sent from OlapTableSink to LoadChannel
CONF_mInt32(load_diagnose_send_rpc_timeout_ms, "2000");
// Used in load fail point. The brpc timeout used to simulate brpc exception "[E1008]Reached timeout"
CONF_mInt32(load_fp_brpc_timeout_ms, "-1");
Contributor

Maybe we can support a timeout in the admin enable failpoint script later?

Contributor Author

As discussed offline, this can be done later.
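
For readers unfamiliar with this kind of fail point setting, here is a minimal sketch of how a value like load_fp_brpc_timeout_ms could override the normal brpc timeout to provoke "[E1008]Reached timeout" in tests. It assumes that a non-positive value (such as the default -1) means "no override"; that interpretation and the helper name effective_brpc_timeout_ms are assumptions for illustration, not taken from this PR.

```cpp
#include <cassert>

// Hypothetical helper: choose the brpc timeout to use for a load request.
// fp_timeout_ms mirrors load_fp_brpc_timeout_ms; a non-positive value is
// treated as "fail point timeout not set" (an assumption of this sketch).
inline int effective_brpc_timeout_ms(int normal_timeout_ms, int fp_timeout_ms) {
    assert(normal_timeout_ms > 0);
    return fp_timeout_ms > 0 ? fp_timeout_ms : normal_timeout_ms;
}
```

With the default of -1 the normal timeout is used unchanged, while a small positive value makes the request time out quickly so the diagnose path can be exercised.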

Signed-off-by: PengFei Li <lpengfei2016@gmail.com>
@banmoy banmoy force-pushed the reach_timeout_profile branch from cbfba93 to a68e6fb Compare February 21, 2025 05:21
[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

pass : 15 / 17 (88.24%)

file detail

   path                                                  covered_line  new_line  coverage  not_covered_line_detail
🔵 com/starrocks/qe/ConnectContext.java                  0             1         00.00%    [791]
🔵 com/starrocks/load/loadv2/LoadErrorUtils.java         9             10        90.00%    [65]
🔵 com/starrocks/qe/StmtExecutor.java                    1             1         100.00%   []
🔵 com/starrocks/qe/scheduler/dag/JobSpec.java           1             1         100.00%   []
🔵 com/starrocks/qe/DefaultCoordinator.java              1             1         100.00%   []
🔵 com/starrocks/qe/scheduler/QueryRuntimeProfile.java   1             1         100.00%   []
🔵 com/starrocks/load/loadv2/LoadLoadingTask.java        2             2         100.00%   []

[BE Incremental Coverage Report]

pass : 104 / 119 (87.39%)

file detail

   path                                                covered_line  new_line  coverage  not_covered_line_detail
🔵 be/src/service/internal_service.cpp                 0             2         00.00%    [423, 425]
🔵 be/src/runtime/load_channel.cpp                     16            20        80.00%    [439, 449, 452, 482]
🔵 be/src/exec/tablet_sink_index_channel.cpp           68            77        88.31%    [778, 1095, 1096, 1117, 1134, 1135, 1136, 1155, 1156]
🔵 be/src/runtime/load_channel_mgr.cpp                 11            11        100.00%   []
🔵 be/src/service/service_be/internal_service.cpp      4             4         100.00%   []
🔵 be/src/util/internal_service_recoverable_stub.cpp   4             4         100.00%   []
🔵 be/src/runtime/local_tablets_channel.cpp            1             1         100.00%   []

3 participants