-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Report load profile if brpc reaches timeout #55494
base: main
Are you sure you want to change the base?
Conversation
4bbc693
to
f7b13c9
Compare
d5dab22
to
cbfba93
Compare
const starrocks::PLoadDiagnoseRequest* request, | ||
starrocks::PLoadDiagnoseResult* response, | ||
google::protobuf::Closure* done) { | ||
VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id: " << request->txn_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id: " << request->txn_id(); | |
VLOG_RPC << "load diagnose, id=" << print_id(request->id()) << ", txn_id=" << request->txn_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -2224,7 +2225,7 @@ public void handleDMLStmtWithProfile(ExecPlan execPlan, DmlStmt stmt) throws Exc | |||
throw t; | |||
} finally { | |||
boolean isAsync = false; | |||
if (context.isProfileEnabled()) { | |||
if (context.isProfileEnabled() || LoadErrorUtils.enableProfileAfterError(coord)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does stream load need to check LoadErrorUtils.enableProfileAfterError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently stream load always reports profile after failure no matter what's the reason, so no need to check LoadErrorUtils.enableProfileAfterError
for stream load
// The timeout of the diagnosis rpc sent from OlapTableSink to LoadChannel | ||
CONF_mInt32(load_diagnose_send_rpc_timeout_ms, "2000"); | ||
// Used in load fail point. The brpc timeout used to simulate brpc exception "[E1008]Reached timeout" | ||
CONF_mInt32(load_fp_brpc_timeout_ms, "-1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we support timeout
in admin enable failpoint
script later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed offline, maybe do it later
Signed-off-by: PengFei Li <lpengfei2016@gmail.com>
Signed-off-by: PengFei Li <lpengfei2016@gmail.com>
cbfba93
to
a68e6fb
Compare
|
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]✅ pass : 15 / 17 (88.24%) file detail
|
[BE Incremental Coverage Report]✅ pass : 104 / 119 (87.39%) file detail
|
Why I'm doing:
A common issue encountered during import is brpc timeout, such as
[E1008]Reached timeout 150000ms@10.128.8.78:8060
, which is often caused by prolonged durations at certain stages on the storage side. Currently, there is a lack of information to analyze the time consumption when such problems occur. Therefore, it is desirable to provide an observability mechanism that, upon detecting a "brpc reached timeout" exception, can proactively generate a profile containing some time consumption statistics from the storage side, thereby aiding in pinpointing the cause.What I'm doing:
The brpc is sent from the Coordinator BE (OlapTableSink) to the Executor BE (LoadChannel) on the storage side. A mechanism is to be added within OlapTableSink that, upon detecting a brpc timeout, actively sends a diagnose rpc to the LoadChannel, requesting the current profile from the LoadChannel, which is then reported to the FE.
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: