-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix repeated migration of colocate tablets #53099
Conversation
if (!olapTbl.needSchedule(isLocalBalance)) { | ||
continue; | ||
} | ||
|
||
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) { | ||
partitionChecked++; | ||
if (partitionChecked % partitionBatchNum == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
The changes are adding calls to olapTable.needSchedule()
and olapTbl.needSchedule()
, leading to skipped iterations when scheduling is not needed. If needSchedule
logic is incorrect, or if skipping these parts without proper handling causes data imbalance or system performance issues, it may introduce a risk of improperly balanced clusters or missed scheduling opportunities.
You can modify the code like this:
// Assess if the conditions under which 'needSchedule' returns false might lead to critical operations being skipped inadvertently,
// and ensure that either strategic logging or alternative operations handle such cases to preserve cluster health.
private List<TabletSchedCtx> balanceClusterDisk(ClusterLoadStatistic clusterStat) {
for (/* iteration variables */) {
if (!olapTable.needSchedule(false)) {
// Add logging or checks to confirm skipping doesn't cause issues
continue;
}
if (isDestBackendLocationMismatch(olapTable, hBackend.getId(), lBackend.getId(),
physicalPartition.getParentId(), tabletId)) {
continue;
}
// rest of the function...
}
}
private void balanceBackendDisk(TStorageMedium medium, double avgUsedPercent) {
for (/* iteration variables */) {
if (olapTable == null) {
continue;
}
if (!olapTable.needSchedule(true)) {
// Add logging or checks to assess impact
continue;
}
if (isTabletUnhealthy(tabletMeta.getDbId(), olapTable, tabletId, tabletMeta, aliveBeIds)) {
continue;
}
// rest of the function...
}
}
private Map<Pair<Long, Long>, PartitionStat> getPartitionStats(TStorageMedium medium, boolean isLocalBalance) {
for (/* iteration variables */) {
if (!olapTbl.needSchedule(isLocalBalance)) {
// Consider implementing fallback checks here
continue;
}
for (Partition partition : globalStateMgr.getLocalMetastore().getAllPartitionsIncludeRecycleBin(olapTbl)) {
partitionChecked++;
if (partitionChecked % partitionBatchNum == 0) {
// additional logic...
}
}
// rest of the function...
}
}
Ensure you validate whether operations that are bypassed should still partially execute or trigger alerts/logs for evaluation.
b8619d0
to
6e049a1
Compare
|
Head branch was pushed to by a user without write access
fe/fe-core/src/main/java/com/starrocks/clone/DiskAndTabletLoadReBalancer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add some ut
@mxdzs0612 You can just judge it in getPartitionTablets function. This is the base function to get the tablets |
@@ -984,6 +988,11 @@ private Map<Pair<Long, Long>, Set<Long>> getPartitionTablets(long beId, TStorage | |||
} | |||
} | |||
|
|||
OlapTable olapTable = getOlapTableById(tabletMeta.getDbId(), tabletMeta.getTableId()); | |||
if (olapTable != null && !olapTable.needSchedule(true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a balance between clusters to call getPartitionTablets(long beId, TStorageMedium medium, long pathHash). So you cannot just set isLocalBalance to true.
The another function getPartitionTablets(Long dbId, Long tableId, Long physicalPartitionId, Long indexId, List beIds, Pair<Long, List> bePaths) should be changed too.
fe/fe-core/src/main/java/com/starrocks/clone/DiskAndTabletLoadReBalancer.java
Show resolved
Hide resolved
@Mergifyio rebase |
9f51268
to
4abae70
Compare
✅ Branch has been successfully rebased |
9474122
to
638c8b7
Compare
|
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]❌ fail : 8 / 12 (66.67%) file detail
|
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
@Mergifyio backport branch-3.4 |
@Mergifyio backport branch-3.2 |
@Mergifyio backport branch-3.3 |
✅ Backports have been created
|
✅ Backports have been created
|
✅ Backports have been created
|
Signed-off-by: Jiao Mingye <mxdzs0612@gmail.com> (cherry picked from commit 8e69edb)
Signed-off-by: Jiao Mingye <mxdzs0612@gmail.com> (cherry picked from commit 8e69edb) # Conflicts: # fe/fe-core/src/main/java/com/starrocks/clone/DiskAndTabletLoadReBalancer.java
Signed-off-by: Jiao Mingye <mxdzs0612@gmail.com> (cherry picked from commit 8e69edb) # Conflicts: # fe/fe-core/src/main/java/com/starrocks/clone/DiskAndTabletLoadReBalancer.java
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: