executor, distsql: reduce the log of dispatching mpp tasks#68041
gengliqi wants to merge 4 commits into pingcap:master
Conversation
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Skipping CI for Draft Pull Request.
Hi @gengliqi. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
🧹 Nitpick comments (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go (1)
265-277: Aggregated dispatch log: consider a couple of small polish items.

The consolidation looks good and meets the PR goal. A few optional points worth considering:

- Log-string change is a breaking signal for operators. The message changed from the per-task `"Dispatch mpp task"` to `"Dispatch mpp tasks"`. Any external log alerting/grep that matched the old string will silently go dark. Worth calling out in the release note (currently `None`) so operators can update dashboards/alerts.
- `task-count` is redundant with `IDs`. `zap.Int64s("IDs", taskIDs)` already encodes the count; `zap.Int("task-count", len(taskIDs))` duplicates that info. Fine to keep for human readability, but you can drop it if terseness is preferred.
- Field-name style is mixed (`timestamp`, `IDs`, `QueryTs`, `LocalQueryId`, `ServerID`, `addresses`, `mpp-version`, `GatherID`, `resource_group`). This mirrors the legacy log so it's not a regression, but while you're touching this line it's a cheap opportunity to pick one style (the rest of the file tends toward lower-kebab like `mpp-version`, `task-id`).
- `firstTask.StartTs` vs `c.startTS`. For a fragment dispatched by this coordinator these should always match; using `c.startTS` directly would avoid the implicit assumption and remove the need to read `firstTask` for that field.

No functional concerns; these are all nits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/internal/mpp/local_mpp_coordinator.go` around lines 265 - 277, Restore the original log message and tidy redundant/mixed fields in the consolidated dispatch log: in the log statement built in local_mpp_coordinator.go (the logutil.BgLogger().Info call referencing firstTask, taskIDs, pf.Sink, c.gatherID, rgName), change the message back to "Dispatch mpp task" to avoid breaking alerts, remove the redundant zap.Int("task-count", len(taskIDs)) since zap.Int64s("IDs", taskIDs) already conveys count, replace firstTask.StartTs with c.startTS to avoid the implicit dependency on firstTask, and normalize key names to a consistent lower-kebab style (e.g., query-ts, local-query-id, server-id, gather-id, resource-group) while keeping existing values (firstTask.MppQueryID.*, c.gatherID, rgName, addresses, plannercore.ToString(pf.Sink), firstTask.MppVersion, pf.Sink.GetCompressionMode()).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f3804ee6-fa49-4752-95de-249ffd50749a
📒 Files selected for processing (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master     #68041    +/-  ##
================================================
- Coverage   77.7558%   77.0719%  -0.6839%
================================================
  Files          1990       1972       -18
  Lines        551768     552429      +661
================================================
- Hits         429032     425768     -3264
- Misses      121816     126649     +4833
+ Partials       920         12      -908
```
Flags with carried forward coverage won't be shown. Click here to find out more.
🧹 Nitpick comments (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go (1)
265-277: Aggregated dispatch log: consider a brief comment on the invariant.

The aggregated log reuses `firstTask.StartTs`, `firstTask.MppQueryID`, and `firstTask.MppVersion` to represent the whole fragment. That is correct because all tasks produced for a single fragment share the same query/gather metadata, but it's not self-evident to a future reader skimming this method. A one-line comment above the log (e.g., "all tasks in the fragment share StartTs/MppQueryID/MppVersion") would make the intent obvious and prevent someone from later "fixing" this to a per-task loop.

Also note the positional correspondence between `taskIDs[i]` and `addresses[i]` is implicit; since this log is intended for operators rather than structured parsing, that's acceptable. Just worth being aware of if downstream tooling ever starts consuming these fields.

✍️ Suggested clarifying comment

```diff
 if len(tasks) > 0 {
+	// All tasks within a fragment share the same query-level metadata
+	// (StartTs/MppQueryID/MppVersion), so it's safe to take them from tasks[0].
+	// taskIDs[i] corresponds positionally to addresses[i].
 	firstTask := tasks[0]
 	logutil.BgLogger().Info("Dispatch mpp tasks",
 		zap.Uint64("timestamp", firstTask.StartTs),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/internal/mpp/local_mpp_coordinator.go` around lines 265 - 277, Add a one-line clarifying comment immediately above the aggregated dispatch log in LocalMPPCoordinator (the block using firstTask, tasks, taskIDs, addresses, and c.gatherID) stating the invariant that all tasks in this fragment share the same StartTs, MppQueryID, and MppVersion so using firstTask values is intentional; also note that taskIDs[index] corresponds positionally to addresses[index] to avoid future accidental per-task refactors or misinterpretation of the logged arrays.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f45b6296-ebb4-4063-bce9-6312f5dd607a
📒 Files selected for processing (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go
/cc @windtalker

/cc @xzhangxian1008
```go
if config.GetGlobalConfig().DisaggregatedTiFlash {
	computeStores, getStoreErr := cache.GetTiFlashComputeStores(
		backoff.NewBackoffer(ctx, copr.CopNextMaxBackoff).TiKVBackoffer())
	if getStoreErr == nil {
```
what happens if getStoreErr is not nil?
Traffic that could cross zones might be counted as traffic within the same zone, and the store ID in the logs might be 0.
[APPROVAL NOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: xzhangxian1008. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
[LGTM Timeline notifier] Timeline:
Signed-off-by: gengliqi <gengliqiii@gmail.com>
♻️ Duplicate comments (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go (1)
988-995: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle `GetTiFlashComputeStores` failures explicitly instead of silent fallback.

When `getStoreErr != nil`, the code silently skips compute stores. That can leave `allTiFlashStoreInfo` incomplete, which later causes unknown zones / `storeID=0` in aggregated dispatch logs and zone-traffic misclassification.

Suggested fix

```diff
 if config.GetGlobalConfig().DisaggregatedTiFlash {
 	computeStores, getStoreErr := cache.GetTiFlashComputeStores(
 		backoff.NewBackoffer(ctx, copr.CopNextMaxBackoff).TiKVBackoffer())
-	if getStoreErr == nil {
+	if getStoreErr != nil {
+		logutil.BgLogger().Warn("failed to get tiflash compute stores for mpp dispatch",
+			zap.Error(getStoreErr),
+			zap.Uint64("txnStartTS", c.startTS),
+			zap.Uint64("gatherID", c.gatherID))
+	} else {
 		for _, tiflashStore := range computeStores {
 			addTiFlashStoreInfo(allTiFlashStoreInfo, tiflashStore)
 		}
 	}
 }
```

As per coding guidelines: "Keep error handling actionable and contextual; avoid silently swallowing errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/internal/mpp/local_mpp_coordinator.go` around lines 988 - 995, The code currently swallows errors from cache.GetTiFlashComputeStores causing missing entries in allTiFlashStoreInfo; update the DisaggregatedTiFlash branch to handle getStoreErr explicitly: check if getStoreErr != nil and then either return or propagate the error (or at minimum log it with contextual information including the backoffer, ctx/region/store identifiers and the fact it's populating allTiFlashStoreInfo) before skipping the loop; keep the existing successful path that calls addTiFlashStoreInfo for each tiflashStore, but do not silently ignore failures from GetTiFlashComputeStores.
🧹 Nitpick comments (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go (1)
282-293: ⚡ Quick win

Consider capping the `IDs`/`storeIDs` log payload size for very large fragments.

Aggregating per fragment is great, but logging full arrays can still create huge single log entries at scale. Consider logging the count plus truncated samples to keep log cost predictable.

Possible approach

```diff
+const maxLogged = 128
+loggedTaskIDs := taskIDs
+loggedStoreIDs := storeIDs
+truncated := false
+if len(loggedTaskIDs) > maxLogged {
+	loggedTaskIDs = loggedTaskIDs[:maxLogged]
+	loggedStoreIDs = loggedStoreIDs[:maxLogged]
+	truncated = true
+}
 logutil.BgLogger().Info("Dispatch mpp tasks",
 	zap.Uint64("timestamp", firstTask.StartTs),
-	zap.Int64s("IDs", taskIDs), zap.Uint64s("storeIDs", storeIDs),
+	zap.Int("taskCount", len(taskIDs)),
+	zap.Bool("idsTruncated", truncated),
+	zap.Int64s("IDs", loggedTaskIDs), zap.Uint64s("storeIDs", loggedStoreIDs),
 	...
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/internal/mpp/local_mpp_coordinator.go` around lines 282 - 293, The "Dispatch mpp tasks" log is currently emitting entire taskIDs and storeIDs arrays (variables taskIDs and storeIDs) which can blow up log size; change the log to instead emit their counts and small truncated samples (e.g., len(taskIDs)/len(storeIDs) plus first N elements or a hash/summary) while keeping the existing context fields (firstTask.MppQueryID.*, firstTask.MppVersion, pf.Sink plan/compression, c.gatherID, rgName, sqlDigest, planDigestStr) so callers like the Dispatch mpp tasks logging site produce predictable, bounded-sized entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8d5b971d-e2d2-4071-91c0-92e96de9eadd
📒 Files selected for processing (1)
pkg/executor/internal/mpp/local_mpp_coordinator.go
What problem does this PR solve?
Issue Number: close #68040
Problem Summary:
What changed and how does it work?
Aggregate the dispatching mpp task log from one line per task to one line per fragment.
The aggregated log now records task IDs and TiFlash store IDs instead of verbose addresses.
BTW, this PR also includes TiFlash compute nodes in the store info map in disaggregated mode, so zone-based traffic statistics will not miss them.
Run TPC-H Q9 with 2 TiFlash
Before
After
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Note: Internal optimizations only; no visible end-user changes.