Replication priority: scope boosts to hosting threads, clean up on replica removal, report thread name #3269
Open
Qian1900 wants to merge 4 commits into
…plica removal, report thread name
Improvements to the replication-priority admin API, surfaced by live Set/List/Clear testing on ei-ltx1 perf:
- prioritizePartitions only applies a boost to partitions the thread actually hosts. Previously the engine fanned the boost to every thread, including ones with no replica of the partition; those entries never auto-prune (the empty-infos case is intentionally not pruned), leaving stale "ghost" priorities that linger until manual unset or restart.
- removeRemoteReplicaInfo drops a partition's priority once the thread's last replica of it is removed (decommission / rebalance), guarded by !hostsPartition so it is retained while another peer replica remains.
- PriorityEntry / ListReplicationPriorityAdminResponse now carry the holding ReplicaThread name so otherwise-identical per-thread List rows (same partition/boost/isInterColo) are distinguishable.
Testing Done:
- New unit tests: prioritizePartitions skips non-hosted partitions; removeRemoteReplicaInfo clears the last-replica orphan and retains the priority while a peer replica remains; ListReplicationPriorityAdminResponse round-trips threadName incl. the null -> "" contract.
- Re-ran AmbryServerRequestsTest replication-priority handler tests, RemoteReplicaGroupPollerTest, and the auto-prune predicate tests: all pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## master #3269 +/- ##
=============================================
- Coverage 64.24% 50.86% -13.39%
+ Complexity 10398 8679 -1719
=============================================
Files 840 937 +97
Lines 71755 80229 +8474
Branches 8611 9639 +1028
=============================================
- Hits 46099 40805 -5294
- Misses 23004 36034 +13030
- Partials 2652 3390 +738
…sort
Entries here are always created with thread.getName() (never null), so the "== null ? \"\"" guard in the comparator handled an impossible case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation-priority-ghost-entries
linkedin#3268 (ServerAdminTool CLI) merged to master while linkedin#3269 deletes the 3-arg PriorityEntry ctor in favor of the threadName-carrying 4-arg ctor. Update the 8 ReplicationPriorityAdminHelperTest construction sites to the 4-arg form (null threadName; these tests only assert partitionId/boost/interColo in the result JSON). Fixes the :ambry-tools:compileTestJava failure on the merged tree.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Improvements to the replication-priority admin API (#3261), surfaced by live Set/List/Clear testing on ei-ltx1 perf:
- `prioritizePartitions` only boosts partitions the thread hosts. Previously `ReplicationEngine.prioritizePartitions` fanned the boost to every `ReplicaThread`, including threads with no replica of the partition. Those entries never auto-prune (the empty-infos case in `shouldAutoPrunePriority` is intentionally not pruned), so they linger as stale "ghost" priorities until a manual unset or restart. Now a thread only accepts a boost for a partition it actually replicates (new `hostsPartition` check).
- `removeRemoteReplicaInfo` clears an orphaned priority once the thread's last replica of a partition is removed (decommission / rebalance). Guarded by `!hostsPartition` so the priority is retained while another peer replica of the same partition remains on the thread.
- `PriorityEntry` / `ListReplicationPriorityAdminResponse` now carry the holding `ReplicaThread` name, so otherwise-identical per-thread `List` rows (same partition / boost / isInterColo) are distinguishable.
Wire compatibility
The thread name is appended to each entry on the existing response version (no version bump). This is safe here because the response is ephemeral (not persisted) and this request/response pair was introduced together in the same unreleased feature line: the only writer is the server and the only reader is `ServerAdminTool` (not yet merged), so there is no old-reader-vs-new-writer skew for the field. `threadName` is nullable and serialized as a length-prefixed UTF-8 string (`null` → `""`).
Testing Done
- `ReplicationTest.prioritizePartitionsSkipsNonHostedPartitionsTest`: a boost targeting a non-hosted partition is skipped.
- `ReplicationTest.removeRemoteReplicaInfoClearsOrphanedPriorityTest`: removing the last replica clears the priority.
- `ReplicationTest.removeRemoteReplicaInfoRetainsPriorityWhileAnotherReplicaRemainsTest`: removing one of two peer replicas of a partition retains the priority; removing the last clears it (regression guard for the `!hostsPartition` check; verified by mutation: deleting the guard fails this test).
- `RequestResponseTest.listReplicationPriorityAdminResponseTest` extended to assert `threadName` round-trips; new `...NullThreadNameTest` pins the `null` → `""` contract.
- `AmbryServerRequestsTest` replication-priority handler tests (40), `RemoteReplicaGroupPollerTest` (14), and `ReplicaThreadPriorityAutoPruneTest` (8): all pass.
Durability
No durability risk. This is a replication-scheduling hint (fetch-budget weight) on the read/observability path — it does not touch PUT / named-blob PUT / TTL / delete, metadata storage, or callback semantics. Removing a priority on a non-hosting thread is a no-op (that thread never fetched the partition); removing it on actual replica removal is correct.
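For reviewers who want the scoping and cleanup rules in one place, here is a minimal sketch of the behavior described in the Summary. It is not the code in this PR: `ReplicaThreadSketch`, `addRemoteReplica`, `prioritizePartition`, `removeRemoteReplica`, and `hasPriority` are hypothetical names, partitions are modeled as `long` ids, and replicas as plain strings. Only the `hostsPartition` guard and the "clear on last replica removed" rule are taken from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a replica thread; NOT Ambry's ReplicaThread. */
class ReplicaThreadSketch {
  // partition id -> peer replicas of that partition assigned to this thread
  private final Map<Long, Set<String>> replicasByPartition = new HashMap<>();
  // partition id -> boost currently held by this thread
  private final Map<Long, Integer> priorities = new HashMap<>();

  void addRemoteReplica(long partitionId, String replica) {
    replicasByPartition.computeIfAbsent(partitionId, k -> new HashSet<>()).add(replica);
  }

  /** True while at least one replica of the partition is assigned to this thread. */
  boolean hostsPartition(long partitionId) {
    Set<String> replicas = replicasByPartition.get(partitionId);
    return replicas != null && !replicas.isEmpty();
  }

  /** Accepts the boost only for a hosted partition, so no ghost entry is created. */
  boolean prioritizePartition(long partitionId, int boost) {
    if (!hostsPartition(partitionId)) {
      return false;
    }
    priorities.put(partitionId, boost);
    return true;
  }

  /** Removes one replica; drops the priority only once the last replica is gone. */
  void removeRemoteReplica(long partitionId, String replica) {
    Set<String> replicas = replicasByPartition.get(partitionId);
    if (replicas == null) {
      return;
    }
    replicas.remove(replica);
    if (replicas.isEmpty()) {
      replicasByPartition.remove(partitionId);
    }
    if (!hostsPartition(partitionId)) {
      priorities.remove(partitionId); // orphaned priority is cleared
    }
  }

  boolean hasPriority(long partitionId) {
    return priorities.containsKey(partitionId);
  }

  public static void main(String[] args) {
    ReplicaThreadSketch thread = new ReplicaThreadSketch();
    thread.addRemoteReplica(1L, "peerA");
    thread.addRemoteReplica(1L, "peerB");
    System.out.println(thread.prioritizePartition(2L, 5)); // false: partition 2 is not hosted
    System.out.println(thread.prioritizePartition(1L, 5)); // true
    thread.removeRemoteReplica(1L, "peerA");
    System.out.println(thread.hasPriority(1L)); // true: peerB still remains
    thread.removeRemoteReplica(1L, "peerB");
    System.out.println(thread.hasPriority(1L)); // false: last replica removed
  }
}
```

The sketch keeps the guard inside the thread rather than in the caller, mirroring the description that a thread "only accepts a boost for a partition it actually replicates"; whether the real change places the check in the engine or the thread is not something this sketch asserts.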