Skip to content

Replication priority: scope boosts to hosting threads, clean up on replica removal, report thread name#3269

Open
Qian1900 wants to merge 4 commits into
linkedin:masterfrom
Qian1900:qianwang/fix-replication-priority-ghost-entries
Open

Replication priority: scope boosts to hosting threads, clean up on replica removal, report thread name#3269
Qian1900 wants to merge 4 commits into
linkedin:masterfrom
Qian1900:qianwang/fix-replication-priority-ghost-entries

Conversation

@Qian1900

Copy link
Copy Markdown
Contributor

Summary

Improvements to the replication-priority admin API (#3261), surfaced by live Set/List/Clear testing on ei-ltx1 perf:

  • prioritizePartitions only boosts partitions the thread hosts. Previously ReplicationEngine.prioritizePartitions fanned the boost to every ReplicaThread, including threads with no replica of the partition. Those entries never auto-prune (the empty-infos case in shouldAutoPrunePriority is intentionally not pruned), so they linger as stale "ghost" priorities until a manual unset or restart. Now a thread only accepts a boost for a partition it actually replicates (new hostsPartition check).
  • removeRemoteReplicaInfo clears an orphaned priority once the thread's last replica of a partition is removed (decommission / rebalance). Guarded by !hostsPartition so the priority is retained while another peer replica of the same partition remains on the thread.
  • PriorityEntry / ListReplicationPriorityAdminResponse now carry the holding ReplicaThread name, so otherwise-identical per-thread List rows (same partition / boost / isInterColo) are distinguishable.

Wire compatibility

The thread name is appended to each entry on the existing response version (no version bump). This is safe here because the response is ephemeral (not persisted) and this request/response pair was introduced together in the same unreleased feature line — the only writer is the server and the only reader is ServerAdminTool (not yet merged), so there is no old-reader-vs-new-writer skew for the field. threadName is nullable and serialized as a length-prefixed UTF-8 string (null"").

The CLI surfacing of threadName in the List JSON output lands separately in #3268 (the ServerAdminTool lives on that branch).

Testing Done

  • New unit tests:
    • ReplicationTest.prioritizePartitionsSkipsNonHostedPartitionsTest — a boost targeting a non-hosted partition is skipped.
    • ReplicationTest.removeRemoteReplicaInfoClearsOrphanedPriorityTest — removing the last replica clears the priority.
    • ReplicationTest.removeRemoteReplicaInfoRetainsPriorityWhileAnotherReplicaRemainsTest — removing one of two peer replicas of a partition retains the priority; removing the last clears it (regression guard for the !hostsPartition check; verified by mutation — deleting the guard fails this test).
    • RequestResponseTest.listReplicationPriorityAdminResponseTest extended to assert threadName round-trips; new ...NullThreadNameTest pins the null"" contract.
  • Re-ran AmbryServerRequestsTest replication-priority handler tests (40), RemoteReplicaGroupPollerTest (14), and ReplicaThreadPriorityAutoPruneTest (8): all pass.

Durability

No durability risk. This is a replication-scheduling hint (fetch-budget weight) on the read/observability path — it does not touch PUT / named-blob PUT / TTL / delete, metadata storage, or callback semantics. Removing a priority on a non-hosting thread is a no-op (that thread never fetched the partition); removing it on actual replica removal is correct.

…plica removal, report thread name

Improvements to the replication-priority admin API, surfaced by live
Set/List/Clear testing on ei-ltx1 perf:

- prioritizePartitions only applies a boost to partitions the thread
  actually hosts. Previously the engine fanned the boost to every thread,
  including ones with no replica of the partition; those entries never
  auto-prune (the empty-infos case is intentionally not pruned), leaving
  stale "ghost" priorities that linger until manual unset or restart.
- removeRemoteReplicaInfo drops a partition's priority once the thread's
  last replica of it is removed (decommission / rebalance), guarded by
  !hostsPartition so it is retained while another peer replica remains.
- PriorityEntry / ListReplicationPriorityAdminResponse now carry the
  holding ReplicaThread name so otherwise-identical per-thread List rows
  (same partition/boost/isInterColo) are distinguishable.

Testing Done:
- New unit tests: prioritizePartitions skips non-hosted partitions;
  removeRemoteReplicaInfo clears the last-replica orphan and retains the
  priority while a peer replica remains; ListReplicationPriorityAdminResponse
  round-trips threadName incl. the null -> "" contract.
- Re-ran AmbryServerRequestsTest replication-priority handler tests,
  RemoteReplicaGroupPollerTest, and the auto-prune predicate tests: all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.86%. Comparing base (52ba813) to head (8a695b7).
⚠️ Report is 396 commits behind head on master.

Files with missing lines Patch % Lines
...va/com/github/ambry/replication/ReplicaThread.java 0.00% 26 Missing ⚠️
...protocol/ListReplicationPriorityAdminResponse.java 0.00% 10 Missing ⚠️
...va/com/github/ambry/replication/PriorityEntry.java 0.00% 4 Missing ⚠️
...om/github/ambry/replication/ReplicationEngine.java 0.00% 3 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3269       +/-   ##
=============================================
- Coverage     64.24%   50.86%   -13.39%     
+ Complexity    10398     8679     -1719     
=============================================
  Files           840      937       +97     
  Lines         71755    80229     +8474     
  Branches       8611     9639     +1028     
=============================================
- Hits          46099    40805     -5294     
- Misses        23004    36034    +13030     
- Partials       2652     3390      +738     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Qian1900 and others added 3 commits June 10, 2026 17:25
…sort

Entries here are always created with thread.getName() (never null), so the
"== null ? \"\"" guard in the comparator handled an impossible case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
linkedin#3268 (ServerAdminTool CLI) merged to master while linkedin#3269 deletes the
3-arg PriorityEntry ctor in favor of the threadName-carrying 4-arg ctor.
Update the 8 ReplicationPriorityAdminHelperTest construction sites to the
4-arg form (null threadName — these tests only assert
partitionId/boost/interColo in the result JSON). Fixes the
:ambry-tools:compileTestJava failure on the merged tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants