
[Fix] Fix distributed POST actor concurrency split #1880

Merged
zhuzilin merged 2 commits into THUDM:main from
kaysonyu:fix/distributed-post-per-actor-concurrency
May 9, 2026

Conversation

@kaysonyu
Contributor

Summary

This PR fixes the per-actor concurrency calculation in distributed POST mode.

The current implementation creates args.num_gpus_per_node POST actors on each alive Ray node, but
it only divides _client_concurrency by the number of nodes when computing per_actor_conc. That
over-allocates concurrency for each actor.

This change keeps the existing actor creation behavior unchanged and only fixes the concurrency split
to use the total number of created actors.

Changes

  • Keep the current actor topology unchanged:
    • create args.num_gpus_per_node POST actors per alive Ray node
  • Change per_actor_conc from:
    • dividing by len(nodes)
  • To:
    • dividing by len(nodes) * args.num_gpus_per_node
  • Clamp the value to at least 1
  • Update the docstring to match the actual behavior (see the sketch after this list)
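
A minimal sketch of the adjusted split, using the names from this description. The surrounding code in slime/utils/http_utils.py is not reproduced here, and the placeholder values are illustrative only:

```python
import math

# Illustrative inputs; in the real code these come from Ray node discovery
# and the args / _client_concurrency values referenced in this PR.
nodes = ["node-0", "node-1"]      # alive Ray nodes
num_gpus_per_node = 8             # args.num_gpus_per_node
client_concurrency = 1024         # _client_concurrency

# Before: concurrency split only across nodes, so each of the
# num_gpus_per_node actors on a node gets the full per-node share.
per_actor_conc_old = math.ceil(client_concurrency / len(nodes))

# After: split across every created actor, clamped to at least 1.
total_actors = len(nodes) * num_gpus_per_node
per_actor_conc_new = max(1, math.ceil(client_concurrency / total_actors))

print(per_actor_conc_old, per_actor_conc_new)  # 512 64 with these values
```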

Why

Under the current logic, when there are multiple POST actors per node, each actor gets a much larger
concurrency budget than intended.

Example:

  • 8 alive nodes
  • 8 actors per node
  • _client_concurrency = 1024

Before:

  • per_actor_conc = ceil(1024 / 8) = 128

Expected under current actor topology:

  • total actors = 8 * 8 = 64
  • per_actor_conc = ceil(1024 / 64) = 16 (see the quick check below)
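
A quick plain-Python check of those numbers, plus the clamp case. This is illustrative only, not the project code:

```python
import math

client_concurrency = 1024
total_actors = 8 * 8  # 8 alive nodes * 8 actors per node

print(math.ceil(client_concurrency / 8))                     # 128 (old: split by nodes only)
print(max(1, math.ceil(client_concurrency / total_actors)))  # 16  (new: split by all actors)

# The max(1, ...) clamp guards against a zero result in the degenerate case
# where the configured concurrency is 0:
print(max(1, math.ceil(0 / total_actors)))                   # clamped to 1
```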

Scope

This PR is intentionally minimal:

  • no topology change
  • no new config
  • no routing change
  • no behavior change outside distributed POST concurrency accounting

Testing

  • No new committed test in this PR
  • Change is limited to concurrency calculation in slime/utils/http_utils.py

zhuzilin merged commit 8ef1fb4 into THUDM:main on May 9, 2026
14 of 15 checks passed