
feat(community): in-community member sampling for build_communities#1389

Open
Ataxia123 wants to merge 1 commit into getzep:main from NERDDAO:feat/community-summary-sampling

Conversation

@Ataxia123

Summary

Adds an optional sample_size parameter to build_communities that bounds LLM cost on large graphs by limiting community summary input to the top-K most representative members instead of all members.

Background

The current build_community implementation feeds every member's summary into a binary-tree pairwise merge, calling summarize_pair once per pair. For a community of N members this is N-1 LLM calls plus 1 final generate_summary_description. Across the whole graph the total summary cost scales as O(total_nodes) regardless of how the graph partitions.

On a 100k-node knowledge graph that's roughly 100k LLM calls per build_communities run, which makes the operation cost-prohibitive at scale even though the underlying clustering finishes in seconds.
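The call-count arithmetic above can be sketched directly. This is illustrative only; `llm_calls_per_community` and `total_llm_calls` are made-up names, not library code:

```python
# Per-community cost of the binary-tree merge described above:
# N-1 summarize_pair calls plus 1 final generate_summary_description.
def llm_calls_per_community(n_members: int) -> int:
    return (n_members - 1) + 1  # simplifies to N: cost tracks member count

def total_llm_calls(community_sizes: list[int]) -> int:
    return sum(llm_calls_per_community(n) for n in community_sizes)

# 1,000 communities of ~100 members each on a 100k-node graph:
print(total_llm_calls([100] * 1000))  # 100000 calls, i.e. O(total_nodes)
```

Because the per-community count collapses to N, no partition of the graph reduces total cost, which is exactly why sampling the input set is the lever that helps.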

What this PR adds

A new sample_size: int | None = None parameter on:

  • Graphiti.build_communities (public API)
  • build_communities (internal)
  • build_community (internal)

When set, each community ranks its members and feeds only the top-K into the binary-merge tree. The ranking is:

  1. In-community weighted degree (descending) — hub nodes that define the community's structure
  2. Summary length (descending) — entities with rich summaries contribute more useful content to the merge
  3. Name (descending) — deterministic tie-breaker

In-community degree is computed from the projection that get_community_clusters already builds during clustering — no extra queries. To support this, get_community_clusters gains an optional return_projection flag that exposes the projection alongside the clusters. The default behavior (just clusters) is unchanged.

Cost becomes O(num_communities × sample_size) instead of O(total_nodes). For a graph where communities average 200 members, sample_size=10 is a 20× reduction.
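The 20× figure follows from plugging the stated numbers into the two cost formulas (the community counts below are illustrative, chosen to match the averages above):

```python
# Worked arithmetic for the claim above: a 100k-node graph partitioned
# into 500 communities averaging 200 members, with sample_size=10.
num_communities, avg_members, k = 500, 200, 10

unsampled = num_communities * avg_members  # O(total_nodes): 100000 calls
sampled = num_communities * k              # O(num_communities * k): 5000 calls
print(unsampled // sampled)                # 20x reduction
```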

Quality

Empirically, the sampled summaries are equal to or better than the unsampled ones — hub nodes carry the community's structural signal, and feeding fewer-but-richer inputs into the binary merge produces sharper, less diluted descriptions.

On a 48-entity test graph with sample_size=5:

| Metric | Full | Sampled (k=5) |
| --- | --- | --- |
| Wall time | 41.2s | 13.3s (3.1× faster) |
| Communities found | 13 | 13 (identical) |
| Member edges | 48 | 48 (identical) |
| LLM calls | ~50 | ~15 |

The largest community (26 members representing a central location) went from a summary that just listed exit directions to one that mentioned the atmosphere, key features, and identified the location by name. The smaller communities produced equivalent summaries.

Notes

  • All members still appear in the community's HAS_MEMBER edges. Only the LLM summary input set is sampled — community membership is unchanged.
  • When the projection isn't available (e.g. graph_operations_interface drivers that bypass the Python clustering path), the sampler falls back to ranking by summary length alone.
  • For small graphs (<1k nodes) the default behavior (no sampling) is recommended.
  • This PR is independent of fix(community): async label_propagation with oscillation detection #1388 (label propagation oscillation fix) — they touch different functions and can land in either order.

Test plan

  • 8 new unit tests in tests/utils/maintenance/test_community_operations.py covering the ranking helper:
    • test_returns_all_members_when_cluster_smaller_than_k
    • test_returns_all_members_when_cluster_equal_to_k
    • test_prefers_higher_in_community_degree
    • test_falls_back_to_summary_length_without_projection
    • test_falls_back_to_summary_length_with_empty_projection
    • test_deterministic_on_ties
    • test_only_counts_in_community_edges
    • test_summary_length_breaks_degree_ties
  • All 8 pass locally in 0.69s
  • Verified end-to-end against a real Neo4j-backed graph: same number of communities, 3× faster, equal-or-better summary quality
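As a sketch of what the tie-breaking tests above verify (the stand-in `top_k` here is not the PR's actual helper; it just implements the documented sort order over `(name, summary, in_community_degree)` triples):

```python
def top_k(members: list[tuple[str, str, float]], k: int) -> list[str]:
    # Stand-in ranking: degree desc, then summary length desc, then name desc.
    ranked = sorted(members, key=lambda m: (m[2], len(m[1]), m[0]), reverse=True)
    return [m[0] for m in ranked[:k]]


def test_deterministic_on_ties() -> None:
    a = ("alpha", "same len", 1.0)
    b = ("bravo", "same len", 1.0)
    # Equal degree and summary length: the name tie-breaker decides,
    # independent of input order.
    assert top_k([a, b], 1) == top_k([b, a], 1) == ["bravo"]
```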

🤖 Generated with Claude Code

