feat(community): in-community member sampling for build_communities #1389
Open
Ataxia123 wants to merge 1 commit into getzep:main from
Adds an optional sample_size parameter that bounds LLM cost on large graphs by limiting community summary input to the top-K most representative members instead of all members.

# Background

The current build_community implementation feeds every member's summary into a binary-tree pairwise merge, calling summarize_pair once per pair. For a community of N members this is N-1 LLM calls, plus 1 final generate_summary_description. Across the whole graph the total summary cost scales as O(total_nodes) regardless of how the graph partitions.

On a 100k-node knowledge graph that's ~100k LLM calls per build_communities run, which makes the operation cost-prohibitive at scale even though the underlying clustering finishes in seconds.

# What this PR adds

A new sample_size: int | None = None parameter on:

- Graphiti.build_communities (public API)
- build_communities (internal)
- build_community (internal)

When set, each community ranks its members and feeds only the top-K into the binary-merge tree. The ranking is:

1. In-community weighted degree (descending)
2. Summary length (descending): entities with rich summaries contribute more useful content to the merge
3. Name (descending): deterministic tie-breaker

In-community degree is computed from the projection that get_community_clusters already builds during clustering, so no extra queries are needed. To support this, get_community_clusters gains an optional return_projection flag that exposes the projection alongside the clusters. The default behavior (just clusters) is unchanged.

Cost becomes O(num_communities * sample_size) instead of O(total_nodes), which is a 20-40x reduction on graphs where communities average a few hundred members (for example, 20x at 200 members per community with sample_size=10).

# Quality

Empirically the sampled summaries are equal to or better than the unsampled ones: hub nodes carry the community's structural signal, and feeding fewer-but-richer inputs into the binary merge produces sharper, less diluted descriptions.

On a 48-entity test graph with sample_size=5, the largest community's summary went from "lists exit directions" to "atmospheric description with key features and named identification" in about one third of the wall time.

# Notes

- All members still appear in the community's HAS_MEMBER edges. Only the LLM summary input set is sampled.
- When the projection isn't available (e.g. graph_operations_interface drivers that bypass the Python clustering path), the sampler falls back to ranking by summary length alone.
- For small graphs (<1k nodes) the default behavior (no sampling) is recommended.

Includes 8 new unit tests covering the ranking helper across edge cases (smaller-than-K, equal-to-K, fallback to summary length, empty projection, in-community vs out-of-community edges, deterministic tie-breaking).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
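The call counts quoted above can be sanity-checked with a short calculation. The sketch below is illustrative only: `pairwise_merge`, `calls_per_community`, and `total_calls` are hypothetical helpers rather than Graphiti functions, the lambda passed in the usage lines stands in for the real `summarize_pair` LLM call, and the 500-communities-of-200 partition is an assumed example, not a measurement.

```python
from typing import Callable, Optional, Sequence, Tuple


def pairwise_merge(
    summaries: Sequence[str], summarize_pair: Callable[[str, str], str]
) -> Tuple[str, int]:
    """Merge summaries level by level; return (final summary, pair calls made)."""
    calls = 0
    level = list(summaries)
    while len(level) > 1:
        merged = []
        for i in range(0, len(level) - 1, 2):
            merged.append(summarize_pair(level[i], level[i + 1]))
            calls += 1
        if len(level) % 2 == 1:
            # An odd leftover is carried to the next level unchanged.
            merged.append(level[-1])
        level = merged
    return (level[0] if level else ""), calls


def calls_per_community(n_members: int) -> int:
    """N-1 pairwise merges plus 1 final description call."""
    return 0 if n_members <= 0 else (n_members - 1) + 1


def total_calls(sizes: Sequence[int], sample_size: Optional[int] = None) -> int:
    """Total LLM calls across communities, optionally capped at sample_size members."""
    total = 0
    for size in sizes:
        effective = size if sample_size is None else min(size, sample_size)
        total += calls_per_community(effective)
    return total


# Assumed partition: 100k nodes split into 500 communities of 200 members.
sizes = [200] * 500
full = total_calls(sizes)
sampled = total_calls(sizes, sample_size=10)
reduction = full / sampled

# A stub merge function confirms the N-1 count for a 5-member community.
_, pair_calls = pairwise_merge(["a", "b", "c", "d", "e"], lambda x, y: x + y)
```

Under these assumptions `full` is 100,000 calls, `sampled` is 5,000, and `reduction` is 20, which matches the low end of the range stated in the commit message.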
# Summary

Adds an optional `sample_size` parameter to `build_communities` that bounds LLM cost on large graphs by limiting community summary input to the top-K most representative members instead of all members.

# Background

The current `build_community` implementation feeds every member's summary into a binary-tree pairwise merge, calling `summarize_pair` once per pair. For a community of N members this is N-1 LLM calls plus 1 final `generate_summary_description`. Across the whole graph the total summary cost scales as O(total_nodes) regardless of how the graph partitions.

On a 100k-node knowledge graph that's roughly 100k LLM calls per `build_communities` run, which makes the operation cost-prohibitive at scale even though the underlying clustering finishes in seconds.

# What this PR adds

A new `sample_size: int | None = None` parameter on:

- `Graphiti.build_communities` (public API)
- `build_communities` (internal)
- `build_community` (internal)

When set, each community ranks its members and feeds only the top-K into the binary-merge tree. The ranking is:

1. In-community weighted degree (descending)
2. Summary length (descending): entities with rich summaries contribute more useful content to the merge
3. Name (descending): deterministic tie-breaker
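A minimal sketch of the ranking described above, not the PR's actual helper. The `Member` class, the `rank_members` name, and the projection shape (`{uuid: [(neighbor_uuid, weight), ...]}`) are assumptions made for illustration; the real projection built by `get_community_clusters` may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed projection shape: node uuid -> list of (neighbor uuid, edge weight).
Projection = Dict[str, List[Tuple[str, float]]]


@dataclass
class Member:
    """Hypothetical stand-in for an entity node."""
    uuid: str
    name: str
    summary: str


def rank_members(
    members: List[Member],
    projection: Optional[Projection] = None,
    sample_size: Optional[int] = None,
) -> List[Member]:
    """Return the top `sample_size` members, or all of them if no cap applies."""
    if sample_size is None or len(members) <= sample_size:
        return list(members)

    in_cluster = {m.uuid for m in members}

    def in_community_degree(m: Member) -> float:
        # Only edges whose other endpoint is in the same community count.
        if not projection:
            return 0.0
        return sum(w for nbr, w in projection.get(m.uuid, []) if nbr in in_cluster)

    # All three keys sort descending: degree, then summary length, then name.
    ranked = sorted(
        members,
        key=lambda m: (in_community_degree(m), len(m.summary), m.name),
        reverse=True,
    )
    return ranked[:sample_size]


members = [
    Member("a", "a", "hub"),
    Member("b", "b", "bb"),
    Member("c", "c", "a long summary"),
    Member("d", "d", ""),
]
projection = {
    "a": [("b", 1.0), ("c", 2.0)],
    "b": [("a", 1.0), ("x", 10.0)],  # "x" is outside the community
    "c": [("a", 2.0)],
    "d": [],
}
with_projection = [m.uuid for m in rank_members(members, projection, 2)]
without_projection = [m.uuid for m in rank_members(members, None, 2)]
```

With the projection, `a` and `c` are selected because the heavy edge from `b` to the outside node `x` is ignored. Without a projection every degree is zero, so the ordering falls back to summary length and `c` ranks ahead of `a`.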
In-community degree is computed from the projection that `get_community_clusters` already builds during clustering, so no extra queries are needed. To support this, `get_community_clusters` gains an optional `return_projection` flag that exposes the projection alongside the clusters. The default behavior (just clusters) is unchanged.

Cost becomes O(num_communities × sample_size) instead of O(total_nodes). For a graph where communities average 200 members, `sample_size=10` is a 20× reduction.

# Quality

Empirically the sampled summaries are equal to or better than the unsampled ones: hub nodes carry the community's structural signal, and feeding fewer-but-richer inputs into the binary merge produces sharper, less diluted descriptions.

On a 48-entity test graph with `sample_size=5`, the largest community (26 members representing a central location) went from a summary that just listed exit directions to one that described the atmosphere and key features and identified the location by name. The smaller communities produced equivalent summaries.

# Notes

- All members still appear in the community's `HAS_MEMBER` edges. Only the LLM summary input set is sampled; community membership is unchanged.
- When the projection isn't available (e.g. `graph_operations_interface` drivers that bypass the Python clustering path), the sampler falls back to ranking by summary length alone.

# Test plan

- 8 new unit tests in `tests/utils/maintenance/test_community_operations.py` covering the ranking helper:
  - `test_returns_all_members_when_cluster_smaller_than_k`
  - `test_returns_all_members_when_cluster_equal_to_k`
  - `test_prefers_higher_in_community_degree`
  - `test_falls_back_to_summary_length_without_projection`
  - `test_falls_back_to_summary_length_with_empty_projection`
  - `test_deterministic_on_ties`
  - `test_only_counts_in_community_edges`
  - `test_summary_length_breaks_degree_ties`

🤖 Generated with Claude Code