08. Community Detection — Mapping the districts of the knowledge graph¶

~14 min read. Group densely-connected stations into communities so broad queries don't need exhaustive traversal.

Continues from the first-principles overview in 00-first-principles.md. Communities on the knowledge graph are like city districts — clusters of entities more connected to each other than to the rest of the network. Each district gets a summary.

1) Why communities? The global query problem¶

A query like "What are the main themes in this document corpus?" has no single starting entity. Local traversal from any one node misses most of the relevant content.

Communities solve this by pre-clustering the knowledge graph into coherent groups. Each community gets a pre-generated summary. A broad query retrieves relevant community summaries instead of traversing from scratch.

  full knowledge graph               after community detection
  ───────────────              ─────────────────────────
  (thousands of stations)      ┌──────────────┐   ┌──────────────┐
  all connected                │  Community 1  │   │  Community 2  │
  no structure                 │  AI/ML nodes  │   │  Cloud infra  │
                               │  (dense)      │   │  (dense)      │
                               └──────┬────────┘   └──────┬────────┘
                                      │ sparse               │
                                      └──────────────────────┘

Dense internal edges, sparse cross-community edges. That's modularity at work.

2) Modularity: the quality score for communities¶

Modularity Q measures how much the edge density inside communities exceeds chance.

Q = (1/2m) × Σ [A_ij - k_i × k_j / 2m] × δ(c_i, c_j)

Where: - m = total edges in the graph - A_ij = 1 if relationship exists between station i and j - k_i = degree of station i (number of relationships) - δ(c_i, c_j) = 1 if i and j are in the same community

Q ranges from -0.5 (worse than random) to ~1.0 (perfect communities). Good real-world community structure: Q ≈ 0.3–0.7.

3) Worked numerical example: computing modularity¶

Mini graph: 6 nodes, 7 edges, two suspected communities.

Community A: nodes {1, 2, 3}
Community B: nodes {4, 5, 6}
Edges: (1-2), (2-3), (1-3), (4-5), (5-6), (4-6), (3-4) ← bridge edge

Total edges m = 7. Total degree sum 2m = 14.

Degrees: k_1=2, k_2=2, k_3=3, k_4=3, k_5=2, k_6=2.

For one node pair in same community — (1,2) in community A:

A_12 = 1   (edge exists)
k_1 × k_2 / 2m = 2×2/14 = 0.286
contribution = (1 - 0.286) = 0.714

For cross-community pair (3,4):

A_34 = 1 (edge exists, bridge)
k_3 × k_4 / 2m = 3×3/14 = 0.643
δ(c_3, c_4) = 0 (different communities) → contribution = 0

The bridge edge contributes 0 to Q because nodes are in different communities. Full Q computation over all pairs gives ≈ 0.35 for this graph — decent structure.

4) Louvain algorithm: scalable community detection¶

Louvain runs in two phases, repeated:

Phase 1 — local optimisation:
  for each station v:
    try moving v to each neighbour's community
    keep the move that most increases Q
    repeat until no improvement

Phase 2 — aggregation:
  collapse each community into a super-node
  build new graph of super-nodes
  repeat Phase 1 on the smaller graph

Louvain is greedy and does not guarantee the global optimum. But it runs in O(n log n) and handles graphs with billions of entities.

After Louvain: each entity has a community ID. Generate an LLM summary for each community. Store the summary as a searchable document.

5) How Microsoft GraphRAG uses communities¶

Microsoft's GraphRAG open-source project runs Louvain on extracted knowledge graphs and generates summaries at multiple granularities.

Level 0: fine-grained communities (small clusters)
Level 1: merged communities (medium)
Level 2: coarse communities (large districts)

Each level's summaries are stored. A query selects the granularity that matches its breadth. Narrow entity query → Level 0 (fine). Broad theme query → Level 2 (coarse, full districts).

┌─────────────────────────────────────────────┐
│  Query breadth            →  Summary level  │
│  specific entity          →  Level 0        │
│  department/cluster       →  Level 1        │
│  whole corpus theme       →  Level 2        │
└─────────────────────────────────────────────┘

Yes? The graph query engine at query time picks the right level automatically based on query scope.

Where this lives in the wild¶

Microsoft GraphRAG (open source) — hierarchical Louvain on document-extracted KGs; community summaries power the "global search" mode for enterprise document Q&A.
LinkedIn's industry taxonomy — company-industry-skill graphs are clustered so that recommendations stay within coherent professional communities.
Twitter/X trending topics — retweet graph communities identify topic clusters; engineers query community-level summaries to surface trending themes.
Spotify playlist graph — song-artist-genre graph communities detect music scenes; community embeddings power "fans also like" recommendations across genres.
Facebook's integrity team — coordinated inauthentic behaviour detection uses community detection to find densely-connected account clusters in the social graph.

Pause and recall¶

What does a modularity score of Q ≈ 0.35 tell you about community structure?
In the bridge-edge example, why did edge (3,4) contribute 0 to modularity?
What are the two phases of the Louvain algorithm?
Why do broad thematic queries benefit from coarser community summaries?

Interview Q&A¶

Q: Why use community detection for global queries instead of just embedding the whole graph? A: Embedding the whole graph compresses all content into one vector — it averages out thematic structure. Community summaries preserve distinct themes as separate retrievable documents. A thematic query retrieves the right district of the knowledge graph, not a blurry average.

Common wrong answer to avoid: "Larger embeddings would capture themes" — dimensionality helps but cannot recover the discrete thematic separation that communities provide.

Q: Why is Louvain preferred over spectral clustering for large knowledge graphs? A: Spectral clustering requires computing eigenvectors of the Laplacian matrix — O(n³) for naive implementations, impractical at millions of entities. Louvain is O(n log n) and handles graphs with hundreds of millions of nodes.

Common wrong answer to avoid: "Louvain is more accurate" — spectral methods are often more principled; Louvain wins on scalability, not accuracy.

Q: Why generate LLM summaries at multiple hierarchy levels instead of just one? A: Different queries need different levels of detail. A specific entity question needs a fine-grained community summary. A corpus-wide question needs a coarse summary. One level cannot serve both without missing either precision or breadth.

Common wrong answer to avoid: "Summaries at all levels are the same content" — the LLM synthesises different information depending on which community boundary it summarises.

Q: Why does the bridge edge between communities lower modularity? A: Modularity rewards edges inside communities and penalises edges between communities. A bridge edge with high degree nodes (k_3=3, k_4=3) has a large expected term k_i×k_j/2m, which the actual edge (A_ij=1) barely exceeds — net positive contribution is near zero.

Common wrong answer to avoid: "Bridge edges are simply removed" — they are kept in the graph but do not contribute to modularity scoring.

Apply now (5 min)¶

Exercise. Take a small graph of 8 nodes from a domain you know. Draw the edges. Identify two plausible communities by visual inspection. Compute the contribution of one internal edge and one bridge edge to modularity.

Sketch from memory. Draw the two-phase Louvain algorithm. Label what happens in Phase 1 (local optimisation) and Phase 2 (aggregation). Mark where community summaries are generated and stored.

Bridge. Communities answer broad "what themes exist?" questions. But many real queries need to chain multiple specific facts together — that's multi-hop reasoning, and it requires explicit path traversal over the knowledge graph. → 09-multi-hop-reasoning.md