enrichment: Entity extraction produces misclassified types and duplicate entities #72

@jack-arturo

Problem

The entity extraction pipeline produces misclassified entity types, duplicate/near-duplicate entities, and nonsensical entities that pollute the graph and degrade recall quality.

Examples from Real Data

Type misclassification:

  • entity:people:scalemath — ScaleMath is an organization, not a person
  • entity:people:completed — "completed" is not a person
  • entity:people:advocacy — "advocacy" is not a person
  • entity:people:involvement — "involvement" is not a person
  • entity:people:key-findings — "key-findings" is not a person
  • entity:people:deployed-automem — not a person
  • entity:people:config-file-approach — not a person
  • entity:people:recommended — not a person

Near-duplicate entities:

Nonsensical entities:

  • entity:people:word — what is this?
  • entity:people:ud83d-udc4d — a slugified UTF-16 surrogate-pair escape (\ud83d\udc4d, the 👍 emoji), not a person
  • entity:people:falkor — partial extraction from "FalkorDB"

Impact

Each bad entity becomes a node in the graph that can be traversed during expansion, creating false connections between unrelated memories. The entity:people:alex node alone connects two completely different people (Alex Panagis and Alex Beck), causing major recall pollution.
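
To illustrate the false-connection effect, here is a minimal sketch of one-hop expansion through shared entity nodes. The memory IDs, the `memory_entities` mapping, and the `expand` function are hypothetical and only stand in for the real graph traversal, which may work differently.

```python
from collections import defaultdict

# Hypothetical data: memory IDs mapped to the entity nodes extracted from them.
memory_entities = {
    "mem-1": {"entity:people:alex"},          # a memory about Alex Panagis
    "mem-2": {"entity:people:alex"},          # a memory about Alex Beck
    "mem-3": {"entity:people:alex-panagis"},  # correctly specific entity
}

def expand(seed, memory_entities):
    """Return the memories that share at least one entity node with `seed`."""
    # Invert the mapping: entity node -> memories attached to it.
    by_entity = defaultdict(set)
    for memory_id, entities in memory_entities.items():
        for entity in entities:
            by_entity[entity].add(memory_id)

    related = set()
    for entity in memory_entities[seed]:
        related |= by_entity[entity]
    related.discard(seed)
    return related

print(expand("mem-1", memory_entities))
```

In this toy graph, expanding from the Alex Panagis memory returns the Alex Beck memory, because both hang off the ambiguous entity:people:alex node, while the memory tagged with the specific entity:people:alex-panagis node is not reached at all.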

Proposed Solutions

  1. Post-extraction validation: Run extracted entities through a validation step that checks for common patterns (single common words, Unicode escapes, possessive suffixes, etc.)

  2. Entity type verification: Cross-reference extracted entities against the content to verify type classification. "ScaleMath specializes in B2B SaaS" should not produce a people entity.

  3. Canonical entity merging: When a new entity is extracted that's a substring or variant of an existing entity (e.g. alex-panagis-founder when alex-panagis exists), merge into the canonical form.

  4. Minimum entity quality threshold: Reject single-word generic entities (entity:people:alex, entity:people:completed) when more specific alternatives exist.

  5. Periodic graph cleanup job: Admin endpoint that scans for and merges/removes low-quality entity nodes.
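
As a starting point for discussion, solutions 1, 3, and 4 could be sketched roughly as follows. Everything here is an assumption rather than existing code: the function names, the `GENERIC_WORDS` stop list, the idea that a possessive like "alex's" slugifies to `alex-s`, and the rule that only multi-token slugs may act as canonical merge targets.

```python
import re

# Assumed stop list built from the examples above; a real one would be larger.
GENERIC_WORDS = {"completed", "advocacy", "involvement", "recommended", "word"}

# Matches one slugified UTF-16 escape token such as "ud83d" or "udc4d".
UNICODE_ESCAPE = re.compile(r"^u[0-9a-f]{4}$")

def is_valid_person_slug(slug):
    """Reject slugs that match obviously bad patterns (solutions 1 and 4)."""
    if not slug:
        return False
    parts = slug.split("-")
    # Every token is a Unicode escape, e.g. "ud83d-udc4d".
    if all(UNICODE_ESCAPE.match(part) for part in parts):
        return False
    # A single generic word, e.g. "completed".
    if len(parts) == 1 and slug in GENERIC_WORDS:
        return False
    # Assumed slug form of a possessive suffix, e.g. "alex-s".
    if len(parts) > 1 and parts[-1] == "s":
        return False
    return True

def canonical_slug(new_slug, existing):
    """Map a variant onto the longest existing slug it extends (solution 3).

    Only existing slugs with two or more tokens are considered, so the
    ambiguous "alex" never absorbs "alex-beck".
    """
    new_parts = new_slug.split("-")
    best = None
    for candidate in existing:
        cand_parts = candidate.split("-")
        is_prefix = new_parts[:len(cand_parts)] == cand_parts
        if len(cand_parts) >= 2 and is_prefix:
            if best is None or len(cand_parts) > len(best.split("-")):
                best = candidate
    return best or new_slug

existing = {"alex", "alex-panagis"}
print(is_valid_person_slug("ud83d-udc4d"))
print(canonical_slug("alex-panagis-founder", existing))
```

Pattern checks like these would not catch multi-word misclassifications such as key-findings or config-file-approach, and they cannot tell that scalemath is an organization. Those cases still need the type verification described in solution 2.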


Labels: enhancement
