@danielaskdd danielaskdd commented Aug 31, 2025

Enhanced Entity Relation Extraction Text Sanitization and Normalization

Summary

This PR significantly improves the text extraction and normalization pipeline in LightRAG by enhancing the sanitize_and_normalize_extracted_text function and updating its usage across entity and relationship extraction processes. The changes focus on better handling of quotes, symbols, and edge cases to improve the quality of extracted knowledge graph data.

The result is more robust text processing across diverse document types and languages, which makes LightRAG better suited to production use.

Key Changes

1. Enhanced Text Normalization (lightrag/utils.py)

Function Signature Update:

  • Changed sanitize_and_normalize_extracted_text(input_text: str, is_name=False) to sanitize_and_normalize_extracted_text(input_text: str, remove_inner_quotes=False)
  • The parameter name change from is_name to remove_inner_quotes better reflects the actual functionality

Major Improvements to normalize_extracted_info:

  1. HTML Tag Cleaning: Added removal of paragraph and line break tags (<p>, </p>, <br>, </br>)

  2. Chinese Symbol Conversion: Comprehensive conversion of Chinese full-width characters to half-width:

    • Full-width letters (A-Z, a-z) to half-width
    • Full-width numbers (0-9) to half-width
    • Full-width symbols (-, +, /, *) to half-width
    • Full-width space (U+3000) to regular space
  3. Enhanced Quote Handling:

    • Improved outer quote removal logic with safety checks
    • Added support for Chinese-style outer quotes removal
    • Better handling of nested quotes to prevent over-removal
  4. Numeric Content Filtering:

    • Filter out short numeric-only content (length < 3, digits only)
    • Filter out mixed numeric and dot content (e.g., "1.2.3", "12.3") with length < 6
  5. Better Error Prevention: Added safety checks to prevent empty results and handle edge cases
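
The five steps above can be sketched roughly as follows. This is a minimal illustration written from the description in this PR, not the code in lightrag/utils.py: the name normalize_sketch, the translation table, the quote-pair list, and the exact regular expressions are all assumptions. In particular, the second numeric rule is read here as requiring at least one dot, and remove_inner_quotes is assumed to drop every remaining quote character.

```python
import re

# Assumed mapping: full-width letters, digits and the four listed symbols
# sit 0xFEE0 above their ASCII counterparts; U+3000 is the full-width space.
_FULLWIDTH_CODEPOINTS = (
    list(range(0xFF21, 0xFF3B))      # full-width A-Z
    + list(range(0xFF41, 0xFF5B))    # full-width a-z
    + list(range(0xFF10, 0xFF1A))    # full-width 0-9
    + [0xFF0D, 0xFF0B, 0xFF0F, 0xFF0A]  # full-width - + / *
)
_TO_HALFWIDTH = {cp: cp - 0xFEE0 for cp in _FULLWIDTH_CODEPOINTS}
_TO_HALFWIDTH[0x3000] = 0x20

# Assumed outer quote pairs: ASCII quotes plus Chinese-style curly quotes.
_QUOTE_PAIRS = [('"', '"'), ("'", "'"), ("\u201c", "\u201d"), ("\u2018", "\u2019")]
_ALL_QUOTES = "\"'\u201c\u201d\u2018\u2019"


def normalize_sketch(text: str, remove_inner_quotes: bool = False) -> str:
    """Illustrative normalizer; returns "" when the text should be filtered out."""
    # 1. Drop <p>, </p>, <br>, </br> and their self-closing variants.
    text = re.sub(r"</?(?:p|br)\s*/?>", "", text, flags=re.IGNORECASE)

    # 2. Convert full-width characters to half-width.
    text = text.translate(_TO_HALFWIDTH).strip()

    # 3. Strip one layer of matching outer quotes, but only when the inner
    #    text is non-empty and holds no quote of the same kind (nested case).
    for open_q, close_q in _QUOTE_PAIRS:
        if len(text) >= 2 and text.startswith(open_q) and text.endswith(close_q):
            inner = text[1:-1]
            if inner.strip() and open_q not in inner and close_q not in inner:
                text = inner.strip()
            break

    # Optionally drop any quotes that remain inside the text.
    if remove_inner_quotes:
        text = "".join(ch for ch in text if ch not in _ALL_QUOTES).strip()

    # 4. Filter short numeric-only content and short digit-and-dot content.
    if len(text) < 3 and re.fullmatch(r"[0-9]+", text):
        return ""
    if len(text) < 6 and "." in text and re.fullmatch(r"[0-9.]+", text):
        return ""
    return text
```

Returning an empty string for filtered content lets a caller treat "nothing usable was extracted" as a single case, whichever rule triggered it.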

2. Updated Entity Extraction (lightrag/operate.py)

Enhanced Parameter Usage:

  • Updated _handle_single_entity_extraction to use remove_inner_quotes=True for entity names and types
  • Updated _handle_single_relationship_extraction to use remove_inner_quotes=True for source, target, and keywords

Improved Validation:

  • Enhanced entity type validation with character filtering for ["'", "(", ")", "<", ">", "|", "/", "\\"]
  • Better error handling and logging for invalid extractions
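
As an illustration of the entity type check, here is a minimal sketch. It assumes that a type containing any of the listed characters is rejected outright (rather than having the characters stripped) and that the rejection is logged as a warning. The function name is_valid_entity_type and the logger name are hypothetical and do not come from lightrag/operate.py.

```python
import logging

logger = logging.getLogger("extraction_sketch")  # hypothetical logger name

# Characters this PR lists as invalid inside an entity type.
INVALID_ENTITY_TYPE_CHARS = ["'", "(", ")", "<", ">", "|", "/", "\\"]


def is_valid_entity_type(entity_type: str) -> bool:
    """Return False and log a warning for empty or malformed entity types."""
    if not entity_type.strip():
        logger.warning("Entity type is empty")
        return False
    # Collect every offending character so the log message is actionable.
    found = [ch for ch in INVALID_ENTITY_TYPE_CHARS if ch in entity_type]
    if found:
        logger.warning(
            "Entity type %r contains invalid characters: %s", entity_type, found
        )
        return False
    return True
```

For example, is_valid_entity_type("organization") passes, while "person/organization" is rejected because of the slash.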

Improve extraction error handling and field validation:

  • Add field count validation warnings
  • Fix relationship field count (5→6)
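
A field count check of this kind could look like the sketch below. Only the relationship count of 6 comes from this PR. The "<|>" delimiter, the entity count of 4, the record layout, and the names split_record and EXPECTED_FIELDS are assumptions made for illustration.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("extraction_sketch")  # hypothetical logger name

TUPLE_DELIMITER = "<|>"  # assumed delimiter between fields
# Relationship count (6) is from the PR; the entity count is an assumption.
EXPECTED_FIELDS = {"entity": 4, "relationship": 6}


def split_record(record: str) -> Optional[List[str]]:
    """Split one extracted record and warn when the field count is off."""
    fields = [field.strip() for field in record.split(TUPLE_DELIMITER)]
    kind = fields[0].strip('"').lower()
    expected = EXPECTED_FIELDS.get(kind)
    if expected is None:
        logger.warning("Unknown record type %r, skipping", kind)
        return None
    if len(fields) != expected:
        # A warning, not an error: one malformed record should not abort the run.
        logger.warning(
            "%s record has %d fields, expected %d: %r",
            kind, len(fields), expected, record,
        )
        return None
    return fields
```

With these assumed counts, a relationship record that carries only five fields is skipped with a warning instead of being parsed with a missing value.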

Technical Benefits

  1. Data Quality: Significantly improved extraction accuracy by handling various text formatting issues
  2. Consistency: Standardized text normalization across different input formats
  3. Robustness: Better handling of edge cases and malformed input
  4. Internationalization: Enhanced support for Chinese text processing
  5. Maintainability: Clearer parameter naming and improved code documentation

Impact on Knowledge Graph

These improvements will result in:

  • Cleaner entity and relationship names in the knowledge graph
  • Reduced duplicate entities caused by formatting variations
  • Better handling of multilingual content
  • More reliable text extraction from various document formats

Backward Compatibility

The changes maintain backward compatibility while providing enhanced functionality. The parameter rename is internal and doesn't affect the public API.

- Improve handling of redundant quotes in entity and relation names, types, and keywords
- Add HTML tag cleaning and Chinese symbol conversion
- Filter out short numeric content and malformed text
- Enhance entity type validation with character filtering
• Add field count validation warnings
• Fix relationship field count (5→6)
• Change error logs to warnings
@danielaskdd danielaskdd closed this pull request by merging all changes into HKUDS:main in cdc4570 Aug 31, 2025
@danielaskdd danielaskdd deleted the normalize-entity-name branch August 31, 2025 17:00