refac: Enhanced Entity Relation Extraction Text Sanitization and Normalization #2031

danielaskdd · 2025-08-31T05:24:34Z

Enhanced Entity Relation Extraction Text Sanitization and Normalization

Summary

This PR significantly improves the text extraction and normalization pipeline in LightRAG by enhancing the sanitize_and_normalize_extracted_text function and updating its usage across entity and relationship extraction processes. The changes focus on better handling of quotes, symbols, and edge cases to improve the quality of extracted knowledge graph data.

Together, these changes make LightRAG's text processing more robust across diverse document types and languages.

Key Changes

1. Enhanced Text Normalization (`lightrag/utils.py`)

Function Signature Update:

Changed `sanitize_and_normalize_extracted_text(input_text: str, is_name=False)` to `sanitize_and_normalize_extracted_text(input_text: str, remove_inner_quotes=False)`.
The new parameter name, `remove_inner_quotes`, describes what the flag actually does more accurately than `is_name` did.

Major Improvements to normalize_extracted_info:

HTML Tag Cleaning: Added removal of paragraph and line break tags (<p>, </p>, <br>, </br>)
Chinese Symbol Conversion: Comprehensive conversion of Chinese full-width characters to half-width:
- Full-width letters (Ａ-Ｚ, ａ-ｚ) to half-width
- Full-width numbers (０-９) to half-width
- Full-width symbols (－, ＋, ／, ＊) to half-width
- Full-width space (　) to regular space
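
The conversions listed above can be sketched with a single translation table. This is a minimal illustration, not the code from `lightrag/utils.py`: the helper name `to_half_width` and the exact character set are assumptions based only on the list in this description.

```python
# Hypothetical helper: the name and character coverage are assumptions,
# chosen to mirror the four bullet points above.
_FULL_WIDTH = (
    "".join(chr(c) for c in range(0xFF21, 0xFF3B))    # full-width A-Z
    + "".join(chr(c) for c in range(0xFF41, 0xFF5B))  # full-width a-z
    + "".join(chr(c) for c in range(0xFF10, 0xFF1A))  # full-width 0-9
    + "\uFF0D\uFF0B\uFF0F\uFF0A"                      # full-width - + / *
    + "\u3000"                                        # full-width space
)
_HALF_WIDTH = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    + "abcdefghijklmnopqrstuvwxyz"
    + "0123456789"
    + "-+/*"
    + " "
)
# str.maketrans requires both strings to have the same length (67 here).
_TABLE = str.maketrans(_FULL_WIDTH, _HALF_WIDTH)


def to_half_width(text: str) -> str:
    """Map the full-width characters above to ASCII; leave everything else alone."""
    return text.translate(_TABLE)
```

Characters outside the table, including CJK ideographs, pass through unchanged, so only the width of Latin letters, digits, and the listed symbols is normalized.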
Enhanced Quote Handling:
- Improved outer quote removal logic with safety checks
- Added support for Chinese-style outer quotes removal
- Better handling of nested quotes to prevent over-removal
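
One way to read the quote-handling rules above is sketched below. The function name `strip_outer_quotes`, the set of quote pairs, and the specific safety checks are assumptions for illustration and may differ from what the PR actually implements.

```python
# Assumed quote pairs: ASCII double and single quotes plus the curly quotes
# commonly used in Chinese text. The real list in the PR may differ.
_QUOTE_PAIRS = [
    ('"', '"'),
    ("'", "'"),
    ("\u201c", "\u201d"),  # curly double quotes
    ("\u2018", "\u2019"),  # curly single quotes
]


def strip_outer_quotes(text: str) -> str:
    """Remove one matching pair of outer quotes when it is safe to do so."""
    text = text.strip()
    for open_q, close_q in _QUOTE_PAIRS:
        if len(text) >= 2 and text.startswith(open_q) and text.endswith(close_q):
            inner = text[1:-1]
            # Safety check 1: if the same quote characters also occur inside,
            # the outer pair may really be two separate quoted segments,
            # so leave the text untouched to avoid over-removal.
            if open_q in inner or close_q in inner:
                continue
            # Safety check 2: never return an empty result.
            if not inner.strip():
                continue
            return inner.strip()
    return text
```

With these checks, `"Apple"` becomes `Apple`, while `"A" and "B"` is returned unchanged because the inner text still contains quote characters.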
Numeric Content Filtering:
- Filter out short numeric-only content (length < 3, digits only)
- Filter out mixed numeric and dot content (e.g., "1.2.3", "12.3") with length < 6
Better Error Prevention: Added safety checks to prevent empty results and handle edge cases
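
The numeric filtering thresholds above can be expressed as a small predicate-style function. This is a sketch under stated assumptions: the name `filter_numeric_noise` is hypothetical, returning an empty string to signal "discard" is an assumption, and requiring at least one dot in the second rule is one reading of "mixed numeric and dot content".

```python
import re


def filter_numeric_noise(text: str) -> str:
    """Return "" for content the rules above would discard, else the stripped text."""
    cleaned = text.strip()
    if not cleaned:
        return ""
    # Rule 1: digits only and shorter than 3 characters (e.g. "7", "42").
    if len(cleaned) < 3 and re.fullmatch(r"[0-9]+", cleaned):
        return ""
    # Rule 2: only digits and dots, containing at least one dot,
    # and shorter than 6 characters (e.g. "1.2.3", "12.3").
    if len(cleaned) < 6 and "." in cleaned and re.fullmatch(r"[0-9.]+", cleaned):
        return ""
    return cleaned
```

A caller would then treat an empty return value as a signal to skip the extracted item rather than store it in the graph.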

2. Updated Entity Extraction (`lightrag/operate.py`)

Enhanced Parameter Usage:

Updated _handle_single_entity_extraction to use remove_inner_quotes=True for entity names and types
Updated _handle_single_relationship_extraction to use remove_inner_quotes=True for source, target, and keywords

Improved Validation:

Enhanced entity type validation with character filtering for ["'", "(", ")", "<", ">", "|", "/", "\\"]
Better error handling and logging for invalid extractions
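
As an illustration of the entity type validation described above, the sketch below rejects a type that is empty or contains any of the listed characters and logs a warning. The function name `validate_entity_type`, the logger name, and the choice to reject rather than strip the characters are assumptions, not confirmed details of `lightrag/operate.py`.

```python
import logging
from typing import Optional

logger = logging.getLogger("extraction_sketch")  # hypothetical logger name

# Character list taken from the PR description.
_INVALID_TYPE_CHARS = ["'", "(", ")", "<", ">", "|", "/", "\\"]


def validate_entity_type(entity_type: str) -> Optional[str]:
    """Return the cleaned entity type, or None if it should be discarded."""
    cleaned = entity_type.strip()
    if not cleaned or any(ch in cleaned for ch in _INVALID_TYPE_CHARS):
        # Assumption: invalid types are dropped with a warning, not repaired.
        logger.warning("Discarding invalid entity type: %r", entity_type)
        return None
    return cleaned
```

Returning `None` lets the caller skip the whole entity record, which keeps stray delimiters or markup from leaking into the graph as type names.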

Improve extraction error handling and field validation

Add field count validation warnings
Fix relationship field count (5→6)
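
A field count check of the kind mentioned above might look like the following. Only the expected count of 6 for relationship records comes from this PR; the helper name `check_field_count`, the logger name, and the sample record contents are hypothetical.

```python
import logging
from typing import List

logger = logging.getLogger("extraction_sketch")  # hypothetical logger name


def check_field_count(fields: List[str], expected: int, kind: str) -> bool:
    """Warn and return False when a parsed record has the wrong number of fields."""
    if len(fields) != expected:
        logger.warning(
            "%s record has %d fields, expected %d: %r",
            kind, len(fields), expected, fields,
        )
        return False
    return True


# Sample record with made-up contents; 6 is the relationship count from the PR.
sample = ["relationship", "Alice", "Bob", "works with", "collaboration", "0.9"]
is_valid = check_field_count(sample, expected=6, kind="relationship")
```

Logging at warning level rather than error level matches the change noted in the commit message, since a malformed record from the model is recoverable by skipping it.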

Technical Benefits

Data Quality: Significantly improved extraction accuracy by handling various text formatting issues
Consistency: Standardized text normalization across different input formats
Robustness: Better handling of edge cases and malformed input
Internationalization: Enhanced support for Chinese text processing
Maintainability: Clearer parameter naming and improved code documentation

Impact on Knowledge Graph

These improvements will result in:

Cleaner entity and relationship names in the knowledge graph
Reduced duplicate entities caused by formatting variations
Better handling of multilingual content
More reliable text extraction from various document formats

Backward Compatibility

The changes maintain backward compatibility while providing enhanced functionality. The parameter rename is internal and doesn't affect the public API.

- Improve handling of redundant quotes in entity and relation names, types, and keywords
- Add HTML tag cleaning and Chinese symbol conversion
- Filter out short numeric content and malformed text
- Enhance entity type validation with character filtering

danielaskdd added 4 commits August 31, 2025 10:36

refactor: Merge multi-step text sanitization into single function

d4bbc5d

Improve extraction error handling and field validation

97c9600

- Add field count validation warnings
- Fix relationship field count (5→6)
- Change error logs to warnings

Fix typo in relationship extraction log messages

75de40d

danielaskdd closed this pull request by merging all changes into HKUDS:main in cdc4570 Aug 31, 2025

danielaskdd deleted the normalize-entity-name branch August 31, 2025 17:00
