refac: Enhanced Entity Relation Extraction Text Sanitization and Normalization #2031
Enhanced Entity Relation Extraction Text Sanitization and Normalization
Summary
This PR improves the text extraction and normalization pipeline in LightRAG by enhancing the sanitize_and_normalize_extracted_text function and updating its usage across the entity and relationship extraction processes. The changes focus on better handling of quotes, symbols, and edge cases to improve the quality of extracted knowledge graph data. The goal is to make LightRAG's text processing more robust and better suited to production use with diverse document types and languages.
Key Changes
1. Enhanced Text Normalization (lightrag/utils.py)

Function Signature Update:
- sanitize_and_normalize_extracted_text(input_text: str, is_name=False) to sanitize_and_normalize_extracted_text(input_text: str, remove_inner_quotes=False)
- Renaming is_name to remove_inner_quotes better reflects the actual functionality

Major Improvements to normalize_extracted_info:
- HTML Tag Cleaning: Added removal of paragraph and line break tags (<p>, </p>, <br>, </br>)
- Chinese Symbol Conversion: Comprehensive conversion of Chinese full-width characters to half-width:
- Enhanced Quote Handling:
- Numeric Content Filtering:
- Better Error Prevention: Added safety checks to prevent empty results and handle edge cases
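
To make the tag cleaning and full-width conversion items above concrete, here is a minimal sketch. It is not the code from this PR: the helper names strip_simple_tags and fullwidth_to_halfwidth are hypothetical, and the conversion assumes the standard Unicode offset between the full-width forms block (U+FF01 to U+FF5E) and ASCII, which may cover more or fewer characters than the actual change does.

```python
import re

# Matches the paragraph and line-break tags named in the PR description
# (<p>, </p>, <br>, </br>), plus self-closing variants such as <br/>.
_SIMPLE_TAG_RE = re.compile(r"</?(?:p|br)\s*/?>", re.IGNORECASE)


def strip_simple_tags(text: str) -> str:
    """Remove <p>/<br> style tags; hypothetical helper, not LightRAG's API."""
    return _SIMPLE_TAG_RE.sub("", text)


def fullwidth_to_halfwidth(text: str) -> str:
    """Map full-width characters to their half-width ASCII counterparts.

    Assumption: every code point in U+FF01..U+FF5E is shifted by 0xFEE0 and
    the ideographic space U+3000 becomes a regular space. The real function
    may convert only a hand-picked subset of symbols.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)
    return "".join(converted)
```

Running both helpers in sequence, tags first and then width conversion, keeps each step easy to test on its own.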
2. Updated Entity Extraction (lightrag/operate.py)

Enhanced Parameter Usage:
- Updated _handle_single_entity_extraction to use remove_inner_quotes=True for entity names and types
- Updated _handle_single_relationship_extraction to use remove_inner_quotes=True for source, target, and keywords

Improved Validation:
- ["'", "(", ")", "<", ">", "|", "/", "\\"]
- Improve extraction error handling and field validation
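
The sketch below illustrates how a remove_inner_quotes flag and a character check could fit together. It is a sketch under stated assumptions, not the code in lightrag/operate.py: sanitize_sketch and has_forbidden_chars are hypothetical stand-ins, the quote-stripping behavior is a guess at what "inner quotes" means, and treating the character list above as a reject list for field values is also an assumption.

```python
# Character list copied from the PR description; using it as a reject list
# is an assumption made for this illustration.
FORBIDDEN_CHARS = ["'", "(", ")", "<", ">", "|", "/", "\\"]


def sanitize_sketch(text: str, remove_inner_quotes: bool = False) -> str:
    """Hypothetical stand-in for sanitize_and_normalize_extracted_text."""
    cleaned = text.strip()
    # Strip one pair of matching quotes that wraps the whole value.
    if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "\"'":
        cleaned = cleaned[1:-1].strip()
    if remove_inner_quotes:
        # Assumed meaning of the flag: drop any quotes left inside the value.
        cleaned = cleaned.replace('"', "").replace("'", "")
    return cleaned


def has_forbidden_chars(value: str) -> bool:
    """Return True if the value contains any character from the list."""
    return any(ch in value for ch in FORBIDDEN_CHARS)


# Entity names are sanitized with the flag on, then the type is validated.
name = sanitize_sketch('  "Acme "Labs" Inc"  ', remove_inner_quotes=True)
entity_type = sanitize_sketch("organization", remove_inner_quotes=True)
is_valid = bool(name) and not has_forbidden_chars(entity_type)
```

Keeping the flag off by default means callers that pass free-form descriptions keep their inner quotes, while names, types, and keywords can opt in.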
Technical Benefits
Impact on Knowledge Graph
These improvements will result in:
Backward Compatibility
The changes maintain backward compatibility while providing enhanced functionality. The parameter rename is internal and doesn't affect the public API.