Skip to content

Conversation

kdy1
Copy link
Member

@kdy1 kdy1 commented Sep 6, 2025

Description:

@kdy1 kdy1 added this to the Planned milestone Sep 6, 2025
@kdy1 kdy1 self-assigned this Sep 6, 2025
Copy link

changeset-bot bot commented Sep 6, 2025

🦋 Changeset detected

Latest commit: 1db02a1

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Copy link
Contributor

claude bot commented Sep 6, 2025

Claude finished @kdy1's task —— View job


PR Review: Rebase merge of AST-breaking changes

I'll provide comprehensive feedback on this large PR focusing on code quality, potential bugs, performance, security, and test coverage.

Review Tasks:

  • Analyze AST changes and breaking modifications
  • Review Unicode handling improvements in lexer
  • Examine string literal processing changes
  • Check node ID additions to identifiers
  • Assess potential security implications
  • Evaluate test coverage for new functionality
  • Review performance impact of changes
  • Provide comprehensive feedback

🔍 Comprehensive Review Results

Strengths

  1. Robust Unicode Handling: The improvements to Unicode lone surrogate handling are well-implemented:

    • Proper distinction between valid Unicode characters and lone surrogates
    • Safe encoding using replacement character \u{FFFD}
    • Comprehensive test coverage including edge cases like \uD800
  2. AST Breaking Changes Are Well-Structured:

    • NodeId addition to Ident provides proper node tracking
    • Clean NodeId wrapper around u32 with sensible defaults
    • Proper serialization support across all formats (serde, rkyv, arbitrary)
  3. Backward Compatibility Measures:

    • Default values maintain existing behavior where possible
    • All string literal constructors default lone_surrogates: false
    • Raw string preservation logic maintains performance for common cases

🔧 Code Quality Issues

  1. Inconsistent Error Handling (crates/swc_ecma_codegen/src/lit.rs:56):

    // Good: Checks lone_surrogates before using raw
    && !self.lone_surrogates

    However, this pattern should be consistently applied everywhere raw strings are used.

  2. Magic Numbers (crates/swc_common/src/node_id.rs:14):

    pub const DUMMY: NodeId = NodeId(u32::MAX);

    Consider adding documentation explaining why u32::MAX was chosen as the dummy value.

⚠️ Potential Security Concerns

  1. Unicode Normalization: The lexer properly handles lone surrogates but doesn't address Unicode normalization attacks. However, this is likely outside the scope of this PR.

  2. String Processing Safety: The new lone_surrogates field correctly prevents invalid Unicode from being passed through as raw strings, which is a security improvement.

🚀 Performance Considerations

  1. Memory Impact: Each Ident now carries an additional 4 bytes (NodeId). With thousands of identifiers in large files, this could add up.

  2. Codegen Performance: The lone surrogate checks add minimal overhead but are properly optimized with early returns.

  3. CodSpeed Results: The PR shows no performance regression, which is excellent given the scope of changes.

🧪 Test Coverage Assessment

Excellent coverage observed:

  • Comprehensive Unicode test cases including edge cases (unicodeExtendedEscapesInStrings*)
  • Lone surrogate specific tests (lone_surrogates: true cases in lexer tests)
  • Both ES5 and ES6 target compatibility tests
  • JSX string handling with Unicode characters

🏗️ Breaking Change Impact

This is properly marked as a major version bump due to:

  • Ident struct changes requiring recompilation of all dependent code
  • String literal structure changes affecting AST consumers
  • New required fields in various constructors

📝 Minor Recommendations

  1. Documentation: Consider adding examples of lone surrogate handling to the module docs.

  2. Consistency: Ensure all Str::from() implementations consistently set lone_surrogates: false.

  3. Error Messages: Consider improving error messages when lone surrogates are encountered to help developers understand the issue.

🎯 Overall Assessment: APPROVED

This PR represents a significant and well-executed improvement to Unicode handling in SWC. The breaking changes are necessary and properly implemented, with excellent test coverage and no performance regressions. The code quality is high with only minor areas for improvement.

Key strengths:

  • Proper Unicode security handling
  • Comprehensive test suite
  • Clean API design
  • Performance-conscious implementation
  • Proper handling of edge cases

The changes successfully address Unicode lone surrogate handling while maintaining backward compatibility where possible.


Copy link

codspeed-hq bot commented Sep 6, 2025

CodSpeed Performance Report

Merging #11065 will degrade performances by 2.32%

Comparing dev/rust (1db02a1) with main (ceb0aff)1

Summary

❌ 2 regressions
✅ 138 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
es2015_destructuring 347.9 µs 355.1 µs -2.03%
es/target/es2017 275.6 µs 282.1 µs -2.32%

Footnotes

  1. No successful run was found on main (d6e0a49) during the generation of this report, so ceb0aff was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

kdy1 and others added 2 commits September 6, 2025 05:03
**Description:**

Cherry pick e39a909 from #10739 because we are going to make a Wasm-breaking change anyway

---------

Co-authored-by: magic-akari <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants