
Conversation

wlwilliamx
Collaborator

What problem does this PR solve?

Issue Number: close #1220

What is changed and how it works?

This pull request introduces a robust, deeply integrated event filtering system. It allows users to define flexible rules that include or exclude events from the replication stream based on event type (e.g., INSERT, CREATE TABLE), SQL patterns matched against DDL queries, and expressions evaluated against DML column values.

The primary goal of this feature is to give users precise control over the data flow, enabling them to easily ignore changes from temporary tables, log tables, or specific business operations.

Core Architectural Design: Early Filtering

The core design principle of this refactoring is to move the filtering logic as early as possible into the event processing pipeline. Previous filtering methods often operated on fully constructed and materialized event objects, which incurred unnecessary computational and memory costs.

This new design refactors the core Filter interface to operate directly on lower-level primitives like chunk.Row instead of high-level model.RowChangedEvent objects. The filtering logic is injected directly into the row appending process of DMLEvent (AppendRow method), achieving a highly efficient "filter-while-building" model. This means that if a row is filtered out by a rule, it is discarded immediately and is never decoded or added to the final event batch, eliminating wasted resources at the source.


Detailed Breakdown of Key Changes

1. Core Filter Interface Refactoring (pkg/filter/filter.go)

To enable the "Early Filtering" design, the central Filter interface was fundamentally refactored:

  • DML Filtering Interface Change:

    • Old Interface: ShouldIgnoreDMLEvent(dml *model.RowChangedEvent, rawRow model.RowChangedDatums, tableInfo *model.TableInfo)
    • New Interface: ShouldIgnoreDML(dmlType common.RowType, preRow, row chunk.Row, ti *common.TableInfo)
    • Explanation: The new interface no longer depends on the high-level, fully wrapped model.RowChangedEvent. Instead, it operates directly on the underlying row data primitive, chunk.Row. This is the key change that allows the filtering logic to be applied at a much earlier stage of event construction, leading to significant performance gains.
  • DDL Filtering Interface Consolidation:

    • Old Interface: Had two separate methods, ShouldDiscardDDL and ShouldIgnoreDDLEvent.
    • New Interface: Merged into a single, streamlined method: ShouldIgnoreDDL(schema, table, query string, ddlType timodel.ActionType, tableInfo *timodel.TableInfo).
    • Explanation: This consolidates two similar functions into one, simplifying the API. The new signature is more explicit, taking DDL metadata (schema, table, query) directly as arguments, which makes the function purer and easier to use. A sketch of the consolidated interface follows this list.
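
Taken together, the refactored interface looks roughly like the sketch below. Only the two method signatures are taken from the description above; the return values, import paths, and any omitted methods are assumptions.

```go
// Sketch of the refactored Filter interface, reconstructed from the signatures
// described above. Return values and omitted methods are assumptions.
type Filter interface {
    // ShouldIgnoreDML is evaluated against raw chunk.Row primitives, before a
    // high-level RowChangedEvent is ever materialized.
    ShouldIgnoreDML(dmlType common.RowType, preRow, row chunk.Row, ti *common.TableInfo) (bool, error)

    // ShouldIgnoreDDL replaces the former ShouldDiscardDDL / ShouldIgnoreDDLEvent
    // pair and takes the DDL metadata directly as arguments.
    ShouldIgnoreDDL(schema, table, query string, ddlType timodel.ActionType, tableInfo *timodel.TableInfo) (bool, error)
}
```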

2. Deep Integration of DML Filtering into Event Construction (pkg/common/event/dml_event.go)

  • The signature of the DMLEvent.AppendRow method was modified to accept a filter.Filter instance.
  • Inside the method, the provided filter is now used to immediately evaluate the row. If filter.ShouldIgnoreDML returns true, the row is discarded instantly, and subsequent logic for appending and counting is skipped.
  • Explanation: This is the core implementation of the "Early Filtering" design. The filtering check is no longer an after-the-fact step but a native part of event construction itself, ensuring that only rows that pass the rules are processed. A sketch of this flow follows the list.
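
A minimal sketch of the filter-while-building step inside AppendRow, assuming a decode callback that turns a raw KV entry into chunk.Row values. Helper and field names here are hypothetical; only the call to ShouldIgnoreDML mirrors the new interface.

```go
// Illustrative sketch only: decodeFn, appendToChunk, and Length are hypothetical
// names; the point is that the filter runs before the row is ever kept.
// decodeFn is assumed to be:
//   func(*common.RawKVEntry, *common.TableInfo) (common.RowType, chunk.Row, chunk.Row, error)
func (e *DMLEvent) AppendRow(raw *common.RawKVEntry, decode decodeFn, f filter.Filter) error {
    rowType, preRow, row, err := decode(raw, e.TableInfo)
    if err != nil {
        return err
    }
    ignore, err := f.ShouldIgnoreDML(rowType, preRow, row, e.TableInfo)
    if err != nil {
        return err
    }
    if ignore {
        // The row is dropped right here: it is never appended, counted,
        // or turned into a downstream row-changed event.
        return nil
    }
    e.appendToChunk(rowType, preRow, row)
    e.Length++
    return nil
}
```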

3. Expression-Based Filter (dmlExprFilter) Enhancements (pkg/filter/expr_filter.go)

The component responsible for filtering DML by column values has received significant upgrades:

  • Improved Table Schema Cache: The internal tables map, used for caching table schemas, has been changed. Its key is now the int64 table ID instead of the string table name, and its value is the uint64 table version (UpdateTS) instead of the full TableInfo object.
  • Explanation: Using the table ID as the key is far more robust than using the table name. Table names can change via RENAME TABLE operations, while table IDs are immutable, so filtering rules continue to apply correctly even after a table is renamed (see the cache sketch after this list).
  • The shouldSkipDML method signature was updated to align with the new Filter interface, accepting chunk.Row as input to filter on the raw data directly.
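
A hedged sketch of what the new cache check might look like. The method name verifyAndCache and the accessors on TableInfo are illustrative; the key and value types follow the description above.

```go
// tables caches, per physical table ID, the schema version (UpdateTS) that the
// compiled filter expressions were last built against. Names are illustrative;
// "sync" is imported for the mutex.
type dmlExprFilter struct {
    mu     sync.Mutex
    tables map[int64]uint64 // table ID -> schema version (UpdateTS)
    // ... rules and compiled expressions omitted
}

func (f *dmlExprFilter) verifyAndCache(ti *common.TableInfo) (changed bool) {
    f.mu.Lock()
    defer f.mu.Unlock()
    if ver, ok := f.tables[ti.ID()]; ok && ver == ti.UpdateTS() {
        // Schema unchanged: previously compiled expressions are still valid.
        return false
    }
    // New table or schema change (e.g. ADD COLUMN): the caller recompiles the
    // expressions against the new TableInfo, and we remember its version.
    // Because the key is the immutable table ID, RENAME TABLE neither
    // invalidates nor orphans the entry.
    f.tables[ti.ID()] = ti.UpdateTS()
    return true
}
```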

4. SQL Event Filter (sqlEventFilter) Adaptation (pkg/filter/sql_event_filter.go)

This filter handles rules based on event type (e.g., INSERT, CREATE TABLE) and SQL regex. It has been updated to support the new architecture:

  • The signatures for shouldSkipDDL and shouldSkipDML have been updated to no longer accept the full model.DDLEvent or model.RowChangedEvent structs.
  • Explanation: This change aligns the component with the new Filter interface. By accepting more primitive arguments (such as schema, table name, and DDL type), it reduces coupling between components; a sketch of the adapted DDL path follows this list.
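
Conceptually, the adapted DDL path now works on those primitives alone. The sketch below is a rough rendering under assumed rule-struct fields (tableMatcher, ignoredDDLTypes, ignoredSQLRegexps); only the argument list mirrors the new primitive-argument style.

```go
// Illustrative sketch: rule fields are hypothetical, but the inputs mirror the
// new primitive-argument style instead of a full model.DDLEvent.
func (f *sqlEventFilter) shouldSkipDDL(schema, table, query string, ddlType timodel.ActionType) (bool, error) {
    for _, rule := range f.rules {
        if !rule.tableMatcher.MatchTable(schema, table) {
            continue
        }
        // Skip the DDL if its type is listed in the rule's ignore-event list...
        if rule.ignoredDDLTypes[ddlType] {
            return true, nil
        }
        // ...or if the raw query matches one of the ignore-sql regular expressions.
        for _, re := range rule.ignoredSQLRegexps {
            if re.MatchString(query) {
                return true, nil
            }
        }
    }
    return false, nil
}
```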

5. eventservice Integration (pkg/eventservice/event_scanner.go)

To ensure the filter is active at the earliest point of data retrieval, the changes were propagated throughout the eventservice:

  • The eventScanner, when processing raw KV events, now passes the filter object down to the dmlProcessor.
  • The dmlProcessor, in turn, uses this filter when processing new transactions (processNewTransaction) and appending rows (appendRow).
  • Explanation: This places the filtering logic at the very front of the data processing pipeline. Filtering begins the moment eventScanner reads raw KV data and starts assembling DML events, maximizing efficiency (see the sketch after this list).
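
In rough terms, the scanner's DML path now threads the filter down to the point where rows are appended. The sketch below is purely illustrative: field names such as mounter, currentDML, and DecodeToChunk are assumptions, while AppendRow accepting the filter reflects the change described above.

```go
// Illustrative sketch of the call chain inside the event scanner's DML path.
func (p *dmlProcessor) appendRow(raw *common.RawKVEntry) error {
    // The filter handed down from eventScanner is applied while the DMLEvent
    // is still being assembled from raw KV entries, not after the fact.
    return p.currentDML.AppendRow(raw, p.mounter.DecodeToChunk, p.filter)
}
```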

6. Code Structure and Type Centralization (pkg/common/types.go)

  • The RowType enum (RowTypeInsert, RowTypeUpdate, RowTypeDelete) was moved from dml_event.go to the more generic pkg/common/types.go package.
  • Explanation: This is a classic code organization improvement. RowType is a fundamental type used across multiple modules (filter, sink, event); elevating it to a common package prevents circular dependencies and improves the overall code structure (see the sketch after this list).
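
For reference, the relocated type is a small enum along these lines. The constant names come from the description above, while the underlying type and concrete values are assumptions.

```go
// pkg/common/types.go: RowType now lives with other shared primitives, so the
// filter, sink, and event packages can all import it without cycles.
type RowType byte

const (
    // RowTypeInsert is an insert row (new row image only).
    RowTypeInsert RowType = iota
    // RowTypeUpdate is an update row (both pre- and post-images).
    RowTypeUpdate
    // RowTypeDelete is a delete row (pre-image only).
    RowTypeDelete
)
```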

7. Comprehensive Unit Testing

  • New Test Files: Dedicated tests for the new filtering logic were added, including pkg/filter/expr_filter_test.go and pkg/filter/sql_event_filter_test.go.
  • Updated Existing Tests: A large number of existing tests across the codebase (e.g., filter_test.go, event_scanner_test.go, sink tests) were updated to align with the new function signatures and filtering logic.
  • Explanation: All core functionality and interface changes are covered by extensive unit tests, ensuring the quality and correctness of this significant refactoring.

Check List

Tests

  • Unit test
  • Integration test
    Add event_filter to the light group of mysql, kafka, pulsar and storage.
  • Manual test
    Deployed upstream and downstream clusters locally and tested various functions of the filter.

Questions

Will it cause performance regression or break compatibility?

No.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Add an event filter that prevents specific DML and DDL events from being replicated downstream.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-triage-completed, release-note, and size/XXL labels Aug 6, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the event filtering mechanism, moving the filtering logic earlier in the data processing pipeline to improve efficiency. It introduces a new, streamlined Filter interface that operates on lower-level data primitives, enabling a "filter-while-building" model for DML events. This enhancement provides more precise control over data replication by discarding unwanted events at their source, reducing computational and memory overhead.

Highlights

  • Core Filter Interface Refactoring: The central Filter interface has been fundamentally redesigned. ShouldIgnoreDMLEvent is replaced by ShouldIgnoreDML which operates on chunk.Row and common.RowType, and ShouldIgnoreDDLEvent/ShouldDiscardDDL are merged into a single ShouldIgnoreDDL method. This change enables earlier and more efficient filtering.
  • Early DML Filtering Integration: The DMLEvent.AppendRow method now accepts a filter.Filter instance, allowing DML events to be filtered immediately upon row appending. If a row is filtered out, it's discarded before full decoding and batching, significantly optimizing resource usage by implementing a 'filter-while-building' model.
  • Enhanced Expression-Based Filtering: The dmlExprFilter now uses table IDs and versions for caching table schemas, making it more robust against table renames and ensuring accurate filtering rules even when table metadata changes.
  • Streamlined SQL Event Filtering: The sqlEventFilter has been adapted to the new Filter interface, accepting more primitive arguments for DDL and DML events. This reduces coupling between components and simplifies the filtering API.
  • Deep Integration into Event Processing Pipeline: The filtering logic is now deeply integrated into the eventScanner and dmlProcessor components. This ensures that filtering occurs at the earliest possible stage of data retrieval and event assembly, maximizing efficiency.
  • Code Structure Improvement: The RowType enum, representing DML event types, has been moved from dml_event.go to a more generic pkg/common/types.go package. This centralizes common types and resolves potential circular dependencies.
  • Comprehensive Testing: New dedicated unit tests have been added for the updated filtering logic (pkg/filter/expr_filter_test.go, pkg/filter/sql_event_filter_test.go), and a large number of existing tests across the codebase have been updated to align with the new function signatures and filtering behavior.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a significant and well-executed refactoring of the event filtering mechanism. Moving the filtering logic to an earlier stage in the event processing pipeline is a solid approach for improving performance. The new Filter interface, operating on chunk.Row, is a key enabler for this and is cleanly implemented. The switch to using table IDs for caching in dmlExprFilter is a great improvement for robustness. The code is well-structured, and the addition of comprehensive unit tests is commendable.

I've found a couple of potential issues, one in the filter-helper tool and a logic concern in DDL filtering, along with some suggestions for code cleanup.

@wlwilliamx
Collaborator Author

/check-issue-triage-complete

@wlwilliamx
Collaborator Author

/CC @lidezhu @asddongmen

@ti-chi-bot ti-chi-bot bot requested review from asddongmen and lidezhu August 6, 2025 08:18
@wlwilliamx wlwilliamx self-assigned this Aug 6, 2025
@asddongmen
Collaborator

Please fix all failed tests.

@ti-chi-bot ti-chi-bot bot added the lgtm label Aug 21, 2025

ti-chi-bot bot commented Aug 21, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-08-21 14:30:35.456736559 +0000 UTC m=+536643.399912075: ☑️ agreed by flowbehappy.


ti-chi-bot bot commented Aug 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flowbehappy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Aug 21, 2025
@wlwilliamx
Collaborator Author

/test pull-error-log-review

@ti-chi-bot ti-chi-bot bot merged commit 7baacdc into pingcap:master Aug 21, 2025
16 checks passed