Conversation

@Gezi-lzq Gezi-lzq (Contributor) commented Aug 31, 2025

Config

  • (Add) automq.table.topic.convert.value.type (String)

    • Explanation: Specifies how to parse Kafka record values. Supported values are raw, string, by_schema_id, and by_latest_schema. A Schema Registry URL is required for by_schema_id and by_latest_schema.
  • (Add) automq.table.topic.convert.key.type (String)

    • Explanation: Specifies how to parse Kafka record keys. Supported values are raw, string, by_schema_id, and by_latest_schema. A Schema Registry URL is required for by_schema_id and by_latest_schema.
  • (Add) automq.table.topic.convert.value.subject (String, Optional)

    • Explanation: The Schema Registry subject name for value schemas. Defaults to {topic-name}-value if not specified.
  • (Add) automq.table.topic.convert.value.message.full.name (String, Optional)

    • Explanation: The fully qualified message name for Protobuf value schemas, used when the schema contains multiple message types. Defaults to the first message type.
  • (Add) automq.table.topic.convert.key.subject (String, Optional)

    • Explanation: The Schema Registry subject name for key schemas. Defaults to {topic-name}-key if not specified.
  • (Add) automq.table.topic.convert.key.message.full.name (String, Optional)

    • Explanation: The fully qualified message name for Protobuf key schemas, used when the schema contains multiple message types. Defaults to the first message type.
  • (Add) automq.table.topic.transform.value.type (String)

    • Explanation: The transformation to apply to the record value after conversion. Supported values are none, flatten, and flatten_debezium.
      • none: No transformation is applied.
      • flatten: Extracts fields from structured records, promoting nested fields to the top level.
      • flatten_debezium: Processes Debezium CDC events, extracting the before/after state based on the operation type.
  • (Deprecated) automq.table.topic.schema.type (String)

    • Explanation: This configuration is deprecated. Use separate converter and transform configurations instead.
    • Migration: schema.type=schema maps to convert.value.type=by_schema_id plus transform.value.type=flatten (see the example after this list).

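As a concrete illustration of these settings and the migration path above, here is a minimal sketch that applies them to an existing topic with Kafka's AdminClient. The topic name and bootstrap address are illustrative; the property names and values are the ones introduced in this PR.

import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TableTopicConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // Equivalent of the deprecated automq.table.topic.schema.type=schema:
            List<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry(
                    "automq.table.topic.convert.value.type", "by_schema_id"),
                    AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry(
                    "automq.table.topic.transform.value.type", "flatten"),
                    AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}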
Changelist:

  • Add RecordProcessorFactory to support dynamic creation of processors based on convert and transform configs
  • Introduce ConverterFactory for unified converter instantiation and management
  • Replace RegistryConverterFactory with more flexible ConverterFactory implementation
  • Add SchemaFormat enum to support different schema formats (Avro, Protobuf, Raw, String)
  • Implement TableTopicConvertType and TableTopicTransformType enums for config-driven processing (a simplified sketch follows this list)
  • Enhance Converter interface to support separate key/value conversion with unified RecordData output
  • Add RecordAssembler for assembling processed records into final output format
  • Refactor AvroRegistryConverter and ProtobufRegistryConverter to work with new converter architecture
  • Add StringConverter for simple string-based conversions
  • Replace ValueUnwrapTransform with FlattenTransform for improved field extraction
  • Update DebeziumUnwrapTransform with enhanced CDC event processing
  • Add SchemalessTransform for handling raw record transformations
  • Remove obsolete KafkaRecordTransform and related classes
  • Update TopicConfig with new converter and transform configuration options
  • Add comprehensive unit tests for new processor factory and transforms
  • Include protobuf test schema and related test utilities
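To make the config-driven wiring concrete, the following is a minimal, self-contained sketch of how a processor could be assembled from the convert and transform types. All names below are simplified stand-ins for illustration, not the actual classes added in this PR.

import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class ProcessorFactorySketch {
    // Simplified stand-ins for the enums named in the changelist.
    enum ConvertType { RAW, STRING, BY_SCHEMA_ID, BY_LATEST_SCHEMA }
    enum TransformType { NONE, FLATTEN, FLATTEN_DEBEZIUM }

    interface Converter { Object convert(byte[] payload); }
    interface RecordProcessor { Object process(byte[] key, byte[] value); }

    static Converter converterFor(ConvertType type) {
        switch (type) {
            case RAW:    return payload -> payload;                                  // pass bytes through
            case STRING: return payload -> new String(payload, StandardCharsets.UTF_8);
            default:     // BY_SCHEMA_ID / BY_LATEST_SCHEMA would consult a Schema Registry
                throw new UnsupportedOperationException("registry lookup omitted in this sketch");
        }
    }

    static Function<Object, Object> transformFor(TransformType type) {
        // Placeholder only: FLATTEN would promote nested fields to the top level
        // (e.g. {"a":{"b":1}} becomes a record with field "a.b"), and FLATTEN_DEBEZIUM
        // would extract the before/after state of a CDC event.
        return value -> value;
    }

    static RecordProcessor create(ConvertType convert, TransformType transform) {
        Converter converter = converterFor(convert);
        Function<Object, Object> t = transformFor(transform);
        return (key, value) -> t.apply(converter.convert(value));
    }
}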

@Gezi-lzq Gezi-lzq force-pushed the feat/icberg-writer branch 4 times, most recently from 5211a93 to 80502e9, on August 31, 2025 at 11:59
… conversion/transform pipeline

- Add RecordProcessorFactory to support dynamic creation of processors based on schema and transform configs
- Refactor RegistryConverterFactory for improved schema format handling and converter instantiation
- Implement SchemaFormat, TableTopicConvertType, and TableTopicTransformType enums for config-driven processing
- Enhance Converter interface and conversion records to include key, value, and timestamp fields
- Refactor AvroRegistryConverter and ProtobufRegistryConverter to return unified RecordData objects
- Add ProtoToAvroConverter for robust Protobuf-to-Avro conversion
- Update transform chain: add KafkaMetadataTransform for metadata enrichment, refactor DebeziumUnwrapTransform
- Update DefaultRecordProcessor and TransformContext to support partition-aware processing
- Improve error handling and code clarity across conversion and transform modules
@Gezi-lzq Gezi-lzq force-pushed the feat/icberg-writer branch from 80502e9 to 0b078da on August 31, 2025 at 12:18
@Gezi-lzq Gezi-lzq changed the title from "feat(process): introduce flexible record processor factory and enrich conversion/transform pipeline" to "feat(process): introduce record processor factory and enrich conversion/transform pipeline" on Aug 31, 2025
List<Schema.Field> originalFields = new ArrayList<>();
for (Schema.Field field : originalSchema.getFields()) {
// Create a new Schema.Field instance for each original field
originalFields.add(new Schema.Field(field.name(), field.schema(), field.doc(), field.defaultVal(), field.order()));
Collaborator

Why do we need create a new Schema.Field here instead of reusing the old Schema.Field?

@Gezi-lzq Gezi-lzq (Contributor, Author) commented Sep 3, 2025

The newly created Field must be used here; reusing the existing one fails because its position has already been set (f.position != -1):

for (Field f : fields) {
    if (f.position != -1) {
        throw new AvroRuntimeException("Field already used: " + f);
    }
    ....
}

https://github.com/apache/avro/blob/c53857b8c9694c2f9b8cb071dd2ef617d5cda8b7/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L966-L986
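For illustration, here is a small standalone reproduction of this behavior against the Avro API (the record and field names are made up for the demo):

import java.util.List;

import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class FieldReuseDemo {
    public static void main(String[] args) {
        Schema original = SchemaBuilder.record("Original").fields()
                .requiredString("name").endRecord();
        // This Field is already attached to a schema, so its position is set.
        Schema.Field used = original.getField("name");

        try {
            Schema.createRecord("Copy", null, null, false, List.of(used));
        } catch (AvroRuntimeException e) {
            System.out.println(e.getMessage()); // Field already used: ...
        }

        // A freshly constructed Field starts with position == -1 and attaches cleanly.
        Schema.Field fresh = new Schema.Field(used.name(), used.schema(),
                used.doc(), used.defaultVal(), used.order());
        Schema copy = Schema.createRecord("Copy", null, null, false, List.of(fresh));
        System.out.println(copy.getFields());
    }
}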

@Gezi-lzq Gezi-lzq force-pushed the feat/icberg-writer branch 2 times, most recently from 6301e2f to 046fd8f, on September 3, 2025 at 12:11
@Gezi-lzq Gezi-lzq requested a review from Copilot September 3, 2025 12:28
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive record processing factory and transformation pipeline that replaces the previous simplified schema-based approach with a flexible converter/transform architecture. The changes add support for granular configuration of key/value conversion and transformation types while maintaining backward compatibility with deprecated schema-based configurations.

Key changes:

  • Introduced a RecordProcessorFactory for dynamic processor creation based on configuration
  • Added new converter types (RAW, STRING, BY_SCHEMA_ID, BY_LATEST_SCHEMA) and transform types (NONE, FLATTEN, FLATTEN_DEBEZIUM)
  • Replaced the monolithic schema type configuration with separate key/value conversion and transformation settings

Reviewed Changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 3 comments.

Summary per file:

  • LogConfig.java: Added new table topic converter and transform configuration options
  • TableTopicConvertType.java: Enum defining converter types (raw, string, schema-based)
  • TableTopicTransformType.java: Enum defining transformation types (none, flatten, debezium)
  • TableTopicSchemaType.java: Added a NONE option to the existing schema type enum
  • RecordProcessorFactoryTest.java: Comprehensive test suite for the new processor factory
  • Various converter classes: New converter implementations for different data types
  • Various transform classes: Updated and new transform implementations
  • Test classes: Updated to work with the new architecture
Comments suppressed due to low confidence (1)

core/src/test/java/kafka/automq/table/process/RecordProcessorFactoryTest.java:1

  • [nitpick] Inconsistent case handling for enum names. Other test methods use the enum name directly (e.g., TableTopicTransformType.NONE.name) while this line applies toLowerCase(). Consider using a consistent approach throughout the test class.


@Gezi-lzq Gezi-lzq merged commit 01e5371 into main Sep 5, 2025
6 checks passed
@Gezi-lzq Gezi-lzq deleted the feat/icberg-writer branch September 5, 2025 02:29
Gezi-lzq added a commit that referenced this pull request Sep 5, 2025
…on/transform pipeline (#2796)

* feat(process): introduce flexible record processor factory and enrich conversion/transform pipeline

- Add RecordProcessorFactory to support dynamic creation of processors based on schema and transform configs
- Refactor RegistryConverterFactory for improved schema format handling and converter instantiation
- Implement SchemaFormat, TableTopicConvertType, and TableTopicTransformType enums for config-driven processing
- Enhance Converter interface and conversion records to include key, value, and timestamp fields
- Refactor AvroRegistryConverter and ProtobufRegistryConverter to return unified RecordData objects
- Add ProtoToAvroConverter for robust Protobuf-to-Avro conversion
- Update transform chain: add KafkaMetadataTransform for metadata enrichment, refactor DebeziumUnwrapTransform
- Update DefaultRecordProcessor and TransformContext to support partition-aware processing
- Improve error handling and code clarity across conversion and transform modules
Gezi-lzq added a commit that referenced this pull request Sep 5, 2025
…on/transform pipeline (#2796)
superhx pushed a commit that referenced this pull request Sep 5, 2025
…on/transform pipeline (#2796)