[ISSUE-393] Add documentation for mongodb-cdc connector #395
Conversation
Force-pushed 2d76232 to 9388a3c
docs/content/tutorials/tutorial.md (Outdated)
```yaml
    - "27017:27017"
  environment:
    - MONGO_INITDB_ROOT_USERNAME=mongouser
    - MONGO_INITDB_ROOT_PASSWORD=mongopw
```
I have updated the existing tutorial pages to have a better meaningful title. You can create a new page for mongodb, the title can be "Streaming ETL from MongoDB to Elasticsearch"
Thanks @wuchong, I'll move the tutorial to that.
Besides, you can preview the docs locally by following the instructions at https://github.com/ververica/flink-cdc-connectors/blob/master/docs/README.md
Force-pushed c7257ad to 228b5b1
Force-pushed 228b5b1 to 6fad791
Thanks for the work @Jiabao-Sun , I left some comments.
This can improve the use of indexes by the copying manager and make copying more efficient.

In the following example, the $match aggregation operator ensures that only documents in which the closed field is set to false are copied.

`copy.existing.pipeline=[ { "$match": { "closed": "false" } } ]`
Fixed.
- For MongoDB's Kafka Connector, it subscribes to the `Change Stream` of MongoDB.

MongoDB's `oplog.rs` collection doesn't keep the changed record's update before state, so it's hard to extract the full document state by a single `oplog.rs` record and convert it to change log stream accepted by Flink (Insert Only, Upsert, All).
Additionally, MongoDB 5 (released in July) has changed the oplog format, so the current Debezium connector cannot be used with it.
released in July 2021.
Fixed.
```html
<table class="colwidths-auto docutils">
  <thead>
    <tr>
      <th class="text-left">BSON type<a href="https://docs.mongodb.com/manual/reference/bson-types/"></a></th>
```
Explain a bit why we map from BSON type at the front?
Fixed.
```yaml
postgres:
  image: debezium/example-postgres:1.1
  ports:
    - "5432:5432"
  environment:
    - POSTGRES_PASSWORD=1234
    - POSTGRES_DB=postgres
    - POSTGRES_USER=postgres
    - POSTGRES_PASSWORD=postgres
mysql:
  image: debezium/example-mysql:1.1
  ports:
    - "3306:3306"
  environment:
    - MYSQL_ROOT_PASSWORD=123456
    - MYSQL_USER=mysqluser
    - MYSQL_PASSWORD=mysqlpw
```
Could you remove all content about postgres and mysql because they are not needed?
ok
Fixed.
```java
.password("flinkpw")
.database("inventory")
.collection("products")
.deserializer(new StringDebeziumDeserializationSchema())
```
We are going to replace the recommended deserializer with JsonDebeziumDeserializationSchema, which is more user-friendly. Could you take a look at #410 and update this part?
Fixed.
2. Remove all content about postgres and mysql in mongodb-tutorials
3. Brief introduction to type conversion
4. Use JsonDebeziumDeserializationSchema instead of StringDebeziumDeserializationSchema

Hi @wuchong, the problems you mentioned above have been fixed.
Thanks for the great work @Jiabao-Sun .
LGTM.
MongoDB CDC Connector
The MongoDB CDC connector allows for reading snapshot data and incremental data from MongoDB. This document describes how to set up the MongoDB CDC connector to run SQL queries against MongoDB.
Dependencies
In order to setup the MongoDB CDC connector, the following table provides dependency information for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
Maven dependency
SQL Client JAR
Download flink-sql-connector-mongodb-cdc-2.1.0.jar and put it under `<FLINK_HOME>/lib/`.

Setup MongoDB
Availability

MongoDB version

MongoDB version >= 3.6 is required.
We use the change streams feature (new in version 3.6) to capture change data.

Cluster Deployment

Replica sets or sharded clusters are required.

Storage Engine

The WiredTiger storage engine is required.

Replica set protocol version

Replica set protocol version 1 (pv1) is required.
Starting in version 4.0, MongoDB only supports pv1. pv1 is the default for all new replica sets created with MongoDB 3.2 or later.

Privileges

The `changeStream` and `read` privileges are required by the MongoDB Kafka Connector. You can use the following example for simple authorization.
For more detailed authorization, please refer to MongoDB Database User Roles.
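A minimal sketch of such a grant in the mongo shell; the user name `flinkuser` and password `flinkpw` are illustrative (they match credentials appearing elsewhere in this PR), and the exact roles your deployment needs may differ:

```javascript
// Run in the mongo shell against the admin database.
// The built-in "read" role includes the changeStream action on the
// collections it covers; readAnyDatabase covers the initial snapshot.
use admin;
db.createUser({
  user: "flinkuser",
  pwd: "flinkpw",
  roles: [
    { role: "read", db: "admin" },
    { role: "readAnyDatabase", db: "admin" }
  ]
});
```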
How to create a MongoDB CDC table
The MongoDB CDC table can be defined as follows:
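A hedged sketch of such a DDL, using the `inventory.products` example from elsewhere in this PR; column names and connection values are illustrative:

```sql
-- Sketch only: column names are hypothetical; _id must be the primary key.
CREATE TABLE products (
  _id STRING,
  name STRING,
  weight DECIMAL(10, 3),
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = 'localhost:27017',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database' = 'inventory',
  'collection' = 'products'
);
```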
Note that:

- MongoDB's change event record doesn't have an update-before message, so we can only convert it to Flink's UPSERT changelog stream.
- An upsert stream requires a unique key, so we must declare `_id` as the primary key. We can't declare any other column as the primary key, because the delete operation doesn't contain the key and value of columns other than `_id` and the sharding key.
Connector Options

- `connector`: required; must be set to `'mongodb-cdc'`.
- `hosts`: the host and port pairs of the MongoDB servers, e.g. `localhost:27017,localhost:27018`.
- `username`: required only when MongoDB is configured to use authentication.
- `password`: required only when MongoDB is configured to use authentication.
- `connection.options`: optional connection options of MongoDB, e.g. `replicaSet=test&connectTimeoutMS=300000`.
- `errors.tolerance`: either `none` or `all`. When set to `none`, the connector reports an error and blocks further processing of the rest of the records when it encounters an error. When set to `all`, the connector silently ignores any bad messages.
- `copy.existing.pipeline`: an aggregation pipeline applied while copying existing data. This can improve the use of indexes by the copying manager and make copying more efficient. E.g. `[{"$match": {"closed": "false"}}]` ensures that only documents in which the `closed` field is set to `false` are copied.

Note: it is highly recommended to set `heartbeat.interval.ms` to a proper value larger than 0 if the collection changes slowly. The heartbeat event can push the `resumeToken` forward, preventing the `resumeToken` from expiring when we recover the Flink job from a checkpoint or savepoint.

Features
Exactly-Once Processing
The MongoDB CDC connector is a Flink Source connector which reads the database snapshot first and then continues to read change stream events, with exactly-once processing even when failures happen.
Snapshot When Startup Or Not
The config option `copy.existing` specifies whether to do a snapshot when the MongoDB CDC consumer starts up. Defaults to `true`.

Snapshot Data Filters

The config option `copy.existing.pipeline` describes the filters to apply when copying existing data. This can improve the use of indexes by the copying manager and make copying more efficient.

In the following example, the `$match` aggregation operator ensures that only documents in which the `closed` field is set to `false` are copied.

`copy.existing.pipeline=[ { "$match": { "closed": "false" } } ]`
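To make the semantics of that pipeline concrete, the following plain JavaScript sketch mimics what the `$match` stage does to a small set of documents. This is an illustration only, not MongoDB's aggregation engine; the documents are invented:

```javascript
// Documents as the copying manager might see them; `closed` is stored
// as a string here, matching the example pipeline above.
const docs = [
  { _id: 1, name: "order-a", closed: "false" },
  { _id: 2, name: "order-b", closed: "true" },
  { _id: 3, name: "order-c", closed: "false" },
];

// Simulate [{ "$match": { "closed": "false" } }]: keep only documents
// whose fields equal every value in the match expression.
const matchExpr = { closed: "false" };
const copied = docs.filter((doc) =>
  Object.entries(matchExpr).every(([field, value]) => doc[field] === value)
);

console.log(copied.map((doc) => doc._id)); // [ 1, 3 ]
```

Only the matching documents are copied during the snapshot phase; the non-matching document (`_id: 2`) is skipped entirely rather than filtered downstream.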
Change Streams
We integrate MongoDB's official Kafka Connector to read snapshot and change events from MongoDB, and drive it by Debezium's `EmbeddedEngine`.

Debezium's `EmbeddedEngine` provides a mechanism for running a single Kafka Connect `SourceConnector` within an application's process, and it can properly drive any standard Kafka Connect `SourceConnector`, even one not provided by Debezium.

We chose MongoDB's official Kafka Connector instead of Debezium's MongoDB Connector because they use different change data capture mechanisms:

- Debezium's MongoDB Connector reads the `oplog.rs` collection of each replica set's master node.
- MongoDB's official Kafka Connector subscribes to the `Change Stream` of MongoDB.

MongoDB's `oplog.rs` collection doesn't keep the changed record's update-before state, so it's hard to extract the full document state from a single `oplog.rs` record and convert it to the changelog stream accepted by Flink (Insert Only, Upsert, All). Additionally, MongoDB 5 (released in July 2021) has changed the oplog format, so the current Debezium connector cannot be used with it.

Change Stream is a feature introduced in MongoDB 3.6 for replica sets and sharded clusters that allows applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.

Lookup Full Document for Update Operations is a feature provided by Change Stream which can configure the change stream to return the most current majority-committed version of the updated document. Because of this, we can easily collect the latest full document and convert the changelog to Flink's Upsert changelog stream.

By the way, Debezium's MongoDB change streams exploration mentioned in DBZ-435 is on their roadmap. If it's done, we can support both connectors and let users choose.
DataStream Source
The MongoDB CDC connector can also be used as a DataStream source. You can create a SourceFunction as follows:
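Based on the builder calls quoted in the review diff above and the reviewer's suggestion to use `JsonDebeziumDeserializationSchema`, the snippet can be sketched roughly as follows; the `hosts` and credential values are illustrative:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.ververica.cdc.connectors.mongodb.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class MongoDBSourceExample {
  public static void main(String[] args) throws Exception {
    // Build a MongoDB CDC source; host and credentials are illustrative
    // and must point at a real replica set or sharded cluster.
    SourceFunction<String> sourceFunction = MongoDBSource.<String>builder()
        .hosts("localhost:27017")
        .username("flinkuser")
        .password("flinkpw")
        .database("inventory")
        .collection("products")
        // Emits change events as JSON strings, per the review suggestion.
        .deserializer(new JsonDebeziumDeserializationSchema())
        .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.addSource(sourceFunction).print().setParallelism(1);
    env.execute("MongoDB CDC Example");
  }
}
```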
Data Type Mapping
The recoverable entries of the original mapping table include:

| BSON type | Flink SQL type |
|-----------|----------------|
| Date | `TIMESTAMP_LTZ(3)` |
| Timestamp | `TIMESTAMP_LTZ(0)` |
| ObjectId | `STRING` |
| UUID | `STRING` |
| Symbol | `STRING` |
| MD5 | `STRING` |
| JavaScript | `STRING` |
| Regex | `STRING` |
| Line | `ROW<type STRING, coordinates ARRAY<ARRAY<DOUBLE>>>` |
| ... | ... |
Reference