
Conversation

Contributor

@Jiabao-Sun Jiabao-Sun commented Jun 29, 2021

Support MongoDB CDC connector of Flink 1.13.0 #11

Summary

Use Debezium's EmbeddedEngine to drive a SourceConnector of the MongoDB Kafka Connector.
The MongoDB Kafka Connector's CDC mechanism is MongoDB Change Streams, which are available for replica sets and sharded clusters.

For fully detailed documentation of the MongoDB Kafka Connector, please see the Documentation.
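
A minimal sketch of that mechanism, assuming mongo-kafka's com.mongodb.kafka.connect.MongoSourceConnector class and Debezium's engine API (an illustration only, not this connector's actual wiring):

import java.util.Properties;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class EmbeddedEngineSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Run MongoDB Kafka Connector's source connector inside the embedded Debezium engine.
        props.setProperty("name", "mongodb-cdc-sketch");
        props.setProperty("connector.class", "com.mongodb.kafka.connect.MongoSourceConnector");
        props.setProperty("connection.uri", "mongodb://flinkuser:flinkpw@127.0.0.1:27017");
        props.setProperty("database", "inventory");
        props.setProperty("collection", "products");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.MemoryOffsetBackingStore");

        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying(record -> System.out.println(record.value()))
                        .build()) {
            engine.run(); // blocks until the engine is stopped
        }
    }
}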


Example

Prerequisites

  • MongoDB version > 3.6.
  • MongoDB runs in replica set or sharded cluster mode.
  • changeStream and read privileges are required. See Database-User-Roles.

Create a user:

use admin;
db.createUser(
 {
   user: "flinkuser",
   pwd: "flinkpw",
   roles: [
      { role: "read", db: "admin" },
      { role: "readAnyDatabase", db: "admin" }
   ]
 }
);

Flink SQL Example:

CREATE TABLE mongodb_source (
   _id STRING PRIMARY KEY NOT ENFORCED, -- upsert stream, primary key required
   name STRING,
   description STRING,
   weight DECIMAL(10,3)
) WITH (
   'connector' = 'mongodb-cdc',
   'hosts' = '127.0.0.1:27017',
   'username' = 'flinkuser',
   'password' = 'flinkpw',
   'database' = 'inventory',
   'collection' = 'products'
);

SELECT _id, name, description, weight FROM mongodb_source;

DataStream Example:

import com.alibaba.ververica.cdc.connectors.mongodb.MongoDBSource;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class MongoDBSourceExample {
    public static void main(String[] args) throws Exception {
        SourceFunction<String> sourceFunction = MongoDBSource.<String>builder()
                .hosts("127.0.0.1:27017")
                .username("flinkuser")
                .password("flinkpw")
                .database("inventory")
                .collection("products")
                .deserializer(new StringDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(sourceFunction)
                .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute();
    }
}

Connector Options

| Name | Description | Required | Default |
| --- | --- | --- | --- |
| hosts | The comma-separated list of hostname and port pairs of the MongoDB servers, e.g. localhost:27017,localhost:27018. | true | - |
| username | Name of the database user to be used when connecting to MongoDB. Required only when MongoDB is configured to use authentication. | false | - |
| password | Password to be used when connecting to MongoDB. Required only when MongoDB is configured to use authentication. | false | - |
| database | Name of the database to watch for changes. | true | - |
| collection | Name of the collection in the database to watch for changes. | true | - |
| connection.options | The ampersand-separated connection options of MongoDB, e.g. replicaSet=test&connectTimeoutMS=300000. | false | - |
| errors.tolerance | Whether to continue processing messages if an error is encountered. | false | none |
| errors.log.enable | Whether details of failed operations should be written to the log file. | false | true |
| copy.existing | Whether to copy existing data from the source collections. | false | true |
| copy.existing.pipeline | An array of JSON objects describing the pipeline operations to run when copying existing data. This can improve the use of indexes by the copying manager and make copying more efficient, e.g. [{"$match": {"closed": "false"}}] ensures that only documents in which the closed field is set to false are copied. | false | - |
| copy.existing.max.threads | The number of threads to use when performing the data copy. | false | Number of processors |
| copy.existing.queue.size | The max size of the queue to use when copying data. | false | 16000 |
| poll.max.batch.size | Maximum number of change stream documents to include in a single batch when polling for new data. | false | 1000 |
| poll.await.time.ms | The amount of time to wait before checking for new results on the change stream. | false | 1500 |
| heartbeat.interval.ms | The length of time in milliseconds between sending heartbeat messages. Use 0 to disable. | false | 0 |

Reference

MongoDB Kafka Connector
MongoDB Change Streams


@Jiabao-Sun
Contributor Author

Hi, Jark. @wuchong
Do you have any suggestions?

@wuchong
Member

wuchong commented Jun 29, 2021

@Jiabao-Sun, I went through the API and examples, and it looks very good. I think it's already in good shape.
I will find time to review it, but I'm quite busy this week.

@carloscbl

I've compiled and tested it in my env and it seems to be working well! Flink 1.13

@Jiabao-Sun
Contributor Author

I've compiled and tested it in my env and it seems to be working well! Flink 1.13

Thanks for your reply.

@carloscbl

org.apache.flink.table.api.ValidationException: Table 'MongoDB-CDC' declares metadata columns, but the underlying DynamicTableSource doesn't implement the SupportsReadingMetadata interface. Therefore, metadata cannot be read from the given source.

Is this intended?

@Jiabao-Sun
Contributor Author

org.apache.flink.table.api.ValidationException: Table 'MongoDB-CDC' declares metadata columns, but the underlying DynamicTableSource doesn't implement the SupportsReadingMetadata interface. Therefore, metadata cannot be read from the given source.

Is this intended?

@carloscbl Currently this ability has not been implemented, but I think it's a helpful suggestion.

We can provide some metadata fields such as those the Debezium Format describes.

| Key | Data Type | Description |
| --- | --- | --- |
| schema | STRING NULL | JSON string describing the schema of the payload. Null if the schema is not included in the Debezium record. |
| ingestion-timestamp | TIMESTAMP_LTZ(3) NULL | The timestamp at which the connector processed the event. Corresponds to the ts_ms field in the Debezium record. |
| source.timestamp | TIMESTAMP_LTZ(3) NULL | The timestamp at which the source system created the event. Corresponds to the source.ts_ms field in the Debezium record. |
| source.database | STRING NULL | The originating database. Corresponds to the source.db field in the Debezium record if available. |
| source.schema | STRING NULL | The originating database schema. Corresponds to the source.schema field in the Debezium record if available. |
| source.table | STRING NULL | The originating database table. Corresponds to the source.table or source.collection field in the Debezium record if available. |

CC @wuchong @leonardBang
If you think this is a necessary feature, I'm happy to open a new PR to deal with it.

@yangxusun9

When I cancel a job that joins a MongoDB CDC stream with a Kafka stream, I find the MongoDB CDC stream can't be closed. There are some pictures to show it:
[screenshot]
TM log:
[screenshot]
Then I checked the pom.xml and found you used mongo-kafka-connect rather than debezium-connector-mongodb. Can you tell me why? I think this is the reason for my issue.

@Jiabao-Sun
Contributor Author

@yangxusun9

Q1.

Maybe related to #72.

Please help check:

  • The snapshot phase is done.
  • Whether the source table is changing rapidly.
  • Whether execution.checkpointing.interval is set too short, causing checkpoints to fail all the time.

Q2.

The Debezium MongoDB connector cannot extract the full document of a changed record,
so it's difficult to transform its output into Flink's upsert stream.
Please see the previous PR for details: PR-123.
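
For context, a hypothetical sketch of why the full document matters: with change streams' updateLookup, every event carries the current document, so each operation type can map straight to a Flink changelog RowKind without an update-before image (the helper below is illustrative, not the connector's actual code):

import org.apache.flink.types.RowKind;

public class OperationTypeMapping {
    // Illustrative helper: with the full document available, updates can be
    // emitted as upserts (UPDATE_AFTER) without needing an UPDATE_BEFORE image.
    static RowKind toRowKind(String operationType) {
        switch (operationType) {
            case "insert":
                return RowKind.INSERT;
            case "update":
            case "replace":
                return RowKind.UPDATE_AFTER;
            case "delete":
                return RowKind.DELETE;
            default:
                throw new IllegalArgumentException("Unknown operation: " + operationType);
        }
    }
}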

@yangxusun9

@Jiabao-Sun Thanks for your reply!
Re Q1: I'm sure the snapshot phase is done, but the source table is changing rapidly. Do you mean it can't be closed if the source table is changing rapidly?

@leonardBang leonardBang linked an issue Aug 20, 2021 that may be closed by this pull request
@Jiabao-Sun
Contributor Author

@Jiabao-Sun Thanks for your reply!
Re Q1: I'm sure the snapshot phase is done, but the source table is changing rapidly. Do you mean it can't be closed if the source table is changing rapidly?

@yangxusun9
If you cancel the Flink job with a savepoint, it must acquire the checkpoint lock first.
Rapid changes or a too-short checkpointing interval may reduce the chance of acquiring the checkpoint lock.
In addition, a long-running savepoint will also increase the time needed to cancel the task.

However, there may be other reasons that cause this problem.
For example, the operators were not interrupted or destroyed as expected.

If you can reproduce this problem, could you provide a detailed log for troubleshooting?

@leonardBang leonardBang requested a review from wuchong August 23, 2021 01:26
@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch 2 times, most recently from 4221a10 to e7bcb88 Compare August 30, 2021 18:20
@Jiabao-Sun
Contributor Author

Upgrade mongo-kafka-connect from 1.4.0 to 1.6.1 and mongo-driver from 4.2.1 to 4.3.1 to support MongoDB 5.0.

@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch from e7bcb88 to 0c5a150 Compare August 31, 2021 05:31
@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch from 0c5a150 to 53ff887 Compare August 31, 2021 07:35
@yangxusun9

@Jiabao-Sun Thanks for your reply!
Re Q1: I'm sure the snapshot phase is done, but the source table is changing rapidly. Do you mean it can't be closed if the source table is changing rapidly?

@yangxusun9
If you cancel the Flink job with a savepoint, it must acquire the checkpoint lock first.
Rapid changes or a too-short checkpointing interval may reduce the chance of acquiring the checkpoint lock.
In addition, a long-running savepoint will also increase the time needed to cancel the task.

However, there may be other reasons that cause this problem.
For example, the operators were not interrupted or destroyed as expected.

If you can reproduce this problem, could you provide a detailed log for troubleshooting?

It is solved; the problem was caused by the platform we use. If you directly call the cancel command, it can be cancelled normally.
Thanks for your help!

@Jiabao-Sun
Contributor Author

@Jiabao-Sun Thanks for your reply!
Re Q1: I'm sure the snapshot phase is done, but the source table is changing rapidly. Do you mean it can't be closed if the source table is changing rapidly?

@yangxusun9
If you cancel the Flink job with a savepoint, it must acquire the checkpoint lock first.
Rapid changes or a too-short checkpointing interval may reduce the chance of acquiring the checkpoint lock.
In addition, a long-running savepoint will also increase the time needed to cancel the task.
However, there may be other reasons that cause this problem.
For example, the operators were not interrupted or destroyed as expected.
If you can reproduce this problem, could you provide a detailed log for troubleshooting?

It is solved; the problem was caused by the platform we use. If you directly call the cancel command, it can be cancelled normally.
Thanks for your help!

Hi @yangxusun9
Thank you for your feedback on this issue.
To prevent others from encountering the same problem, could you tell us which framework you used, Zeppelin or something else?

Member

@wuchong wuchong left a comment

Thanks @Jiabao-Sun for the great and high-quality work! The pull request looks good to me in general. I don't have much knowledge about MongoDB, so I only left some comments from the Flink side.

Cheers!

// then exit the snapshot phase.
if (!isCopying()) {
    outSourceRecords =
            Collections.singletonList(markLastSnapshotRecord(lastSnapshotRecord));
Member

lastSnapshotRecord may be null here?

Contributor Author

Thanks, Jark.
lastSnapshotRecord may be null when the source collection is empty and a short poll waiting time is set.

Contributor Author

Fixed.

// Snapshot Phase Ended, Condition 1:
// Received non-snapshot record, exit snapshot phase immediately.
if (lastSnapshotRecord != null) {
    outSourceRecords.add(markLastSnapshotRecord(lastSnapshotRecord));
Member

Do we emit the lastSnapshotRecord twice here?

Contributor Author

@Jiabao-Sun Jiabao-Sun Sep 3, 2021

Sorry, I checked the code twice and it seems we don't emit the lastSnapshotRecord twice.
The lastSnapshotRecord that is emitted may be a snapshot record from the previous loop or the last snapshot record of the previous batch.
lastSnapshotRecord has been renamed to currentLastSnapshotRecord to eliminate the ambiguity.

"connector.class", MongoDBConnectorSourceConnector.class.getCanonicalName());
props.setProperty("name", "mongodb_binlog_source");

props.setProperty(MongoSourceConfig.CONNECTION_URI_CONFIG, checkNotNull(connectionUri));
Member

It seems we didn't set mongodb.name, which is used to identify the replica set?

https://debezium.io/documentation/reference/1.5/connectors/mongodb.html#mongodb-property-mongodb-name

Contributor Author

@Jiabao-Sun Jiabao-Sun Sep 3, 2021

Sorry Jark, we didn't choose Debezium's MongoDB Connector but MongoDB's Kafka Connector.
https://docs.mongodb.com/kafka-connector/current/kafka-source/

The mongodb.name property is used as a Kafka topic prefix (like topic.prefix in MongoDB's Kafka Connector), and I think it's not required for our Flink CDC connector.

The mechanisms of the two connectors are different:

  • Debezium's MongoDB Connector reads the oplog.rs collection of each replica set's primary node.
  • MongoDB's Kafka Connector subscribes to MongoDB's change streams.

MongoDB's oplog.rs collection does not keep the pre-update state of a changed record, so it's hard to extract the full document state from a single oplog.rs record and convert it into Flink's upsert changelog.
Additionally, MongoDB 5 (released in July) has changed the oplog format, so the current Debezium connector cannot be used with it.

MongoDB's change streams provide an updateLookup feature to return the most current majority-committed version of the updated document, which makes it easy to convert a change stream event into Flink's upsert changelog.
https://docs.mongodb.com/manual/changeStreams/
https://docs.mongodb.com/manual/reference/change-events/
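
A minimal sketch of the updateLookup feature with the plain MongoDB Java driver (assuming the flinkuser credentials from the example above; this is driver usage, not the connector's code):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.bson.Document;

public class ChangeStreamLookupDemo {
    public static void main(String[] args) {
        try (MongoClient client =
                MongoClients.create("mongodb://flinkuser:flinkpw@127.0.0.1:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("inventory").getCollection("products");
            // UPDATE_LOOKUP asks the server to attach the current majority-committed
            // version of the document to every update event.
            for (ChangeStreamDocument<Document> event :
                    products.watch().fullDocument(FullDocument.UPDATE_LOOKUP)) {
                System.out.println(event.getOperationType() + " -> " + event.getFullDocument());
            }
        }
    }
}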

By the way, Debezium's MongoDB change streams exploration is on the roadmap.
Once it's done, we can provide two engines for users to choose from.
https://issues.redhat.com/browse/DBZ-435

Member

Thanks for the detailed explanation.

* A builder to build a SourceFunction which can read snapshot and continue to consume change stream
* events.
*/
public class MongoDBSource {
Member

Add @PublicEvolving annotation.

Contributor Author

Fixed

+ " md5Field STRING,\n"
+ " dateField DATE,\n"
+ " dateBefore1970 DATE,\n"
+ " timestampField TIMESTAMP,\n"
Member

Add test for TIMESTAMP_LTZ as well.

Contributor Author

Fixed

return TimestampData.fromEpochMillis(docObj.asDateTime().getValue());
}
if (docObj.isTimestamp()) {
return TimestampData.fromEpochMillis(docObj.asTimestamp().getTime() * 1000L);
Member

What does the underlying timestamp value mean? Is there any documentation to reference?

Usually, TimestampData.fromEpochMillis is used for TIMESTAMP_LTZ types, where the epoch milliseconds represent a Java Instant. We recommend using TimestampData.fromLocalDateTime for TIMESTAMP types.

Contributor Author

https://docs.mongodb.com/manual/reference/bson-types/

Timestamp

BSON has a special timestamp type for internal MongoDB use and is not associated with the regular Date type. This internal timestamp type is a 64 bit value where:

  • the most significant 32 bits are a time_t value (seconds since the Unix epoch)
  • the least significant 32 bits are an incrementing ordinal for operations within a given second.

The BSON timestamp type is represented by a Java long: the high 32 bits represent Unix epoch seconds, and the low 32 bits represent an incrementing value that keeps timestamps increasing and avoids duplication.


DateTime

BSON Date is a 64-bit integer that represents the number of milliseconds since the Unix epoch (Jan 1, 1970). This results in a representable date range of about 290 million years into the past and future.
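
A small sketch of those two representations using the BSON types from the MongoDB Java driver (the values are made up for illustration):

import org.bson.BsonDateTime;
import org.bson.BsonTimestamp;

public class BsonTimeSketch {
    public static void main(String[] args) {
        // BSON Date: already milliseconds since the Unix epoch.
        BsonDateTime date = new BsonDateTime(1_630_000_000_000L);
        long dateMillis = date.getValue();

        // BSON Timestamp: getTime() is epoch seconds (high 32 bits) and
        // getInc() is an ordinal (low 32 bits), not a time unit, so only
        // second precision survives the conversion to milliseconds.
        BsonTimestamp ts = new BsonTimestamp(1_630_000_000, 7);
        long tsMillis = ts.getTime() * 1000L;

        System.out.println(dateMillis + ", " + tsMillis + ", inc=" + ts.getInc());
    }
}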

Contributor Author

Thanks Jark for the detailed explanation of TIMESTAMP_LTZ.
As the documentation quoted above suggests, they are semantically equivalent:

  • BSON DateTime - TIMESTAMP_LTZ(3)
  • BSON Timestamp - TIMESTAMP_LTZ(0)

We treat BSON Timestamp as TIMESTAMP_LTZ(0) because its minimum precision is seconds and the incrementing value does not represent an exact time interval.

So, I think the transformation in the code works.
If there is a deviation in my understanding, please help correct me.

Member

@wuchong wuchong Sep 6, 2021

Thanks @Jiabao-Sun, I think it makes sense to map BSON DateTime and Timestamp to Flink's TIMESTAMP_LTZ, because they both represent seconds since the epoch.

However, if we map them to Flink's TIMESTAMP (i.e. TIMESTAMP WITHOUT TIME ZONE), then we need a conversion (epoch seconds to local clock time), just like from Java Instant to Java LocalDateTime. This conversion requires a time zone, i.e. which time zone to use when displaying the instant as a string. Usually, the time zone is configured by users and not hard-coded, e.g. table.local-time-zone in Flink SQL and server-time-zone in mysql-cdc.
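
A minimal sketch of that conversion (the zone here is an arbitrary example, not a recommended default):

import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class EpochToLocalSketch {
    public static void main(String[] args) {
        long epochMillis = 1_630_000_000_000L;

        // TIMESTAMP_LTZ keeps the absolute instant as-is.
        Instant instant = Instant.ofEpochMilli(epochMillis);

        // TIMESTAMP (without time zone) needs a user-configured zone to
        // produce a wall-clock value, e.g. the zone from table.local-time-zone.
        ZoneId userZone = ZoneId.of("Asia/Shanghai");
        LocalDateTime wallClock = LocalDateTime.ofInstant(instant, userZone);

        System.out.println(instant + " -> " + wallClock);
    }
}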

Contributor Author

Thanks @wuchong, I'll add a server-time-zone configuration for users to configure.

Contributor Author

Thanks @wuchong for the detailed explanation.

I added a new config option local-time-zone to let users decide which time zone to use when converting to TIMESTAMP.

I think this is semantically closer to table.local-time-zone in Flink SQL than to MySQL's server-time-zone, so I named it local-time-zone.

By the way, is there a method provided by Flink to get table.local-time-zone at runtime, so that we may not need an additional config option?

Member

@wuchong wuchong Sep 8, 2021

Yes, you can get it via org.apache.flink.table.factories.DynamicTableFactory.Context#getConfiguration in the table factory.
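
A minimal sketch of reading it there, assuming Flink's TableConfigOptions.LOCAL_TIME_ZONE option (the helper name is made up):

import java.time.ZoneId;

import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.table.api.config.TableConfigOptions;
import org.apache.flink.table.factories.DynamicTableFactory;

public class LocalTimeZoneLookup {
    // Hypothetical helper, called inside createDynamicTableSource(context).
    static ZoneId resolveLocalTimeZone(DynamicTableFactory.Context context) {
        ReadableConfig config = context.getConfiguration();
        String zone = config.get(TableConfigOptions.LOCAL_TIME_ZONE);
        // The default value "default" means: fall back to the JVM's time zone.
        return TableConfigOptions.LOCAL_TIME_ZONE.defaultValue().equals(zone)
                ? ZoneId.systemDefault()
                : ZoneId.of(zone);
    }
}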

Contributor Author

Thanks @wuchong.

Removed the table option 'local-time-zone'; the time zone from table.local-time-zone is now used at runtime.

@wuchong
Member

wuchong commented Sep 3, 2021

Besides, I created #393 for the mongodb-cdc connector. Would you like to take it as well?

@Jiabao-Sun
Contributor Author

Besides, I created #393 for the mongodb-cdc connector. Would you like to take it as well?

Thanks, I'd be happy to work on it.

@Jiabao-Sun Jiabao-Sun requested a review from wuchong September 3, 2021 19:24
@Jiabao-Sun
Contributor Author

@wuchong

Thanks a lot for the review and helpful suggestions.
I made some changes for the problems you pointed out, but there are still some issues that need to be confirmed.
If you are available, please help review it again.

Best

props.setProperty("name", "mongodb_binlog_source");

props.setProperty(MongoSourceConfig.CONNECTION_URI_CONFIG, checkNotNull(connectionUri));
props.setProperty(MongoSourceConfig.DATABASE_CONFIG, checkNotNull(database));


Hi @Jiabao-Sun, some MongoDB deployments need an account and password to connect. Doesn't this need to consider setting the account and password?

Contributor Author

Hi @XuQianJin-Stars, please use this pattern: mongodb://username:password@localhost/


MongoDBSource.Builder<RowData> builder =
MongoDBSource.<RowData>builder()
.connectionUri(uri)


@Jiabao-Sun There is also the URI here. For a sharded cluster, there will be multiple ip:port pairs.

Contributor Author

Hi @XuQianJin-Stars,
The uri represents MongoDB's connection string; you can see https://docs.mongodb.com/manual/reference/connection-string/#standard-connection-string-format for details.

For a replica set:
mongodb://mongodb1.example.com:27317,mongodb2.example.com:27017/?replicaSet=mySet

For a sharded cluster:
mongodb://mongos0.example.com:27017,mongos1.example.com:27017,mongos2.example.com:27017

Member

@wuchong wuchong Sep 6, 2021

This URI pattern looks a little complicated. I'm afraid most MongoDB newbies can't set a correct URI on the first try.
Can we decouple the different parts of the URI? e.g. hosts, replica-set, username, password.

Decoupling also makes things more flexible, e.g. we may want to mask secrets in logging and display in the future.


Hi @Jiabao-Sun I also think it is better to set the username and password separately.

Contributor Author

Thanks, I will add separate configurations for that.

Contributor Author

@wuchong @XuQianJin-Stars

New config options have been added to replace the uri.

  • hosts: The comma-separated list of hostname and port pairs of the MongoDB servers.
  • user: Database user to be used when connecting to MongoDB.
  • password: Password to be used when connecting to MongoDB.
  • mongodb.replicaset: Specifies the name of the replica set. It is not required, but explicitly stating the replica set of the servers configured by hosts can speed up connection times.
  • mongodb.authsource: Required only when MongoDB is configured to use authentication with an authentication database other than admin. Defaults to admin.
  • mongodb.connect.timeout.ms: The time in milliseconds to attempt a connection before timing out. Defaults to 10000 (10 seconds).
  • mongodb.socket.timeout.ms: The time in milliseconds to attempt a send or receive on a socket before the attempt times out. Defaults to 0.
  • mongodb.ssl.enabled: Whether the connector will use SSL to connect to MongoDB instances.
  • mongodb.ssl.invalid.hostname.allowed: When SSL is enabled, this setting controls whether strict hostname checking is disabled during the connection phase.

@yangxusun9

@Jiabao-Sun Thanks for your reply!
Re Q1: I'm sure the snapshot phase is done, but the source table is changing rapidly. Do you mean it can't be closed if the source table is changing rapidly?

@yangxusun9
If you cancel the Flink job with a savepoint, it must acquire the checkpoint lock first.
Rapid changes or a too-short checkpointing interval may reduce the chance of acquiring the checkpoint lock.
In addition, a long-running savepoint will also increase the time needed to cancel the task.
However, there may be other reasons that cause this problem.
For example, the operators were not interrupted or destroyed as expected.
If you can reproduce this problem, could you provide a detailed log for troubleshooting?

It is solved; the problem was caused by the platform we use. If you directly call the cancel command, it can be cancelled normally.
Thanks for your help!

Hi @yangxusun9
Thank you for your feedback on this issue.
To prevent others from encountering the same problem, could you tell us which framework you used, Zeppelin or something else?

Flink-VVP
Here is a link to the documentation: https://help.aliyun.com/document_detail/169590.html

@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch 2 times, most recently from 405c28a to 7a339ab Compare September 8, 2021 11:04
@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch from 7a339ab to 3190591 Compare September 8, 2021 11:49
@Jiabao-Sun
Contributor Author

@wuchong @XuQianJin-Stars
The problems mentioned above have been solved.
If you are available, please help review it again.

+ "eg. localhost:27017,localhost:27018");

private static final ConfigOption<String> USER =
ConfigOptions.key("user")
Member

How about username, to keep it aligned with the other connectors in this repo?

Contributor Author

Fixed

"Name of the collection in the database to watch for changes.");

private static final ConfigOption<String> REPLICA_SET =
ConfigOptions.key("mongodb.replicaset")
Member

My intention was just to separate the secret options. It seems we introduced a bunch of options, and there are more to come.

To keep the connector options concise, flexible, and powerful, what do you think about just introducing a connection.options? It can set one or more arbitrary MongoDB connection options [1].

Connection options are pairs in the following form: name=value. Separate options with the ampersand (i.e. &) character: name1=value1&name2=value2. In the following example, a connection includes the replicaSet and connectTimeoutMS options:

'connection.options' = 'replicaSet=test&connectTimeoutMS=300000'

In the documentation, we can put the MongoDB connection options link [1] when describing this option.

[1] https://docs.mongodb.com/manual/reference/connection-string/#std-label-connections-connection-options

Contributor Author

@Jiabao-Sun Jiabao-Sun Sep 9, 2021

Sorry @wuchong,

I think this new ampersand-style config option may confuse MongoDB expert users, who may be more used to connection strings.

I have two proposals.

  1. Use mongodb.connect. as a config prefix to mark connection options.
For example:

CREATE TABLE WITH (
     hosts = 'localhost:27017,localhost:27018',
     username = 'user',
     password = 'pwd',
     mongodb.connect.authsource = 'nonadmin',
     mongodb.connect.replicaset = 'rs',
     mongodb.connect.ssl = 'false'
);

  2. Provide a config option uri as an alternative to hosts.
We can directly use a uri with connection parameters and append the `username` and `password` as `userInfo` to the connection URI.

For example:

CREATE TABLE WITH (
     uri = 'mongodb://localhost:27017,localhost:27018/?authsource=nonadmin&replicaset=rs',
     username = 'user',
     password = 'pwd'
);

It's equivalent to the following, which has no connection options:

CREATE TABLE WITH (
     hosts = 'localhost:27017,localhost:27018',
     username = 'user',
     password = 'pwd'
);

What do you think?

Member

I'm fine with both connection.options and the mongodb.connect. prefix. Neither way requires us to do config mapping, and we can transparently pass MongoDB's options through. What do you think about the two ways? @leonardBang @XuQianJin-Stars

@Jiabao-Sun a side question: does MongoDB allow lower-case options? e.g. mongodb://db1.example.net:27017,db2.example.net:2500/?replicaset=test&connecttimeoutms=300000

Contributor Author

@wuchong
Yes, it's case-insensitive.

As connection-string-options says:

Connection options are pairs in the following form: name=value.
The option name is case insensitive when using a driver.

Contributor Author

@wuchong @leonardBang @XuQianJin-Stars

The connection.options option has been added in the latest commit.
If you think the mongodb.connect. prefix is better, I can make the change.

@Jiabao-Sun Jiabao-Sun force-pushed the flink-connector-mongodb-cdc-latest branch from 9d719cf to 4e9e630 Compare September 10, 2021 09:13
Member

@wuchong wuchong left a comment

The changes look good to me. I will merge it once the build passes. Thanks for the great work and long patience, @Jiabao-Sun.

@Jiabao-Sun
Contributor Author

Thanks @wuchong @leonardBang @XuQianJin-Stars @carloscbl @yangxusun9 for your help.

It is a great honor to be able to complete it.
The documentation is ready too.

Cheers!

@wuchong wuchong merged commit 40a037d into apache:master Sep 10, 2021
@Jiabao-Sun Jiabao-Sun deleted the flink-connector-mongodb-cdc-latest branch September 10, 2021 10:25
Jiabao-Sun added a commit to Jiabao-Sun/flink-cdc-connectors that referenced this pull request Oct 11, 2021
ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025