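We run MongoDB 3.4.2 across multiple data centers:
IDC-1: one primary, one secondary, and one arbiter
IDC-2: two secondaries that have no votes and cannot become primary. Together with IDC-1 they form a single replica set.

The network bandwidth between the two data centers is 10M/s. Everything worked fine after deployment, but after one network outage the secondaries in IDC-2 could no longer sync the primary's oplog. Output of rs.printSlaveReplicationInfo():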

soapa-rs:PRIMARY> rs.printSlaveReplicationInfo()
source: mongodb-api-secondary-1.novalocal:27017
        syncedTo: Thu Jul 25 2019 08:51:23 GMT+0800 (CST)
        0 secs (0 hrs) behind the primary
source: mongodb-api-secondary-3.novalocal:27017
        syncedTo: Wed Jul 24 2019 20:02:38 GMT+0800 (CST)
        46125 secs (12.81 hrs) behind the primary
source: mongodb-api-secondary-2.novalocal:27017
        syncedTo: Thu Jul 25 2019 07:51:23 GMT+0800 (CST)
        3600 secs (1 hrs) behind the primary
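
For reference, the same lag figures can be recomputed from the member optimes. This is only a minimal sketch: the helper name replicationLagSeconds is made up for illustration, it assumes input shaped like rs.status().members (name, stateStr, optimeDate), and the primary's optime is assumed to equal the syncedTo time of the secondary that reports 0 secs behind.

```javascript
// Hypothetical helper: compute how far each secondary is behind the primary.
// `members` is assumed to look like rs.status().members, where every entry
// has `name`, `stateStr` and `optimeDate` (a Date).
function replicationLagSeconds(members) {
  var primaries = members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  });
  if (primaries.length !== 1) {
    throw new Error("expected exactly one PRIMARY, found " + primaries.length);
  }
  var primaryTime = primaries[0].optimeDate.getTime();
  return members
    .filter(function (m) {
      return m.stateStr === "SECONDARY";
    })
    .map(function (m) {
      return {
        name: m.name,
        // Date values are in milliseconds; report whole seconds.
        lagSeconds: Math.round((primaryTime - m.optimeDate.getTime()) / 1000)
      };
    });
}

// Sample data copied from the output above (times are GMT+0800).
// The primary's optimeDate is an assumption, see the note before this block.
var sampleMembers = [
  { name: "mongodb-api-primary-1.novalocal:27017", stateStr: "PRIMARY",
    optimeDate: new Date("2019-07-25T08:51:23+08:00") },
  { name: "mongodb-api-secondary-1.novalocal:27017", stateStr: "SECONDARY",
    optimeDate: new Date("2019-07-25T08:51:23+08:00") },
  { name: "mongodb-api-secondary-3.novalocal:27017", stateStr: "SECONDARY",
    optimeDate: new Date("2019-07-24T20:02:38+08:00") },
  { name: "mongodb-api-secondary-2.novalocal:27017", stateStr: "SECONDARY",
    optimeDate: new Date("2019-07-25T07:51:23+08:00") }
];
var lags = replicationLagSeconds(sampleMembers);
```

On a live replica set the call would be replicationLagSeconds(rs.status().members); running it every few minutes shows whether a secondary's lag is shrinking or growing.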

The error messages are as follows:
2019-07-23T16:10:21.065+0800 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to mongodb-api-primary-1.novalocal:27017
2019-07-23T16:10:21.091+0800 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to mongodb-api-primary-1.novalocal:27017
2019-07-23T16:10:30.170+0800 I REPL [replication-139] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 6066561 -- target:mongodb-api-primary-1.novalocal:27017 db:local expDate:2019-07-23T16:10:30.170+0800 cmd:{ getMore: 12908190518, collection: "oplog.rs", maxTimeMS: 5000, term: 28, lastKnownCommittedOpTime: { ts: Timestamp 1563869416000|1, t: 28 } }. Last fetched optime (with hash): { ts: Timestamp 1563866709000|361, t: 27 }[-7765488033541930267]. Restarts remaining: 3
2019-07-23T16:10:30.170+0800 I REPL [replication-139] Scheduled new oplog query Fetcher source: mongodb-api-primary-1.novalocal:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1563866709000|361 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 28 } query metadata: { $replData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 6066613 -- target:mongodb-api-primary-1.novalocal:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1563866709000|361 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 28 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

The log shows a timeout error; rs.printSlaveReplicationInfo() reports the same output as shown above.

After some research we upgraded to 3.4.11. References:
https://jira.mongodb.org/browse/SERVER-27918
https://jira.mongodb.org/browse/SERVER-19605
https://github.com/mongodb/mongo/commit/5dbccb4a861aa2db993dd673097a1300bcdc9cca
We also ran the following command:
db.adminCommand( { setParameter: 1, oplogInitialFindMaxSeconds: 1000000  } )
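The timeout errors stopped, but the oplog is still not being synced, and the replication lag keeps growing. How can this be solved? It looks like an oplog problem. Our bandwidth is capped at 10M/s, but our data volume is nowhere near that. The mongo secondaries in IDC-2 saturate the link, yet no data gets synced and they fall further and further behind.

We can reproduce this in our development environment: throttling the secondary's bandwidth triggers the problem, and removing the limit makes it go away. What I expected is that the secondary would slowly catch up, because we have very few inserts and updates and very little data, so a slow sync should eventually get there. So why does the oplog position not move at all while the lag keeps growing? Any help is appreciated!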
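
To make that expectation concrete, here is a rough back-of-the-envelope sketch of the catch-up time I had in mind. The function names are made up for illustration. It assumes the average oplog write rate can be derived from the usedMB and timeDiff (seconds) fields returned by db.getReplicationInfo() on the primary, reads "10M/s" as 10 MB/s (this may be wrong), and ignores compression and protocol overhead.

```javascript
// Hypothetical helper: average oplog write rate in bytes per second.
// `info` is assumed to be the object returned by db.getReplicationInfo(),
// where usedMB is the oplog data size and timeDiff is the time span
// (in seconds) between the first and last oplog entries.
function oplogBytesPerSecond(info) {
  return (info.usedMB * 1024 * 1024) / info.timeDiff;
}

// Hypothetical helper: estimated seconds for a secondary to catch up.
// The backlog is the oplog written during the lag; it drains at the link
// speed minus the rate at which new oplog entries keep arriving.
function estimateCatchUpSeconds(lagSeconds, writeBytesPerSec, linkBytesPerSec) {
  if (linkBytesPerSec <= writeBytesPerSec) {
    return Infinity; // the link cannot keep up, so the lag never shrinks
  }
  var backlogBytes = lagSeconds * writeBytesPerSec;
  return backlogBytes / (linkBytesPerSec - writeBytesPerSec);
}

// Illustrative numbers only, not measured values from our cluster.
var exampleInfo = { usedMB: 1024, timeDiff: 86400 };
var writeRate = oplogBytesPerSecond(exampleInfo);
var linkRate = 10 * 1024 * 1024; // assumed meaning of "10M/s"
var catchUp = estimateCatchUpSeconds(46125, writeRate, linkRate);
```

With a write rate far below the link speed, this estimate gives a finite catch-up time, which is why I expected the lag to shrink rather than grow.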