0

MongoDB架构如下

  • mongos 4台 版本 v3.2.18
  • configsvr 3台 版本 v3.2.18
  • shardsvr  主从 2台 版本 v3.2.18
  • arbiter 1 台 版本 v3.2.12

问题现象

  • 现象1:使用 mongo 连接mongos shell  无法连接  一直 夯住(重启后正常)。
  • 现象2:shardsvr 节点执行 查询语句,多次执行 偶现夯住(执行多次 find,前几次正常,最后一次直接夯住)(重启后正常)
  • 现象3:节点 接收了 退出信号 kill 15  没有相应的退出逻辑 (kill9 后重启正常)

日志

  • 执行kill 15 后   仍然在接收请求 没有退出逻辑
  • kill15 夯住了, kill 9 后重启,  正常初始化

2022-01-26T19:05:08.962+0800 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2022-01-26T19:05:17.825+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.105:57177 #4025 (3295 connections now open)
2022-01-26T19:05:17.828+0800 I ACCESS [conn4025] Successfully authenticated as principal __system on local
2022-01-26T19:05:23.548+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.105:45051 #4026 (3296 connections now open)
2022-01-26T19:05:23.553+0800 I ACCESS [conn4026] Successfully authenticated as principal __system on local
2022-01-26T19:05:29.923+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53244 #4027 (3297 connections now open)
2022-01-26T19:05:29.926+0800 I ACCESS [conn4027] Successfully authenticated as principal __system on local
2022-01-26T19:05:33.300+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53245 #4028 (3298 connections now open)
2022-01-26T19:05:33.303+0800 I ACCESS [conn4028] Successfully authenticated as principal __system on local
2022-01-26T19:05:35.412+0800 I SHARDING [LockPinger] cluster 192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019 pinged successfully at 2022-01-26T19:05:35.394+0800 by distributed lock pinger ‘192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019/iZ23avfv9a3Z:27018:1643187408:2006958444’, sleeping for 30000ms
2022-01-26T19:05:38.557+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.106:43742 #4029 (3299 connections now open)
2022-01-26T19:05:38.562+0800 I ACCESS [conn4029] Successfully authenticated as principal __system on local
2022-01-26T19:05:42.049+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53248 #4030 (3300 connections now open)
2022-01-26T19:05:42.052+0800 I ACCESS [conn4030] Successfully authenticated as principal __system on local
2022-01-26T19:05:47.287+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53252 #4031 (3301 connections now open)
2022-01-26T19:05:47.290+0800 I ACCESS [conn4031] Successfully authenticated as principal __system on local
2022-01-26T19:05:53.583+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.106:43746 #4032 (3302 connections now open)
2022-01-26T19:05:53.588+0800 I ACCESS [conn4032] Successfully authenticated as principal __system on local
2022-01-26T19:05:54.088+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.106:43747 #4033 (3303 connections now open)
2022-01-26T19:05:54.094+0800 I ACCESS [conn4033] Successfully authenticated as principal __system on local
2022-01-26T19:06:03.971+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53256 #4034 (3304 connections now open)
2022-01-26T19:06:03.974+0800 I ACCESS [conn4034] Successfully authenticated as principal __system on local
2022-01-26T19:06:05.429+0800 I SHARDING [LockPinger] cluster 192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019 pinged successfully at 2022-01-26T19:06:05.412+0800 by distributed lock pinger ‘192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019/iZ23avfv9a3Z:27018:1643187408:2006958444’, sleeping for 30000ms
2022-01-26T19:06:06.109+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53257 #4035 (3305 connections now open)
2022-01-26T19:06:06.112+0800 I ACCESS [conn4035] Successfully authenticated as principal __system on local
2022-01-26T19:06:08.598+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.105:57187 #4036 (3306 connections now open)
2022-01-26T19:06:08.601+0800 I ACCESS [conn4036] Successfully authenticated as principal __system on local
2022-01-26T19:06:18.304+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53260 #4037 (3307 connections now open)
2022-01-26T19:06:18.307+0800 I ACCESS [conn4037] Successfully authenticated as principal __system on local
2022-01-26T19:06:23.636+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.105:57190 #4038 (3308 connections now open)
2022-01-26T19:06:23.639+0800 I ACCESS [conn4038] Successfully authenticated as principal __system on local
2022-01-26T19:06:30.381+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.106:43755 #4039 (3309 connections now open)
2022-01-26T19:06:30.387+0800 I ACCESS [conn4039] Successfully authenticated as principal __system on local
2022-01-26T19:06:35.447+0800 I SHARDING [LockPinger] cluster 192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019 pinged successfully at 2022-01-26T19:06:35.429+0800 by distributed lock pinger ‘192.168.11.105:27019,192.168.11.106:27019,192.168.21.105:27019/iZ23avfv9a3Z:27018:1643187408:2006958444’, sleeping for 30000ms
2022-01-26T19:06:38.674+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.105:57193 #4040 (3310 connections now open)
2022-01-26T19:06:38.677+0800 I ACCESS [conn4040] Successfully authenticated as principal __system on local
2022-01-26T19:06:42.510+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.105:57194 #4041 (3311 connections now open)
2022-01-26T19:06:42.513+0800 I ACCESS [conn4041] Successfully authenticated as principal __system on local
2022-01-26T19:06:53.689+0800 I NETWORK [initandlisten] connection accepted from 192.168.21.106:53267 #4042 (3312 connections now open)
2022-01-26T19:06:53.692+0800 I ACCESS [conn4042] Successfully authenticated as principal __system on local
2022-01-26T19:06:54.563+0800 I NETWORK [initandlisten] connection accepted from 192.168.11.105:45066 #4043 (3313 connections now open)
2022-01-26T19:06:54.567+0800 I ACCESS [conn4043] Successfully authenticated as principal __system on local
2022-01-26T19:07:10.351+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] MongoDB starting : pid=19220 port=27018 dbpath=/data/mongodb 64-bit host=iZ23avfv9a3Z
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] db version v3.2.18-1-ge8d4c69
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] git version: e8d4c699706448df0137def23532b5bc015f772d
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] allocator: tcmalloc
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] modules: none
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] build environment:
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] distmod: rhel70
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] distarch: x86_64
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] target_arch: x86_64
2022-01-26T19:07:10.371+0800 I CONTROL [initandlisten] options: { config: “shardsvr.conf”, net: { port: 27018, unixDomainSocket: { pathPrefix: “/home/admin/mongo/” }, wireObjectCheck: false }, operationProfiling: { mode: “off”, slowOpThresholdMs: 3000 }, processManagement: { fork: true, pidFilePath: “/home/admin/mongo/shard_pid” }, replication: { oplogSizeMB: 35000, replSetName: “rs1” }, security: { clusterAuthMode: “keyFile”, keyFile: “/home/admin/mongo/keyFile” }, setParameter: { enableLocalhostAuthBypass: “true” }, sharding: { clusterRole: “shardsvr” }, storage: { dbPath: “/data/mongodb”, engine: “wiredTiger”, indexBuildRetry: false, journal: { enabled: true }, syncPeriodSecs: 60.0, wiredTiger: { collectionConfig: { blockCompressor: “snappy” }, engineConfig: { cacheSizeGB: 8, directoryForIndexes: true, journalCompressor: “snappy”, statisticsLogDelaySecs: 60 }, indexConfig: { prefixCompression: true } } }, systemLog: { destination: “file”, logAppend: true, logRotate: “reopen”, path: “/home/admin/mongo/logs/shardsvr.log” } }
2022-01-26T19:07:10.391+0800 W – [initandlisten] Detected unclean shutdown – /data/mongodb/mongod.lock is not empty.
2022-01-26T19:07:10.391+0800 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2022-01-26T19:07:10.391+0800 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=60),verbose=(recovery_progress),
2022-01-26T19:07:10.566+0800 I STORAGE [initandlisten] WiredTiger [1643195230:566067][19220:0x7f00bf30cdc0], txn-recover: Main recovery loop: starting at 47446/75948416
2022-01-26T19:07:10.566+0800 I STORAGE [initandlisten] WiredTiger [1643195230:566396][19220:0x7f00bf30cdc0], txn-recover: Recovering log 47446 through 47447
2022-01-26T19:07:10.596+0800 I STORAGE [initandlisten] WiredTiger [1643195230:596976][19220:0x7f00bf30cdc0], txn-recover: Recovering log 47447 through 47447
2022-01-26T19:07:10.714+0800 I STORAGE [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2022-01-26T19:07:10.714+0800 I STORAGE [initandlisten] The size storer reports that the oplog contains 75795211 records totaling to 36765655454 bytes
2022-01-26T19:07:10.714+0800 I STORAGE [initandlisten] Sampling from the oplog between Jan 25 12:53:09:ab and Jan 26 17:11:23:a9 to determine where to place markers for truncation
2022-01-26T19:07:10.714+0800 I STORAGE [initandlisten] Taking 1001 samples and assuming that each section of oplog contains approximately 756602 records totaling to 367001662 bytes
2022-01-26T19:07:11.827+0800 I STORAGE [initandlisten] Placing a marker at optime Jan 25 13:10:43:159
2022-01-26T19:07:11.827+0800 I STORAGE [initandlisten] Placing a marker at optime Jan 25 13:28:36:1a2
2022-01-26T19:07:11.827+0800 I STORAGE [initandlisten] Placing a marker at optime Jan 25 13:46:44:204
2022-01-26T19:07:11.827+0800 I STORAGE [initandlisten] Placing a marker at optime Jan 25 14:03:43:4c
… 后面都是正常的 初始化流程。

目前没有排查头绪   没有异常日志,   运行节点    CPU  内存  负载  磁盘io  网络io   都处于空闲水位。

更改状态以发布