This article is a shortlisted case-award entry from the "2021 MongoDB Technical Practice and Application Case Collection".
Author: Ren Kun
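1. Background
We run a 4-shard MongoDB cluster in production, version Percona 4.2. Looking at the real-time QPS, update volume on shard1 was very high, while the other three shards saw very few updates.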
[Figures: real-time QPS of shard1 and shard2]
Either the data of some sharded collection is unevenly distributed, or some collection does not have sharding enabled at all.
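The imbalance described above can also be quantified rather than read off a dashboard. The following is a minimal sketch, assuming the update counter of each shard is sampled twice (for example from db.serverStatus().opcounters on each shard primary). The helper names updateRate and busiestShard are hypothetical and not part of the original procedure.

```javascript
// Hypothetical helper: updates per second between two opcounters samples,
// e.g. two readings of db.serverStatus().opcounters taken `seconds` apart.
function updateRate(before, after, seconds) {
  return (after.update - before.update) / seconds;
}

// Hypothetical helper: given { shardName: updatesPerSecond }, return the
// busiest shard and its share of the cluster-wide update rate.
function busiestShard(ratesByShard) {
  let total = 0;
  let top = null;
  for (const name of Object.keys(ratesByShard)) {
    const rate = ratesByShard[name];
    total += rate;
    if (top === null || rate > ratesByShard[top]) {
      top = name;
    }
  }
  return { shard: top, share: total > 0 ? ratesByShard[top] / total : 0 };
}
```

In a balanced 4-shard cluster each shard would carry a share of roughly 0.25, so a share close to 1 points at a single hot shard.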
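2. Diagnosis
First, check the large collections. Log in to mongos, switch to the database, and run the following commands. Each collection prints one line containing its name and storage size (MB):
var collNames = db.getCollectionNames();
for (var i = 0; i < collNames.length; i++) {
    var coll = db.getCollection(collNames[i]);
    var stats = coll.stats(1024 * 1024);  // scale sizes to MB
    print(stats.ns, stats.storageSize);
}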
Pick out the 10 largest collections and run db.table.getShardDistribution() on each of them. Every large collection turned out to be evenly distributed.
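Picking the largest collections by eye gets tedious when a database has many of them. The sketch below sorts the stats documents by storageSize and keeps the top n. The function name topCollectionsBySize is hypothetical, and the shell part assumes the database contains no views, since stats() fails on a view.

```javascript
// Hypothetical helper: return the n largest entries by storageSize,
// without modifying the input array.
function topCollectionsBySize(statsList, n) {
  return statsList
    .slice()
    .sort((a, b) => b.storageSize - a.storageSize)
    .slice(0, n);
}

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  const statsList = db.getCollectionNames().map(
    (name) => db.getCollection(name).stats(1024 * 1024) // sizes in MB
  );
  topCollectionsBySize(statsList, 10).forEach((s) => print(s.ns, s.storageSize));
}
```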
Check mongod.log on shard1. No slow queries were found; the slow-query threshold is currently 100 ms.
Check the oplog on shard1:
use local
db.oplog.rs.find({ "op": "u" }).sort({ $natural: -1 }).limit(10)  // fetch the 10 most recent update entries
Update oplog entries:
{ "ts" : Timestamp(1634090487, 1781), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505237968" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1780), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505237974" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1779), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505237975" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1778), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505237976" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1777), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505238113" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1776), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505720693" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1775), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505726096" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1774), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505726320" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1773), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505750437" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1772), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505750438" }, "wall" : ISODate("2021-10-13T02:01:27.707Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
{ "ts" : Timestamp(1634090487, 1771), "t" : NumberLong(4), "h" : NumberLong(0), "v" : 2, "op" : "u", "ns" : "prod.prod_XXX", "ui" : UUID("22154461-6305-491d-8f1d-9a1630753508"), "o2" : { "_id" : "2021-10-10 18:00:00#505750439" }, "wall" : ISODate("2021-10-13T02:01:27.706Z"), "o" : { "$v" : 1, "$set" : { "__system" : { "pull_time" : "2021-10-13 10:01:27" } } } }
All of them are updates against the prod_XXX collection, and that collection is not sharded.
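Reading ten raw oplog entries works when one collection dominates this clearly. For a less obvious case, the entries can be tallied by namespace. This is a minimal sketch: countByNamespace is a hypothetical helper, and the sample size of 1000 entries is an arbitrary assumption rather than a value from the original investigation.

```javascript
// Hypothetical helper: count oplog entries per namespace ("ns" field),
// returned in descending order of count.
function countByNamespace(entries) {
  const counts = {};
  for (const e of entries) {
    counts[e.ns] = (counts[e.ns] || 0) + 1;
  }
  return Object.keys(counts)
    .map((ns) => ({ ns: ns, count: counts[ns] }))
    .sort((a, b) => b.count - a.count);
}

// Runs only inside the mongo shell, connected to the shard's mongod.
if (typeof db !== "undefined") {
  const recent = db.getSiblingDB("local").getCollection("oplog.rs")
    .find({ op: "u" })
    .sort({ $natural: -1 })
    .limit(1000) // assumed sample size
    .toArray();
  countByNamespace(recent).forEach((r) => print(r.ns, r.count));
}
```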
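After confirming with the developers, we created a hashed index on its _id field and enabled sharding.
Log in to mongos, switch to the database, and run: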
db.prod_XXX.ensureIndex({ _id: "hashed" }, { background: true })
sh.shardCollection("prod.prod_XXX", { _id: "hashed" })
The update load visibly evened out across the shards, and the problem was resolved.
[Figures: real-time QPS of shard1 and shard2 after sharding was enabled]
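3. Summary
This case is simple and quite common. If TPS is unbalanced across the shards of a MongoDB cluster, the steps above can be used to locate and fix the problem quickly.
People used to MySQL often find MongoDB awkward at first, especially because much of the syntax is hard to remember, for example the commands in this article for listing collection sizes and inspecting the oplog. It is best to keep notes and look them up when needed.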
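About the author:
Ren Kun, based in Zhuhai, has worked as a dedicated Oracle DBA and then as a MySQL DBA, and is now mainly responsible for maintaining MySQL, MongoDB, Redis, and ClickHouse.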
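A routine inspection could also be added: in multi-shard clusters, any collection that exceeds a given size in GB but has not been sharded can be flagged in the inspection report.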
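Such an inspection might look like the sketch below. It assumes the stats documents returned through mongos carry a boolean sharded field, as they do on the 4.2 release used here. The function name findLargeUnsharded and the 10 GB threshold are hypothetical; the original does not specify a threshold.

```javascript
// Hypothetical helper: keep collections that are not sharded and whose
// storageSize (in MB) exceeds the given threshold.
function findLargeUnsharded(statsList, thresholdMB) {
  return statsList.filter((s) => !s.sharded && s.storageSize > thresholdMB);
}

// Runs only inside the mongo shell, connected to mongos.
if (typeof db !== "undefined") {
  const thresholdMB = 10 * 1024; // assumed threshold of 10 GB
  const statsList = db.getCollectionNames().map(
    (name) => db.getCollection(name).stats(1024 * 1024) // sizes in MB
  );
  findLargeUnsharded(statsList, thresholdMB).forEach(
    (s) => print("unsharded large collection:", s.ns, s.storageSize)
  );
}
```

The check covers one database at a time; a real inspection job would presumably loop over every database in the cluster and send the result to whatever alerting channel is in use.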