FAQ: MongoDB Diagnostics¶
This document provides answers to common diagnostic questions and issues.
If you don't find the answer you're looking for, check the complete list of FAQs or post your question to the MongoDB User Mailing List.
Where can I find information about a ``mongod`` process that stopped running unexpectedly?¶
If mongod shuts down unexpectedly on a UNIX or UNIX-based platform, and if mongod fails to log a shutdown or error message, check your system logs for messages pertaining to MongoDB. For example, if the logs are located in /var/log/messages, use the following commands:
sudo grep mongod /var/log/messages
sudo grep score /var/log/messages
Does TCP keepalive time affect sharded clusters and replica sets?¶
If you experience socket errors between members of a sharded cluster or replica set that do not have another reasonable cause, check the TCP keepalive value, which Linux systems store as the tcp_keepalive_time value. A common keepalive period is 7200 seconds (2 hours); however, different distributions and OS X may have different settings. For MongoDB, you will have a better experience with shorter keepalive periods, on the order of 300 seconds (5 minutes).
Note
For non-Linux systems, values greater than or equal to 600 seconds (10 minutes) will be ignored by mongod and mongos. For Linux, values greater than 300 seconds (5 minutes) will be overridden on the mongod and mongos sockets with a maximum of 300 seconds.
On Linux systems:
To view the keep alive setting, you can use one of the following commands:
sysctl net.ipv4.tcp_keepalive_time
Or:
cat /proc/sys/net/ipv4/tcp_keepalive_time
To change the tcp_keepalive_time value, you can use one of the following commands:
sudo sysctl -w net.ipv4.tcp_keepalive_time=<value>
Or:
echo <value> | sudo tee /proc/sys/net/ipv4/tcp_keepalive_time
These operations do not persist across system reboots. To persist the setting, add the following line to /etc/sysctl.conf:
net.ipv4.tcp_keepalive_time = <value>
On Linux, mongod and mongos processes limit the keepalive to a maximum of 300 seconds (5 minutes) on their own sockets by overriding keepalive values greater than 5 minutes.
If you experience keepalive-related issues with your replica set or sharded cluster, you will need to alter the tcp_keepalive_time value on all machines hosting MongoDB processes. This includes all machines hosting mongos or mongod processes.
For OS X systems:
To view the keep alive setting, issue the following command:
sysctl net.inet.tcp.keepinit
To set a shorter keep alive period, use the following invocation:
sysctl -w net.inet.tcp.keepinit=<value>
The above method for setting the TCP keepalive is not persistent; you will need to reset the value each time you reboot or restart a system. See your operating system’s documentation for instructions on setting the TCP keepalive value persistently.
For Windows systems:
To view the keep alive setting, issue the following command:
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime
The registry value is not present by default. The system default, used if the value is absent, is 7200000 milliseconds or 0x6ddd00 in hexadecimal.
To change the KeepAliveTime value, use the following command in an Administrator Command Prompt, where <value> is expressed in hexadecimal (e.g. 120000 is 0x1d4c0):
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ /t REG_DWORD /v KeepAliveTime /d <value>
Windows users should consider the `Windows Server Technet Article on KeepAliveTime configuration <http://technet.microsoft.com/en-us/library/dd349797.aspx#BKMK_2>`_ for more information on setting keep alive for MongoDB deployments on Windows systems.
You will need to restart mongod and mongos servers for new system-wide keepalive settings to take effect.
Why does MongoDB log so many “Connection Accepted” events?¶
If you see a very large number of connection and re-connection messages in your MongoDB log, then clients are frequently connecting and disconnecting to the MongoDB server. This is normal behavior for applications that do not use request pooling, such as CGI. Consider using FastCGI, an Apache Module, or some other kind of persistent application server to decrease the connection overhead.
If these connections do not impact your performance you can use the run-time quiet option or the command-line option --quiet to suppress these messages from the log.
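Before suppressing these messages, it can help to confirm how many connections are actually open; the connections document in the serverStatus output reports this. A minimal check from the mongo shell (the fields shown are the standard serverStatus connection counters):
db.serverStatus().connections
// returns a document such as: { "current" : <num>, "available" : <num>, "totalCreated" : NumberLong(<num>) }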
What tools are available for monitoring MongoDB?¶
The MongoDB Management Service includes monitoring. MMS Monitoring is a free, hosted service for monitoring MongoDB deployments. A full list of third-party tools is available as part of the Monitoring for MongoDB documentation. Also consider the `MMS documentation <http://mms.mongodb.com/help/>`_.
Memory Diagnostics¶
Memory Diagnostics for the MMAPv1 Storage Engine¶
Do I need to configure swap space?¶
Systems should generally be configured with swap space. Without swap, your system may not be reliable in some situations with extreme memory constraints, memory leaks, or multiple programs using the same memory. Think of swap space as something like a steam release valve that allows the system to release extra pressure without affecting the overall functioning of the system.
Nevertheless, systems running MongoDB do not need swap for routine operation. Database files are memory-mapped and constitute most of MongoDB's memory use. Therefore, it is unlikely that mongod will ever use any swap space in normal operation. The operating system will release memory from the memory-mapped files without needing swap, and MongoDB can write data to the data files without needing the swap system.
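If you want to see how this plays out on a running system, the mem document in the serverStatus output reports resident, virtual, and memory-mapped sizes in megabytes. A quick check from the mongo shell (the mapped fields are reported for the MMAPv1 storage engine):
db.serverStatus().mem
// e.g. { "bits" : 64, "resident" : <num>, "virtual" : <num>, "supported" : true, "mapped" : <num>, "mappedWithJournal" : <num> }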
What is a "working set" and how can I estimate its size?¶
The working set for a MongoDB database is the portion of your data that clients access most often. You can estimate the size of the working set using the workingSet document in the output of serverStatus.
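To return serverStatus with the workingSet document, issue a command of the following form from the mongo shell (the workingSet option applies to the MMAPv1 storage engine):
db.runCommand( { serverStatus: 1, workingSet: 1 } )
In the output, workingSet.pagesInMemory multiplied by the system page size (commonly 4096 bytes) gives a rough estimate of the amount of data touched over the workingSet.overSeconds sampling window.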
Must my working set size fit RAM?¶
Your working set should stay in memory to achieve good performance. Otherwise many random disk IO's will occur, and unless you are using SSDs, this can be quite slow.
One area to watch specifically when managing the size of your working set is index access patterns. If you are inserting into indexes at random locations (as would happen with ids that are randomly generated by hashes), you will continually be updating the whole index. If instead you can create your ids in approximately ascending order (for example, day concatenated with a random id), all the updates will occur at the right side of the b-tree and the working set size for index pages will be much smaller.
It is fine if databases and thus virtual size are much larger than RAM.
How do I calculate how much RAM I need for my application?¶
The amount of RAM you need depends on several factors, including but not limited to:
- The relationship between database storage and working set.
- The operating system's cache strategy for LRU (Least Recently Used).
- The impact of journaling.
- The number or rate of page faults and other MMS gauges to detect when you need more RAM.
- Each database connection thread will need up to 1 MB of RAM.
MongoDB defers to the operating system when loading data into memory from disk. It simply memory maps all its data files and relies on the operating system to cache data. The OS typically evicts the least-recently-used data from RAM when it runs low on memory. For example, if clients access indexes more frequently than documents, then indexes will be more likely to stay in RAM, but it depends on your particular usage.
To calculate how much RAM you need, you must calculate your working set size, or the portion of your data that clients use most often. This depends on your access patterns, what indexes you have, and the size of your documents. Because MongoDB uses a thread per connection model, each database connection also will need up to 1 MB of RAM, whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault rates rise higher than that, you risk performance degradation. This is less critical with SSD drives than with spinning disks.
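In addition to MMS, you can sample the cumulative page fault counter in the extra_info section of serverStatus to estimate the fault rate. A minimal sketch from the mongo shell:
// Sample the cumulative page fault counter twice to estimate a rate.
var before = db.serverStatus().extra_info.page_faults
sleep(60 * 1000)   // wait 60 seconds
var after = db.serverStatus().extra_info.page_faults
print("page faults per second: " + (after - before) / 60)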
Memory Diagnostics for the WiredTiger Storage Engine¶
Must my working set size fit RAM?¶
No.
How do I calculate how much RAM I need for my application?¶
With WiredTiger, MongoDB utilizes both the WiredTiger cache and the filesystem cache.
Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes. Data in the filesystem cache is compressed.
To see statistics on the cache and eviction, use the serverStatus command. The wiredTiger.cache field holds the information on the cache and eviction:
...
"wiredTiger" : {
...
"cache" : {
"tracked dirty bytes in the cache" : <num>,
"bytes currently in the cache" : <num>,
"maximum bytes configured" : <num>,
"bytes read into cache" :<num>,
"bytes written from cache" : <num>,
"pages evicted by application threads" : <num>,
"checkpoint blocked page eviction" : <num>,
"unmodified pages evicted" : <num>,
"page split during eviction deepened the tree" : <num>,
"modified pages evicted" : <num>,
"pages selected for eviction unable to be evicted" : <num>,
"pages evicted because they exceeded the in-memory maximum" : <num>,,
"pages evicted because they had chains of deleted items" : <num>,
"failed eviction of pages that exceeded the in-memory maximum" : <num>,
"hazard pointer blocked page eviction" : <num>,
"internal pages evicted" : <num>,
"maximum page size at eviction" : <num>,
"eviction server candidate queue empty when topping up" : <num>,
"eviction server candidate queue not empty when topping up" : <num>,
"eviction server evicting pages" : <num>,
"eviction server populating queue, but not evicting pages" : <num>,
"eviction server unable to reach eviction goal" : <num>,
"pages split during eviction" : <num>,
"pages walked for eviction" : <num>,
"eviction worker thread evicting pages" : <num>,
"in-memory page splits" : <num>,
"percentage overhead" : <num>,
"tracked dirty pages in the cache" : <num>,
"pages currently held in the cache" : <num>,
"pages read into cache" : <num>,
"pages written from cache" : <num>,
},
...
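To relate the key numbers above without reading the full document, you can pull the cache fields out directly in the mongo shell; a small sketch (the field names are the serverStatus keys shown above):
var cache = db.serverStatus().wiredTiger.cache
print("bytes currently in the cache : " + cache["bytes currently in the cache"])
print("maximum bytes configured     : " + cache["maximum bytes configured"])
print("tracked dirty bytes in cache : " + cache["tracked dirty bytes in the cache"])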
Sharded Cluster Diagnostics¶
The two most important factors in maintaining a successful sharded cluster are:
- choosing an appropriate shard key, and
- sufficient capacity to support current and future operations.
You can prevent most issues encountered with sharding by ensuring that you choose the best possible shard key for your deployment and ensure that you are always adding additional capacity to your cluster well before the current resources become saturated. Continue reading for specific issues you may encounter in a production environment.
In a new sharded cluster, why does all data remain on one shard?¶
Your cluster must have sufficient data for sharding to make sense. Sharding works by migrating chunks between the shards until each shard has roughly the same number of chunks.
The default chunk size is 64 megabytes. MongoDB will not begin migrations until the imbalance of chunks in the cluster exceeds the migration threshold. While the default chunk size is configurable with the chunkSize setting, these behaviors help prevent unnecessary chunk migrations, which can degrade the performance of your cluster as a whole.
If you have just deployed a sharded cluster, make sure that you have enough data to make sharding effective. If you do not have sufficient data to create more than eight 64 megabyte chunks, then all data will remain on one shard. Either lower the chunk size setting, or add more data to the cluster.
As a related problem, the system will split chunks only on inserts or updates, which means that if you configure sharding but do not continue to issue insert or update operations, the database will not create any chunks. You can either wait until your application inserts data or split chunks manually.
Finally, if your shard key has a low cardinality, MongoDB may not be able to create sufficient splits among the data.
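To check whether chunks have actually been created and spread across shards, you can inspect the cluster from the mongo shell; a sketch, where test.users stands in for one of your own sharded collections:
sh.status()   // lists each shard and the number of chunks it holds
db.getSiblingDB("test").users.getShardDistribution()   // per-shard data size and chunk counts for one collection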
Why would one shard receive a disproportionate amount of traffic in a sharded cluster?¶
In some situations, a single shard or a subset of the cluster will receive a disproportionate portion of the traffic and workload. In almost all cases this is the result of a shard key that does not effectively allow write scaling.
It's also possible that you have "hot chunks." In this case, you may be able to solve the problem by splitting and then migrating parts of these chunks, as sketched below.
In the worst case, you may have to consider re-sharding your data and choosing a different shard key to correct this pattern.
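If hot chunks are the cause, the split-and-migrate approach mentioned above can be done manually from the mongo shell; a sketch, where records.users, the { zipcode: "53187" } value, and the destination shard name are placeholders for your own namespace, shard key, and shards:
// Split the chunk that contains this shard key value into two chunks.
sh.splitFind( "records.users", { zipcode: "53187" } )
// Move the chunk that contains this shard key value to another shard.
sh.moveChunk( "records.users", { zipcode: "53187" }, "shard0001" )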
What can prevent a sharded cluster from balancing?¶
If you have just deployed your sharded cluster, you may want to consider the troubleshooting suggestions for a new cluster where data remains on a single shard.
If the cluster was initially balanced, but later developed an uneven distribution of data, consider the following possible causes:
- You have deleted or removed a significant amount of data from the cluster. If you have added additional data, it may have a different distribution with regards to its shard key.
- Your shard key has low cardinality and MongoDB cannot split the chunks any further.
- Your data set is growing faster than the balancer can distribute data around the cluster. This is uncommon and typically is the result of:
- a balancing window that is too short, given the rate of data growth.
- an uneven distribution of write operations that requires more data migration. You may have to choose a different shard key to resolve this issue.
- poor network connectivity between shards, which may lead to chunk migrations that take too long to complete. Investigate your network configuration and interconnections between shards.
Why do chunk migrations affect sharded cluster performance?¶
If migrations impact your cluster or application’s performance, consider the following options, depending on the nature of the impact:
- If migrations only interrupt your clusters sporadically, you can limit the balancing window to prevent balancing activity during peak hours (see the sketch at the end of this section). Ensure that there is enough time remaining to keep the data from becoming out of balance again.
- If the balancer is always migrating chunks to the detriment of overall cluster performance:
- You may want to attempt decreasing the chunk size to limit the size of the migration.
- Your cluster may be over capacity, and you may want to attempt to add one or two shards to the cluster to distribute load.
It’s also possible that your shard key causes your application to direct all writes to a single shard. This kind of activity pattern can require the balancer to migrate most data soon after writing it. Consider redeploying your cluster with a shard key that provides better write scaling.
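If you decide to limit the balancing window, as suggested above, it is controlled by the balancer document in the config database; a sketch from the mongo shell (the start and stop times are placeholders):
use config
// Restrict balancing to the 23:00-06:00 window.
db.settings.update(
   { _id: "balancer" },
   { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
   { upsert: true }
)
// Check the balancer's current state.
sh.getBalancerState()
sh.isBalancerRunning()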