The Relationship Between MongoDB Features and Data Models¶
Data modeling in MongoDB depends not only on your application's data requirements but also on characteristics of MongoDB itself. For example, some data models enable more efficient queries, some increase the throughput of inserts and updates, and others distribute operations more effectively across the shards of a cluster.
These factors are operational or address requirements that arise outside of the application but impact the performance of MongoDB based applications. When developing a data model, analyze all of your application’s read and write operations in conjunction with the following considerations.
Document Growth¶
Changed in version 3.0.0.
Some updates to documents can increase the size of documents. These updates include pushing elements to an array (i.e. $push) and adding new fields to a document.
For example, if your application's update operations cause documents to grow in size, you may want to redesign your data model to use references between data in distinct documents rather than an embedded, denormalized structure.
MongoDB adaptively adjusts the amount of automatic padding to reduce occurrences of document relocation.
You may also use a pre-allocation strategy to explicitly avoid document growth. Refer to the Pre-Aggregated Reports Use Case for an example of the pre-allocation approach to handling document growth.
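As a concrete illustration of pre-allocation, the sketch below builds a daily report document with one zeroed counter per hour, so that later `$inc` updates only change values already present and never grow the document. The collection name and field layout here are hypothetical, not taken from the source.

```javascript
// Illustrative sketch: pre-allocate a daily report document with one
// zeroed counter per hour, so later $inc updates never grow the document.
function preallocateDailyReport(date) {
  const hourly = {};
  for (let h = 0; h < 24; h++) {
    hourly[String(h)] = 0; // reserve every hour's slot up front
  }
  return { _id: date, hourly: hourly };
}

// In the mongo shell you would insert one such document per day, e.g.:
//   db.reports.insert(preallocateDailyReport("2015-06-01"))
const doc = preallocateDailyReport("2015-06-01");
console.log(Object.keys(doc.hourly).length); // 24
```

Because every field the day will ever need exists from the start, subsequent in-place counter updates do not change the document's size.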
See MMAPv1 Storage Engine for more information on MMAPv1.
Atomicity¶
In MongoDB, operations are atomic at the document level. No single write operation can modify more than one document. Even operations that modify multiple documents in a collection still operate on one document at a time. [1] Ensure that fields that must be modified together in a single atomic operation are defined in the same document. If your application can tolerate non-atomic updates for two pieces of data, store these data in separate documents.
A data model that embeds related data in a single document facilitates these atomic operations. For data models that store related data across separate documents via references, the application must issue additional read and write operations to retrieve and modify the related data.
For an example data model that provides atomic updates to a single document, see Model Data for Atomic Operations.
[1] Document-level atomic operations include all operations within a single document: even if an operation modifies multiple embedded documents, the operations remain atomic as long as they all occur within that single document.
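The benefit of embedding can be sketched in plain JavaScript: with related fields kept in one document, a single operation changes them together, and no reader can observe one change without the other. The book/checkout field names below are illustrative.

```javascript
// Toy sketch (not MongoDB internals): because "available" and "checkout"
// live in the same document, one update changes both together.
function reserveCopy(bookDoc, memberId) {
  if (bookDoc.available <= 0) return bookDoc; // nothing left to reserve
  // Both fields change in the same operation on the same document.
  return {
    ...bookDoc,
    available: bookDoc.available - 1,
    checkout: [...bookDoc.checkout, { by: memberId }],
  };
}

const book = { _id: 1, title: "Example", available: 3, checkout: [] };
const after = reserveCopy(book, "joe");
console.log(after.available, after.checkout.length); // 2 1
```

If the availability count and the checkout records lived in separate documents instead, the application would need two writes and could be observed between them.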
Sharding¶
MongoDB uses sharding to provide horizontal scaling. Sharded clusters can support deployments with very large data sets and high-throughput read and write operations. Sharding lets you partition a collection within a database so that the collection's documents are distributed across a number of mongod instances, or shards.
MongoDB distributes data and application requests across the cluster according to the shard key. Selecting an appropriate shard key has significant performance implications and can enable or prevent targeted queries and increased write capacity, so consider carefully the field or fields to use as the shard key.
For more information, see /core/sharding-introduction and Shard Keys.
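To see why shard key choice matters for distribution, consider a toy routing function (this is not MongoDB's real hashing, just an illustration): each shard key value is hashed to pick a shard, so a key with many distinct, well-spread values distributes documents across all shards, while a key with few values concentrates them.

```javascript
// Toy illustration of shard-key routing (not MongoDB's actual algorithm):
// the shard key value alone decides which shard stores a document.
function shardFor(shardKeyValue, numShards) {
  let h = 0;
  for (const ch of String(shardKeyValue)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple deterministic string hash
  }
  return h % numShards;
}

// Many distinct user ids spread documents across all three shards.
const counts = [0, 0, 0];
for (let i = 0; i < 300; i++) counts[shardFor("user" + i, 3)]++;
console.log(counts);
```

A monotonically increasing key (such as a timestamp used directly as a range-based shard key) would instead send every new document to the same shard, which is one reason shard key selection deserves care.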
Indexes¶
Use indexes to improve performance for common operations. Create indexes on fields that appear frequently in queries and for operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
As you create indexes, consider the following behaviors of indexes:
- Each index requires at least 8 kB of data space.
- Adding an index has some negative performance impact on write operations. For collections with a high write-to-read ratio, indexes are expensive since each insert must also update every index.
- Each index consumes disk space and memory (for active indexes). Indexes can consume significant amounts of these resources, so track this usage and plan for it, particularly when calculating the size of your working set.
For more information, see Indexing Strategies as well as Analyze Query Performance. In addition, the MongoDB database profiler can help you identify inefficient queries that do not use indexes well.
Number of Collections¶
In certain situations, you might choose to store related information in several collections rather than in a single collection.
Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:
{ log: "dev", ts: ..., info: ... }
{ log: "debug", ts: ..., info: ... }
If the total number of documents is low, you may group documents into collection by type. For logs, consider maintaining distinct log collections, such as logs_dev and logs_debug. The logs_dev collection would contain only the documents related to the dev environment.
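A hypothetical helper makes the routing explicit: the application derives the collection name from the environment, so each environment's documents land in their own collection.

```javascript
// Hypothetical helper matching the logs_dev / logs_debug split above:
// choose a per-environment log collection name.
function logCollectionFor(env) {
  return "logs_" + env;
}

// In the mongo shell, db[logCollectionFor("dev")].insert(...) would then
// write only to the dev environment's collection.
console.log(logCollectionFor("dev"));   // logs_dev
console.log(logCollectionFor("debug")); // logs_debug
```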
Generally, having a large number of collections carries no significant performance penalty and can result in very good performance in some scenarios. Distinct collections are very helpful for high-throughput batch processing.
When using models that have a large number of collections, consider the following behaviors:
- Each collection has a certain minimum overhead of a few kilobytes.
- Each index, including the index on _id, requires at least 8 kB of data space.
- Each MongoDB database has one and only one namespace file (i.e. <database>.ns) that stores all meta-data for that database. Each index and collection has its own entry in the namespace file. The namespace file has a size limit; for details, see Size of Namespace File.
Because the namespace file has a limited size, MongoDB has limits on the number of namespaces. You may wish to know the current number of namespaces in order to determine how close you are to this limit; to check, run the following in the mongo shell:
db.system.namespaces.count()
The number of namespace entries a namespace file can hold depends on the size of the <database>.ns file. By default, the namespace file is limited to 16 MB.
To change the size of the new namespace file, start the server with the option --nssize <new size MB>. For existing databases, after starting up the server with --nssize, run the db.repairDatabase() command from the mongo shell. For impacts and considerations on running db.repairDatabase(), see repairDatabase.
Collection Contains Large Number of Small Documents¶
You should consider embedding for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider “rolling-up” the small documents into larger documents that contain an array of embedded documents.
“Rolling up” these small documents into logical groupings means that queries to retrieve a group of documents involve sequential reads and fewer random disk accesses. Additionally, “rolling up” documents and moving common fields to the larger document benefit the index on these fields. There would be fewer copies of the common fields and there would be fewer associated key entries in the corresponding index. See Indexes for more information on indexes.
However, if you often only need to retrieve a subset of the documents within the group, then “rolling-up” the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.
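The "roll-up" described above can be sketched as a plain transformation: many small documents sharing a grouping field become one larger document per group, holding the remaining fields as an embedded array. The field names here are illustrative.

```javascript
// Sketch of "rolling up" small documents: group by a shared field and
// embed the remaining fields of each small document in an array.
function rollUp(smallDocs, groupField) {
  const groups = {};
  for (const doc of smallDocs) {
    const key = doc[groupField];
    if (!groups[key]) groups[key] = { _id: key, items: [] };
    const { [groupField]: _ignored, ...rest } = doc; // drop the group field
    groups[key].items.push(rest);
  }
  return Object.values(groups);
}

const small = [
  { user: "ann", page: "/a" },
  { user: "ann", page: "/b" },
  { user: "bob", page: "/a" },
];
console.log(rollUp(small, "user").length); // 2
```

After the roll-up, retrieving all of one user's activity reads a single larger document instead of many scattered small ones.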
Storage Optimization for Small Documents¶
Each MongoDB document contains a certain amount of overhead. This overhead is normally insignificant but becomes significant if all documents are just a few bytes, as might be the case if the documents in your collection only have one or two fields.
Consider the following suggestions and strategies for optimizing storage utilization for these collections:
Use the _id field explicitly.
MongoDB clients automatically add an _id field to each document and generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller documents this may account for a significant amount of space.
To optimize storage use, users can specify a value for the _id field explicitly when inserting documents into the collection. This strategy allows applications to store a value in the _id field that would have occupied space in another portion of the document.
You can store any value in the _id field, but because this value serves as a primary key for documents in the collection, it must uniquely identify them. If the field’s value is not unique, then it cannot serve as a primary key as there would be collisions in the collection.
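A small sketch of this strategy: instead of storing a unique natural key (here a hypothetical sku field) alongside an auto-generated ObjectId, the application promotes that key to _id, eliminating one field per document.

```javascript
// Sketch: reuse a value the application already stores as the _id,
// rather than keeping it in a separate field next to an ObjectId.
// The field names are hypothetical.
function withNaturalId(doc, keyField) {
  const { [keyField]: key, ...rest } = doc; // pull out the natural key
  return { _id: key, ...rest };
}

const original = { sku: "ABC-123", qty: 4 };
const optimized = withNaturalId(original, "sku");
console.log(optimized); // { _id: 'ABC-123', qty: 4 }
```

This only works when the chosen field is unique across the collection, since _id serves as the primary key.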
Use shorter field names.
Note
Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not reduce the size of indexes, because indexes have a predefined structure.
In general, it is not necessary to use short field names.
MongoDB stores all field names in every document. For most documents, this represents a small fraction of the space used by a document; however, for small documents the field names may represent a proportionally large amount of space. Consider a collection of small documents that resemble the following:
{ last_name : "Smith", best_score: 3.9 }
If you shorten the field named last_name to lname and the field named best_score to score, as follows, you could save 9 bytes per document.
{ lname : "Smith", score : 3.9 }
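The arithmetic behind the 9-byte figure can be checked directly: BSON stores each field name once per document, so the per-document saving is the total reduction in field name length.

```javascript
// Check the per-document saving from renaming fields: BSON stores each
// field name in every document, so the saving is the sum of the
// differences in name length.
function fieldNameSavings(renames) {
  let saved = 0;
  for (const [longName, shortName] of renames) {
    saved += longName.length - shortName.length;
  }
  return saved;
}

const saved = fieldNameSavings([
  ["last_name", "lname"],   // 9 - 5 = 4 bytes
  ["best_score", "score"],  // 10 - 5 = 5 bytes
]);
console.log(saved); // 9
```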
Embed documents.
In some cases you may want to embed documents in other documents and save on the per-document overhead. See Collection Contains Large Number of Small Documents.
Data Lifecycle Management¶
Data modeling decisions should take data lifecycle management into consideration.
The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
Additionally, if your application only uses recently inserted documents, consider capped collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
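The FIFO behavior of a capped collection can be modeled with a toy fixed-size buffer (this is only an analogy, not how the storage engine actually works): once capacity is reached, inserting a new document evicts the oldest, and reads return documents in insertion order.

```javascript
// Toy model of a capped collection's FIFO behavior (an analogy only,
// not the real storage engine): fixed capacity, oldest evicted first,
// reads in insertion order.
function makeCapped(maxDocs) {
  const docs = [];
  return {
    insert(doc) {
      docs.push(doc);
      if (docs.length > maxDocs) docs.shift(); // drop the oldest document
    },
    find() {
      return docs.slice(); // insertion order
    },
  };
}

const capped = makeCapped(3);
for (const n of [1, 2, 3, 4]) capped.insert({ n });
console.log(capped.find().map(d => d.n)); // [ 2, 3, 4 ]
```

A real capped collection is created with a fixed size in bytes (and optionally a maximum document count) and enforces this eviction automatically.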