Text Indexes

Changed in version 3.2.

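MongoDB 3.2 introduces version 3 of the text index. The key features of the new version, each described in its own section on this page, are improved case insensitivity, diacritic insensitivity, and additional delimiters for tokenization.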

Starting in MongoDB 3.2, version 3 is the default version for new text indexes.

Overview

MongoDB provides text indexes to support text search queries on string content. text indexes can include any field whose value is a string or an array of string elements.

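Create a Text Index

Important

A collection can have at most one text index.

To create a text index, use the db.collection.createIndex() method. To index a field that contains a string or an array of string elements, include the field in the index specification and set its value to the string literal "text", as in the following example: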

db.reviews.createIndex( { comments: "text" } )

You can index multiple fields for the text index. The following example creates a text index on the fields subject and comments:

db.reviews.createIndex(
   {
     subject: "text",
     comments: "text"
   }
 )

A compound index can include text index keys in combination with ascending/descending index keys. For more information, see Compound Index.

To drop a text index, use the index name. See Use the Index Name to Drop a text Index for more information.

Specify Weights

For a text index, the weight of an indexed field denotes the significance of the field relative to the other indexed fields in terms of the text search score.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document. See $meta operator for details on returning and sorting by text scores.
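The weighted sum described above can be sketched in plain JavaScript. This is only an illustration of the "matches times weight, summed over fields" step; the function name weightedMatchSum and the sample counts are hypothetical, and the sketch does not reproduce how MongoDB derives the final score from that sum.

```javascript
// Illustrative sketch only: computes the per-document weighted sum
// (matches x weight, summed over the indexed fields).
// It is NOT MongoDB's full text score calculation.
function weightedMatchSum(matchesByField, weights) {
  let sum = 0;
  for (const field of Object.keys(matchesByField)) {
    // Fields without an explicit weight use the default weight of 1.
    const hasWeight = Object.prototype.hasOwnProperty.call(weights, field);
    const weight = hasWeight ? weights[field] : 1;
    sum += matchesByField[field] * weight;
  }
  return sum;
}

// Hypothetical document: 1 match in subject (weight 10), 3 in comments (default 1).
const exampleSum = weightedMatchSum({ subject: 1, comments: 3 }, { subject: 10 });
console.log(exampleSum); // 1*10 + 3*1 = 13
```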

The default weight is 1 for the indexed fields. To adjust the weights for the indexed fields, include the weights option in the db.collection.createIndex() method.

For more information on using weights to control the results of a text search, see Control Search Results with Weights.
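As a minimal sketch of the weights option, the following snippet builds an index specification in which matches in one field count more than matches in the others. The collection name blog, the field names, and the weight values are assumptions chosen for illustration; the guard around db only lets the same snippet load outside the mongo shell.

```javascript
// Hypothetical field names and weights, for illustration only.
var weightedKeys = { content: "text", keywords: "text", about: "text" };
var weightedOptions = {
  // content counts 10x and keywords 5x; about keeps the default weight of 1.
  weights: { content: 10, keywords: 5 }
};

// In the mongo shell, `db` is the current database handle.
if (typeof db !== "undefined") {
  db.blog.createIndex(weightedKeys, weightedOptions);
}
```

Fields that are omitted from weights, such as about here, keep the default weight of 1.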

Wildcard Text Indexes

When creating a text index on multiple fields, you can also use the wildcard specifier ($**). With a wildcard text index, MongoDB indexes every field that contains string data for each document in the collection. The following example creates a text index using the wildcard specifier:

db.collection.createIndex( { "$**": "text" } )

This index allows for text search on all fields with string content. Such an index can be useful with highly unstructured data if it is unclear which fields to include in the text index or for ad-hoc querying.

Wildcard text indexes are text indexes on multiple fields. As such, you can assign weights to specific fields during index creation to control the ranking of the results. For more information on using weights to control the results of a text search, see Control Search Results with Weights.

Wildcard text indexes, as with all text indexes, can be part of a compound index. For example, the following creates a compound index on the field a as well as the wildcard specifier:

db.collection.createIndex( { a: 1, "$**": "text" } )

As with all compound text indexes, because the field a precedes the text index key, a $text search that uses this index must include an equality match condition on a in the query predicate. For information on compound text indexes, see Compound Text Indexes.
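A query against such a compound index might look like the following sketch. The value 5 for a and the search term "coffee" are hypothetical; the point is only that the predicate pairs an equality condition on a with the $text expression.

```javascript
// Compound index: ascending key on a, followed by the wildcard text key.
var compoundKeys = { a: 1, "$**": "text" };

// Equality match on the preceding key a, plus the $text expression.
// The value 5 and the term "coffee" are hypothetical.
var compoundQuery = { a: 5, $text: { $search: "coffee" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.collection.createIndex(compoundKeys);
  db.collection.find(compoundQuery).forEach(printjson);
}
```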

Case Insensitivity

Changed in version 3.2.

The version 3 text index supports the common C, simple S, and for Turkish languages, the special T case foldings as specified in Unicode 8.0 Character Database Case Folding.

The case folding expands the case insensitivity of the text index to include characters with diacritics, such as é and É, and characters from non-Latin alphabets, such as “И” and “и” in the Cyrillic alphabet.

Version 3 of the text index is also diacritic insensitive. As such, the index also does not distinguish between é, É, e, and E.

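Text search supports searching string content in documents. MongoDB provides the $text operator to perform text search; the operator is available in both queries and the aggregation pipeline.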

Diacritic Insensitivity

Changed in version 3.2.

With version 3, the text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterparts, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in Unicode 8.0 Character Database Prop List.

If the indexed fields of a document contain the search term, the document is assigned a score. The score determines the relevance of the document to a given search query.

Previous versions of the text index treat characters with diacritics as distinct.
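The effect of case and diacritic insensitivity can be approximated in plain JavaScript. This is a rough sketch, not the index's implementation: the hypothetical helper foldForComparison decomposes characters, strips combining marks, and lower-cases the result, whereas MongoDB applies Unicode 8.0 case folding and the Prop List diacritic category.

```javascript
// Approximation only: NFD decomposition + removal of combining marks +
// toLowerCase(). MongoDB itself uses Unicode 8.0 case folding and strips
// characters categorized as diacritics, which is not identical to this.
function foldForComparison(text) {
  return text
    .normalize("NFD")          // "é" becomes "e" followed by U+0301
    .replace(/\p{M}/gu, "")    // drop combining marks
    .toLowerCase();            // stand-in for Unicode case folding
}

console.log(foldForComparison("É") === foldForComparison("e")); // true
console.log(foldForComparison("И") === foldForComparison("и")); // true
```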

Tokenization Delimiters

Changed in version 3.2.

For tokenization, version 3 text index uses the delimiters categorized under Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in Unicode 8.0 Character Database Prop List.

For example, if given a string "Il a dit qu'il «était le meilleur joueur du monde»", the text index treats «, », and spaces as delimiters.

Previous versions of the index treat « as part of the term "«était" and » as part of the term "monde»".
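The delimiter behavior can be illustrated with a small sketch that splits on several of the Unicode properties listed above. The function name tokenize is hypothetical, the Hyphen property is left out because JavaScript regular expressions do not expose it, and the JavaScript runtime uses its own Unicode version rather than 8.0, so this only approximates what the index does.

```javascript
// Approximate delimiter set: Dash, Pattern_Syntax, Quotation_Mark,
// Terminal_Punctuation, and White_Space (Hyphen is not available in
// JavaScript regular expressions and is omitted here).
const DELIMITERS =
  /[\p{Dash}\p{Pattern_Syntax}\p{Quotation_Mark}\p{Terminal_Punctuation}\p{White_Space}]+/u;

// Splits text on runs of delimiters and drops empty pieces.
function tokenize(text) {
  return text.split(DELIMITERS).filter(function (term) {
    return term.length > 0;
  });
}

console.log(tokenize("«était le meilleur joueur du monde»"));
// [ 'était', 'le', 'meilleur', 'joueur', 'du', 'monde' ]
```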

Index Entries

The text index tokenizes and stems the terms in the indexed fields for the index entries. It stores one index entry for each unique stemmed term in each indexed field for each document in the collection. The index uses simple language-specific suffix stemming.

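Supported Languages and Stop Words

The $text operator can search for words and phrases. A query matches only on complete stemmed words. For example, if a field in a document contains the word blueberry, a search on the term blue does not match the document. However, a search on either blueberry or blueberries matches.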

If you specify a language value of "none", then the text index uses simple tokenization with no list of stop words and no stemming.

To specify a language for the text index, see Specify a Language for Text Index.
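As a minimal sketch, the language can be set with the default_language option when the index is created. The collection name quotes and the field name content are assumptions for illustration; the value "none" selects simple tokenization with no stop words and no stemming, as described above.

```javascript
// Hypothetical collection and field names.
var languageKeys = { content: "text" };
// "none": simple tokenization, no stop word list, no stemming.
var languageOptions = { default_language: "none" };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.quotes.createIndex(languageKeys, languageOptions);
}
```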

sparse Property

text indexes are sparse by default and ignore the sparse: true option. If a document lacks a text index field (or the field is null or an empty array), MongoDB does not add an entry for the document to the text index. For inserts, MongoDB inserts the document but does not add to the text index.

For a compound index that includes a text index key along with keys of other types, only the text index field determines whether the index references a document. The other keys do not determine whether the index references the documents or not.

Restrictions

One Text Index Per Collection

A collection can have at most one text index.

Text Search and Hints

You cannot use hint() if the query includes a $text query expression.

Text Index and Sort

Sort operations cannot obtain sort order from a text index, even from a compound text index; i.e. sort operations cannot use the ordering in the text index.

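Compound Index

A compound index can include a text index key in combination with ascending/descending index keys. However, these compound indexes have the following restrictions:

  • A compound text index cannot include any other special index types, such as multikey or geospatial index fields.

  • If the compound text index includes keys that precede the text index key, a $text search must include equality match conditions on those preceding keys.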

See also Text Index and Sort for additional limitations.

For an example of a compound text index, see Limit the Number of Entries Scanned.

Drop a Text Index

To drop a text index, pass the name of the index to the db.collection.dropIndex() method. To get the name of the index, run the db.collection.getIndexes() method.

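For information on the default naming scheme for text indexes as well as overriding the default name, see Specify Name for text Index.

After a text index is created, MongoDB changes the space allocation method for all subsequent record allocations in the collection to usePowerOf2Sizes.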
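The lookup-then-drop steps can be sketched as follows. The helper findTextIndexName is hypothetical, and it assumes that a text index appears in the getIndexes() output with a key document containing _fts: "text"; the collection name reviews is reused from the earlier examples.

```javascript
// Hypothetical helper: returns the name of the first text index found in
// an array shaped like the output of db.collection.getIndexes(), or null.
// Assumes text indexes are listed with the key { _fts: "text", ... }.
function findTextIndexName(indexes) {
  for (var i = 0; i < indexes.length; i++) {
    var spec = indexes[i];
    if (spec.key && spec.key._fts === "text") {
      return spec.name;
    }
  }
  return null;
}

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  var textIndexName = findTextIndexName(db.reviews.getIndexes());
  if (textIndexName !== null) {
    db.reviews.dropIndex(textIndexName);
  }
}
```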

text indexes have the following storage requirements and performance costs:

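  • text indexes can be large. They contain one index entry for each unique post-stemmed word in each indexed field for each document inserted.

  • Building a text index is very similar to building a large multikey index and takes longer than building a simple ordered (scalar) index on the same data.

  • When building a large text index on an existing collection, ensure that you have a sufficiently high limit on open file descriptors. See the recommended settings.

  • text indexes affect insertion throughput because MongoDB must add an index entry for each unique post-stemmed word in each indexed field of each new document.

  • Additionally, text indexes do not store phrases or information about the proximity of words in the documents. As a result, phrase queries run much more effectively when the entire collection fits in RAM.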

Text Search Support

The text index supports $text query operations. For examples of text search, see the $text reference page. For examples of $text operations in aggregation pipelines, see Text Search in the Aggregation Pipeline.
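As a minimal sketch of a $text query, the following snippet searches the reviews collection from the earlier examples and returns the matches ordered by text score. The search term "coffee" and the projected field name score are assumptions for illustration.

```javascript
// Search filter: documents whose indexed fields match the term "coffee".
var searchFilter = { $text: { $search: "coffee" } };
// Project the text score into a field named score (the name is arbitrary).
var scoreProjection = { score: { $meta: "textScore" } };
// Sort by the same text score metadata.
var scoreSort = { score: { $meta: "textScore" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.reviews.find(searchFilter, scoreProjection).sort(scoreSort).forEach(printjson);
}
```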