ElasticSearch Cookbook (Second Edition)

Mapping base types

A schema-less approach lets you start inserting data quickly, without being concerned about the field types. However, to achieve better results and performance when indexing, it's necessary to manually define a mapping.

Fine-tuning the mapping has some advantages, as follows:

  • Reduces the size of the index on disk (disabling functionalities for custom fields)
  • Indexes only interesting fields (a general boost to performance)
  • Precooks data for a fast search or real-time analytics (such as aggregations)
  • Correctly defines whether a field must be analyzed in multiple tokens or whether it should be considered as a single token

ElasticSearch also allows you to use base fields with a wide range of configurations.
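As an illustration of the last advantage in the preceding list, the difference between a field analyzed into multiple tokens and one treated as a single token can be sketched outside ElasticSearch. This is a simplified simulation assuming lowercase-and-split behavior; real analysis is performed by Lucene analyzers inside ElasticSearch:

```python
# Simplified simulation of analyzed versus single-token handling.
# Real tokenization is performed by Lucene analyzers, not this code.

def standard_analyze(text):
    # Roughly what StandardAnalyzer does: lowercase and split into tokens.
    return text.lower().split()

def keyword_analyze(text):
    # Roughly what KeywordAnalyzer does: keep the whole value as one token.
    return [text]

print(standard_analyze("Joe Testere"))  # ['joe', 'testere']
print(keyword_analyze("Joe Testere"))   # ['Joe Testere']
```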

Getting ready

You need a working ElasticSearch cluster and an index named test (refer to the Creating an index recipe in Chapter 4, Basic Operations) where you can put the mappings.

How to do it...

Let's use a semi-real-world example of a shop order for our ebay-like shop.

Initially, we define an order record with the fields id, date, customer_id, sent, name, quantity, and vat. This order record must be converted to an ElasticSearch mapping definition:

{
  "order" : {
    "properties" : {
      "id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "date" : {"type" : "date", "store" : "no", "index" : "not_analyzed"},
      "customer_id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "sent" : {"type" : "boolean", "index" : "not_analyzed"},
      "name" : {"type" : "string", "index" : "analyzed"},
      "quantity" : {"type" : "integer", "index" : "not_analyzed"},
      "vat" : {"type" : "double", "index" : "no"}
    }
  }
}

Now the mapping is ready to be put in the index. We'll see how to do this in the Putting a mapping in an index recipe in Chapter 4, Basic Operations.
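As a preview, the request can be sketched in Python. This is a hedged sketch that only builds the request object without sending it: the cluster address is a hypothetical default, and `PUT /<index>/_mapping/<type>` is the mapping API of the ElasticSearch 1.x line that this recipe targets. Sending it is left to a running cluster:

```python
import json
from urllib import request

# Hypothetical cluster address; adjust it to your environment.
ES_HOST = "http://localhost:9200"

# The order mapping defined above, expressed as a Python dictionary.
order_mapping = {
    "order": {
        "properties": {
            "id": {"type": "string", "store": "yes", "index": "not_analyzed"},
            "date": {"type": "date", "store": "no", "index": "not_analyzed"},
            "customer_id": {"type": "string", "store": "yes", "index": "not_analyzed"},
            "sent": {"type": "boolean", "index": "not_analyzed"},
            "name": {"type": "string", "index": "analyzed"},
            "quantity": {"type": "integer", "index": "not_analyzed"},
            "vat": {"type": "double", "index": "no"},
        }
    }
}

def build_put_mapping_request(host, index, doc_type, mapping):
    # PUT /<index>/_mapping/<type> is the put-mapping endpoint in
    # ElasticSearch 1.x, which this book covers.
    url = "%s/%s/_mapping/%s" % (host, index, doc_type)
    body = json.dumps(mapping).encode("utf-8")
    return request.Request(url, data=body, method="PUT")

req = build_put_mapping_request(ES_HOST, "test", "order", order_mapping)
print(req.get_method(), req.full_url)
# To actually send it against a running cluster: request.urlopen(req)
```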

How it works...

The field type must be mapped to one of ElasticSearch's base types, adding options for how the field must be indexed.

The next table is a reference of the mapping types:

Depending on the data type, it's possible to give ElasticSearch explicit directives on how to process the field. The most-used options are as follows:

  • store: This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space, but it reduces computation if you need to extract the field from a document (that is, in scripting and aggregations). The possible values for this option are no and yes (the default value is no).

    Note

    Stored fields are faster than non-stored fields when faceting.

  • index: This configures the field to be indexed (the default value is analyzed). The following are the possible values for this parameter:
    • no: This field is not indexed at all. It is useful to hold data that must not be searchable.
    • analyzed: This field is analyzed with the configured analyzer. It is generally lowercased and tokenized, using the default ElasticSearch configuration (StandardAnalyzer).
    • not_analyzed: This field is processed and indexed, but without being changed by an analyzer. The default ElasticSearch configuration uses the KeywordAnalyzer, which processes the field as a single token.
  • null_value: This defines a default value if the field is missing.
  • boost: This is used to change the importance of a field (the default value is 1.0).
  • index_analyzer: This defines an analyzer to be used in order to process a field. If it is not defined, the analyzer of the parent object is used (the default value is null).
  • search_analyzer: This defines an analyzer to be used during the search. If it is not defined, the analyzer of the parent object is used (the default value is null).
  • analyzer: This sets both the index_analyzer and search_analyzer field to the defined value (the default value is null).
  • include_in_all: This marks the current field to be indexed in the special _all field (a field that contains the concatenated text of all the fields). The default value is true.
  • index_name: This is the name of the field to be stored in the index. This property allows you to rename the field at indexing time, which can be used to manage data migrations over time without breaking the application layer.
  • norms: This controls the Lucene norms, which are used to score queries better. If the field is used only for filtering, it's best practice to disable them in order to reduce resource usage (the default value is true for analyzed fields and false for not_analyzed ones).
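To tie the preceding options together, here is a hedged sketch of a single string field that combines several of them. The "description" field name is illustrative and not part of the recipe's order type, and the `{"enabled": false}` form for norms is the ElasticSearch 1.x syntax:

```python
import json

# Illustrative mapping fragment combining several of the options above.
# The "description" field name is an example, not part of the order type.
description_field = {
    "type": "string",
    "store": "yes",              # keep the original value for fast retrieval
    "index": "analyzed",         # tokenize with the configured analyzer
    "null_value": "n/a",         # value indexed when the field is missing
    "boost": 2.0,                # weigh matches on this field more heavily
    "analyzer": "standard",      # sets both index_analyzer and search_analyzer
    "include_in_all": True,      # also searchable through the _all field
    "norms": {"enabled": False}  # ES 1.x syntax to disable Lucene norms
}

print(json.dumps(description_field, indent=2))
```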

There's more...

In this recipe, we saw the most-used options for the base types, but there are many other options that are useful for advanced usage.

An important parameter, available only for string mapping, is term_vector (the vector of the terms that compose a string; check out the Lucene documentation for further details at http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/Terms.html). The possible values are as follows:

  • no: This is the default value; the term_vector field is skipped
  • yes: This stores the term_vector field
  • with_offsets: This stores the term_vector field with token offsets (the start and end positions in the block of characters)
  • with_positions: This stores the position of the token in the term_vector field
  • with_positions_offsets: This stores all the term_vector data

    Note

    Term vectors allow fast highlighting but consume a lot of disk space due to the storage of additional text information. It's best practice to activate them only in the fields that require highlighting, such as title or document content.
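Following the note above, a hedged sketch of enabling full term vectors only on a field intended for highlighting might look like this (the "content" field name is illustrative, not from the recipe):

```python
import json

# Enable full term vectors only on the field that needs fast highlighting.
# The "content" field name is illustrative.
content_field = {
    "content": {
        "type": "string",
        "index": "analyzed",
        "term_vector": "with_positions_offsets",
    }
}

print(json.dumps(content_field))
```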

See also