Introduction to Elasticsearch
Big Data, October 4, 2016

What is Elasticsearch?

Elasticsearch is an open-source, distributed, real-time document indexer with support for online analytics.

Features at a Glance

Extremely elegant and powerful REST API

• Almost all search engine features are accessible over plain HTTP 

• JSON formatted queries and results

• Can test/experiment/debug with simple tools like curl

Schema-Less Data Model

• Allows great flexibility for application designer

• Can index arbitrary documents right away with no schema metadata 

• Can also tweak type/field mappings for indexes as needed

Fully Distributed and Highly-Available

• Tunable index-level write-path (index) and read-path (query) distribution policies

• P2P node operations with recoverable master node, multicast auto-discovery (configurable) 

• Plays well in VM/Cloud provisioned environments

• Indexes scale horizontally as new nodes are added

• Search Cluster performs automatic failover and recovery

Advanced Search Features

• Full-Text search, autocomplete, facets, real-time search analytics 

• Powerful Query DSL

• Multi-Language Support

• Built-in Tokenizers, Filters and Analyzers for most common search needs

Concepts

Clusters/Nodes:

ES is deployed as a cluster of individual nodes with a single master node. Each node can have many indexes hosted on it.

Documents:

In ES you index documents. Document indexing is a distributed atomic operation with versioning support and transaction logs. Every document is associated with an index and has at least a type and an id.

Indexes:

Similar to a database in traditional relational stores. An index is a logical namespace and has one or more primary shards and zero or more replica shards in the cluster. A single index has mappings, which may define several types stored in the index. Indexes store a mapping between terms and documents.

Mappings:

Mappings are like schemas in a relational database. Mappings define a type within an index along with some index-wide settings. Unlike in a traditional database, types in ES do not have to be explicitly defined ahead of time. Indexes can be created without any explicit mappings, in which case ES infers a mapping from the source documents being indexed.

Types:

Types are like tables in a database. A type defines fields along with optional information about how each field should be indexed. If a request is made to index a document with fields that don’t have explicit type information, ES will attempt to guess an appropriate type based on the indexed data.
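The type guessing described above can be illustrated with a small sketch. This is not how ES implements dynamic mapping; the function names, the sample document and the date format are assumptions made for illustration, and the real inference rules are richer.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a field type name from a JSON value (simplified illustration)."""
    # bool must be checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only ISO-style dates (YYYY-MM-DD) are recognized here
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def infer_mapping(document):
    """Build a field -> type mapping from a single source document."""
    return {field: guess_field_type(value) for field, value in document.items()}


# Hypothetical product document
doc = {
    "name": "Blu-ray player",
    "price": 79.99,
    "in_stock": True,
    "quantity": 12,
    "released": "2016-10-04",
}
mapping = infer_mapping(doc)
print(mapping)
```

The point of the sketch is that the first document to introduce a field effectively fixes that field's type, which is why explicit mappings are still worth defining when a guess could be wrong.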

Queries:

A query is a request to retrieve matching documents (“hits”) from one or more indexes. ES can query for exact term matches or more sophisticated full text searches across several fields or indexes at once. The query options are also quite powerful and support things like sorting, filtering, aggregate statistics, facet counts and much more.
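As a rough illustration, a search request body combining a full-text match, an exact-term filter and a sort might be assembled like this. The field names (description, category, price) are hypothetical, and the bool query with a filter clause assumes ES 2.0 or later.

```python
import json


def build_search_body(text, category, size=10):
    """Assemble a JSON search body: full-text match plus exact-term filter."""
    return {
        "query": {
            "bool": {
                # Full-text search, scored by relevance
                "must": {"match": {"description": text}},
                # Exact term match, does not affect scoring
                "filter": {"term": {"category": category}},
            }
        },
        # Cheapest products first
        "sort": [{"price": {"order": "asc"}}],
        "size": size,
    }


body = build_search_body("wireless headphones", "electronics", size=5)
payload = json.dumps(body)
print(payload)
```

The resulting payload would be sent as the body of a request to an index's _search endpoint.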

Analysis:

Analysis is the process of converting unstructured text into terms. It includes things like ignoring punctuation, dropping common stop words ('the', 'a', 'on', 'and'), normalizing case, breaking a word into ngrams (smaller pieces based on substrings), etc. to support full-text search. In ES, analysis happens at both index time and query time.
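To make the ngram idea concrete, here is a minimal sketch of splitting a word into character ngrams. The function name and the default lengths are assumptions for illustration; the order in which ES's own ngram tokenizer emits tokens may differ.

```python
def char_ngrams(word, min_n=2, max_n=3):
    """Return all substrings of `word` with length between min_n and max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the word
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


grams = char_ngrams("search", 2, 3)
print(grams)
```

Indexing these pieces is what lets a partial input such as "sea" match the full word "search", which is the basis for features like autocomplete.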

Index Layout


Shards and Replicas

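# One primary shard, no replicas (the Content-Type header is required from ES 6.0 on)
curl -XPUT localhost:9200/test -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'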

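# Three primary shards, two replicas of each (delete the index first if it already exists)
curl -XPUT localhost:9200/test -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'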
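The two settings above translate into a total number of shard copies that the cluster has to host. A small sketch of that arithmetic follows; the function names are illustrative only.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has `number_of_replicas` additional copies."""
    return number_of_shards * (1 + number_of_replicas)


def nodes_needed_for_full_allocation(number_of_replicas):
    """A replica is never placed on the same node as its primary,
    so every copy of a shard needs a node of its own."""
    return 1 + number_of_replicas


print(total_shard_copies(1, 0))  # first example: a single shard
print(total_shard_copies(3, 2))  # second example: 3 primaries + 6 replicas
print(nodes_needed_for_full_allocation(2))
```

With fewer nodes than the second function reports, the index still works, but some replicas stay unassigned until more nodes join.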

Shard Placement

By default, documents in ES are assigned to shards by taking the hash of the document id modulo the number of primary shards for the destination index.
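The hash-modulo rule can be sketched in a few lines. Note the assumptions: crc32 stands in for whatever hash function ES actually uses, and the function name and document ids are made up for illustration.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard number from a document id: hash(id) mod #shards."""
    # crc32 is a stand-in hash; unlike Python's built-in hash() on strings,
    # it gives the same value on every run
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


placement = {doc_id: shard_for(doc_id, 3) for doc_id in ["1", "2", "3", "4", "5"]}
print(placement)
```

Because the shard count is part of the formula, changing it would send existing ids to different shards. That is why the number of primary shards is chosen when the index is created.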

Querying is more complex. Potential search hits are generally spread across all the shards of an index, so the query is distributed to every shard and the per-shard results are merged into a single ranked list before being returned (a scatter/gather architecture).
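A toy version of scatter/gather looks like this: each shard computes its own top hits, and a coordinating step merges them. The scoring (a plain term count), the data and the function names are assumptions for illustration, not how ES scores documents.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Scatter step: local top-`size` hits on one shard, scored by term count."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def scatter_gather(shards, term, size):
    """Gather step: merge the per-shard hits into one global top-`size` list."""
    partial = []
    for shard_docs in shards:
        partial.extend(search_shard(shard_docs, term, size))
    return heapq.nlargest(size, partial)


# Three hypothetical shards holding different documents
shards = [
    {"a": "red shoes red laces", "b": "blue shoes"},
    {"c": "red red red hat"},
    {"d": "green hat", "e": "red scarf"},
]
top = scatter_gather(shards, "red", 2)
print(top)
```

Each shard only needs to return its own best `size` hits, since no document outside a shard's local top list can appear in the global top list.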

Routing

curl -XGET 'localhost:9200/test/product/_search?routing=electronics'

Routing can be used to control which shards (and therefore which nodes) receive requests to search for a document. When routing is enabled the user can specify a value at either index time or query time to determine which shards are used for indexing/querying. The same routing value is always routed to the same shard for a given index.
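The effect of a routing value can be sketched as follows. As before, crc32 is only a stand-in hash, and the shard count and function names are assumptions for illustration.

```python
import zlib

NUMBER_OF_SHARDS = 3  # assumed shard count for the sketch


def shard_for(key):
    """Stand-in hash: map a string key to a shard number."""
    return zlib.crc32(key.encode("utf-8")) % NUMBER_OF_SHARDS


def target_shard(doc_id, routing=None):
    """At index time the routing value, when given, replaces the document id."""
    return shard_for(routing if routing is not None else doc_id)


def shards_to_search(routing=None):
    """At query time a routing value narrows the search to a single shard."""
    if routing is None:
        return list(range(NUMBER_OF_SHARDS))
    return [shard_for(routing)]


# Two different documents indexed with the same routing value share a shard
print(target_shard("1", routing="electronics"))
print(target_shard("2", routing="electronics"))
print(shards_to_search(routing="electronics"))
print(shards_to_search())
```

A routed search only finds documents that were indexed with the same routing value, so the value has to be applied consistently on both paths.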

ES Document Model

• Documents are first broken down into terms to build an inverted index that points back to the original source (more on this later)

• Document content is up to you and can be:

 ✴ unstructured (articles/tweets)

 ✴ semi-structured (log entries/emails)

 ✴ structured (patient records/employee records) or any combination thereof

• Queries can look for exact term matches (e.g. productCategory == entertainment) or “best match” based on scoring each document against search criteria

• All documents in ES have an associated index, type and id.
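The inverted index and the two query styles mentioned in this list can be sketched with plain dictionaries. This is a teaching sketch, not Lucene's data structure; the documents, the whitespace tokenization and the scoring (number of matching query terms) are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Exact term match: every document containing the term."""
    return sorted(index.get(term, set()))


def best_match(index, query):
    """Rank documents by how many of the query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    # Highest score first, ties broken by document id
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical documents keyed by id
docs = {
    1: "wireless noise cancelling headphones",
    2: "wireless mouse",
    3: "noise machine",
}
index = build_inverted_index(docs)
print(term_query(index, "wireless"))
print(best_match(index, "wireless noise"))
```

Looking up a term is a single dictionary access regardless of how many documents exist, which is the main reason search engines index by term rather than scanning documents.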

Analyzers

• In ES, analysis is the process of breaking down raw document text into terms that will be indexed in a single Lucene index.

• The role of analysis is performed by Analyzers. Analyzers themselves are broken into logical parts:

 ✴ CharFilter: An optional component that directly modifies the underlying char stream for example to remove HTML tags or convert characters

 ✴ Tokenizer: Component that extracts multiple terms from a single text string

 ✴ TokenFilters: Component that modifies, adds or removes tokens for example to convert all characters to uppercase or remove common stopwords

• Analyzers can be index-specific or shared globally.

• ES ships with several common analyzers. You can also create custom analyzers, each with a single logical name, by specifying the CharFilter, Tokenizer and TokenFilters that comprise it.
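The three-stage structure of an analyzer can be mimicked in a short sketch. None of these functions are ES APIs; the regular expressions, the stop word list and the sample text are assumptions chosen to mirror the CharFilter, Tokenizer and TokenFilter roles.

```python
import re

STOP_WORDS = {"the", "a", "on", "and"}


def strip_html(text):
    """Char filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: normalize case."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Token filter: drop common stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


def make_analyzer(char_filters, tokenizer, token_filters):
    """Compose the three kinds of components into one analyze function."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyze = make_analyzer([strip_html], tokenize, [lowercase, remove_stop_words])
terms = analyze("<p>The Quick fox jumped on a <b>lazy</b> dog and ran</p>")
print(terms)
```

The order of the token filters matters: lowercasing runs first here so that "The" is recognized as the stop word "the".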

Indexing a Document

• This will index all the fields of our document in the index named test with a type mapping of product and an id of 1

• Notice that we did not create any indexes ahead of time or define any information about the schema of the document we just indexed!

• ES returns a response JSON object acknowledging our operation

• Using POST method this time instead of PUT