Elasticsearch Getting Started Series (3): Documents, Indexing, Search, and Aggregations

2022-07-10 08:10:43

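1. Documents

Objects in real-world applications often have complex data structures.

Elasticsearch is document-oriented, meaning it can store entire objects or documents. It does not merely store them, though: it also indexes the content of each document so that it can be searched. In Elasticsearch you can index, search, sort, and filter documents.

Elasticsearch uses JSON as the serialization format for documents.

A user object represented as JSON: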

{

    "email":      "john@smith.com",

    "first_name": "John",

    "last_name":  "Smith",

    "info": {

        "bio":         "Eco-warrior and defender of the weak",

        "age":         25,

        "interests": [ "dolphins", "whales" ]

    },

    "join_date": "2014/05/01"

}

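Although the original user object is complex, its structure and meaning are fully preserved in the JSON.

A simple tutorial to get started: building an employee search directory.

2. Indexing

The first task is to store the employee data, with each document representing one employee. In Elasticsearch, the act of storing data is called indexing, but before indexing we need to decide where the data should be stored.

In Elasticsearch, a document belongs to a type, and types live inside an index.

Elasticsearch compared with a traditional relational database: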

Relational DB -> Databases -> Tables -> Rows      -> Columns

Elasticsearch -> Indices   -> Types  -> Documents -> Fields

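An Elasticsearch cluster can contain multiple indices (databases), each index can contain multiple types (tables), each type holds multiple documents (rows), and each document contains multiple fields (columns).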

By default, every field in a document is indexed (it has an inverted index), and only indexed fields are searchable.
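To make the idea of an inverted index concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch's actual implementation: the `build_inverted_index` helper and the lowercase-and-split-on-whitespace tokenization are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each lowercased word in `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc[field].lower().split():
            index[token].add(doc_id)
    return index


# Two small sample documents, keyed by a hypothetical document ID.
docs = {
    "1": {"about": "I love to go rock climbing"},
    "2": {"about": "I like to collect rock albums"},
}

about_index = build_inverted_index(docs, "about")
print(sorted(about_index["rock"]))      # IDs of documents mentioning "rock"
print(sorted(about_index["climbing"]))  # IDs of documents mentioning "climbing"
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is why indexed fields can be searched quickly.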

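So, to build the employee directory described above, we will do the following:

Index a document for each employee; each document contains all the information about that employee

Give each document the type employee

The employee type belongs to the index megacorp

The megacorp index is stored in the Elasticsearch cluster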

PUT /megacorp/employee/1

{

    "first_name" : "John",

    "last_name" :  "Smith",

    "age" :        25,

    "about" :      "I love to go rock climbing",

    "interests": [ "sports", "music" ]

}

The path /megacorp/employee/1 contains three pieces of information:

megacorp: the index name

employee: the type name

1: the ID of this employee

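The request body (the JSON document) contains all the information about this employee.

We do not need to perform any administrative tasks first, such as creating the index or defining the data type of each field; we can index the document directly. Elasticsearch ships with defaults for everything, so all the administrative work happens transparently.

Add more employees in the same format.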
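Adding the remaining employees follows the same pattern. The sketch below only builds the request line and JSON body for each employee and sends nothing over the network. The `index_request` helper is hypothetical, and the field values for employees 2 and 3 are taken from the search results shown later in this article.

```python
import json


def index_request(index, doc_type, doc_id, document):
    """Return the request line and JSON body for indexing one document."""
    path = f"/{index}/{doc_type}/{doc_id}"
    return f"PUT {path}", json.dumps(document)


employees = {
    1: {"first_name": "John", "last_name": "Smith", "age": 25,
        "about": "I love to go rock climbing", "interests": ["sports", "music"]},
    2: {"first_name": "Jane", "last_name": "Smith", "age": 32,
        "about": "I like to collect rock albums", "interests": ["music"]},
    3: {"first_name": "Douglas", "last_name": "Fir", "age": 35,
        "about": "I like to build cabinets", "interests": ["forestry"]},
}

for emp_id, employee in employees.items():
    request_line, body = index_request("megacorp", "employee", emp_id, employee)
    print(request_line)
    print(body)
```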

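3. Retrieval and Search

Elasticsearch now has some data stored in it.

①: Retrieve a single employee: issue an HTTP GET request and specify the document's "address", that is, its index, type, and ID.

GET /megacorp/employee/1

The response contains some metadata about the document: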

{

  "_index" :   "megacorp",

  "_type" :    "employee",

  "_id" :      "1",

  "_version" : 1,

  "found" :    true,

  "_source" :  {

      "first_name" :  "John",

      "last_name" :   "Smith",

      "age" :         25,

      "about" :       "I love to go rock climbing",

      "interests":  [ "sports", "music" ]

  }

}

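We retrieve a document with the HTTP GET method. In the same way, we can delete a document with DELETE and check whether a document exists with HEAD. To update an existing document, we simply PUT it again.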
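The four verbs can be pictured with a small in-memory stand-in. This is only a toy model of the semantics described above, not a client for Elasticsearch: the `ToyDocumentStore` class and its method names are hypothetical, and the version counter simply mirrors the `_version` field seen in the response.

```python
class ToyDocumentStore:
    """In-memory stand-in that mimics PUT / GET / HEAD / DELETE on document paths."""

    def __init__(self):
        self._docs = {}  # path -> (version, source)

    def put(self, path, source):
        # A first PUT creates version 1; a repeated PUT replaces the document
        # and bumps the version.
        version = self._docs[path][0] + 1 if path in self._docs else 1
        self._docs[path] = (version, source)
        return version

    def get(self, path):
        if path not in self._docs:
            return {"found": False}
        version, source = self._docs[path]
        return {"found": True, "_version": version, "_source": source}

    def head(self, path):
        # HEAD only reports whether the document exists.
        return path in self._docs

    def delete(self, path):
        return self._docs.pop(path, None) is not None


store = ToyDocumentStore()
path = "/megacorp/employee/1"
store.put(path, {"first_name": "John", "age": 25})
store.put(path, {"first_name": "John", "age": 26})  # update by PUTting again
print(store.get(path))
print(store.head(path))
store.delete(path)
print(store.head(path))
```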

②: Search for all employees

GET /megacorp/employee/_search

By default, the first 10 results are returned:

{

   "took":      6,

   "timed_out": false,

   "_shards": { ... },

   "hits": {

      "total":      3,

      "max_score":  1,

      "hits": [

         {

            "_index":         "megacorp",

            "_type":          "employee",

            "_id":            "3",

            "_score":         1,

            "_source": {

               "first_name":  "Douglas",

               "last_name":   "Fir",

               "age":         35,

               "about":       "I like to build cabinets",

               "interests": [ "forestry" ]

            }

         },

         {

            "_index":         "megacorp",

            "_type":          "employee",

            "_id":            "1",

            "_score":         1,

            "_source": {

               "first_name":  "John",

               "last_name":   "Smith",

               "age":         25,

               "about":       "I love to go rock climbing",

               "interests": [ "sports", "music" ]

            }

         },

         {

            "_index":         "megacorp",

            "_type":          "employee",

            "_id":            "2",

            "_score":         1,

            "_source": {

               "first_name":  "Jane",

               "last_name":   "Smith",

               "age":         32,

               "about":       "I like to collect rock albums",

               "interests": [ "music" ]

            }

         }

      ]

   }

}

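The response not only tells us which documents matched, it also includes the full content of those documents.

③: Search for employees whose last name is Smith. For this we use a query-string search: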

GET /megacorp/employee/_search?q=last_name:Smith

The request still uses the _search endpoint, and the query is passed in the q= parameter:

{

   ...

   "hits": {

      "total":      2,

      "max_score":  0.30685282,

      "hits": [

         {

            ...

            "_source": {

               "first_name":  "John",

               "last_name":   "Smith",

               "age":         25,

               "about":       "I love to go rock climbing",

               "interests": [ "sports", "music" ]

            }

         },

         {

            ...

            "_source": {

               "first_name":  "Jane",

               "last_name":   "Smith",

               "age":         32,

               "about":       "I like to collect rock albums",

               "interests": [ "music" ]

            }

         }

      ]

   }

}

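④: Querying with the DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the Query DSL, which lets you build more complex and powerful queries.

The DSL (Domain Specific Language) is expressed as a JSON request body. For example, the earlier search for the last name Smith becomes: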

GET /megacorp/employee/_search

{

    "query" : {

        "match" : {

            "last_name" : "Smith"

        }

    }

}

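The result is the same as before. The only difference is that the query is passed in the request body rather than as a query-string parameter, and that it uses a match clause.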
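The two ways of expressing the same search can be put side by side. This sketch only builds the URL and the request body and does not contact a cluster; the helper names `query_string_search` and `match_query` are hypothetical.

```python
import json
from urllib.parse import urlencode


def query_string_search(field, value):
    """Build the query-string form of the search URL."""
    # safe=":" keeps the colon literal so the URL reads as in the article.
    return "/megacorp/employee/_search?" + urlencode({"q": f"{field}:{value}"}, safe=":")


def match_query(field, value):
    """Build the equivalent Query DSL request body as a Python dict."""
    return {"query": {"match": {field: value}}}


print(query_string_search("last_name", "Smith"))
print(json.dumps(match_query("last_name", "Smith"), indent=4))
```

Because the DSL body is ordinary nested data, it can be composed programmatically, which is harder to do cleanly with a query string.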

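⑤: More complex queries

Modify the previous example to find employees whose last name is Smith and who are older than 30. To do this, we add a filter to the query.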

GET /megacorp/employee/_search

{

    "query" : {

        "filtered" : {

            "filter" : {

                "range" : {

                    "age" : { "gt" : 30 } <1>

                }

            },

            "query" : {

                "match" : {

                    "last_name" : "smith" <2>

                }

            }

        }

    }

}

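<1> This part of the query is a range filter; it finds all documents whose age is greater than 30.

<2> This part of the query is the same match clause used earlier.

The result is: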

{

   ...

   "hits": {

      "total":      1,

      "max_score":  0.30685282,

      "hits": [

         {

            ...

            "_source": {

               "first_name":  "Jane",

               "last_name":   "Smith",

               "age":         32,

               "about":       "I like to collect rock albums",

               "interests": [ "music" ]

            }

         }

      ]

   }

}
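The logic of combining a range filter with a match clause can be mimicked over plain Python data. This is an in-memory illustration of the two conditions, not how Elasticsearch evaluates them; the helper names are hypothetical, and the case-insensitive comparison is an assumption that stands in for the way "smith" matches "Smith" above.

```python
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]


def age_greater_than(employee, threshold):
    """Counterpart of the range filter: age > threshold."""
    return employee["age"] > threshold


def matches_last_name(employee, name):
    """Counterpart of the match clause on last_name, ignoring case."""
    return employee["last_name"].lower() == name.lower()


hits = [
    e for e in employees
    if age_greater_than(e, 30) and matches_last_name(e, "smith")
]
print([e["first_name"] for e in hits])
```

Only Jane satisfies both conditions: John is too young, and Douglas has a different last name. Note that the `filtered` query belongs to older Elasticsearch releases; later versions express the same combination with a `bool` query that has `must` and `filter` clauses.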

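⑥: Full-text search

The searches so far have been simple: looking up a specific name, filtering by age. Now let's look at full-text search.

For example, let's search for all employees who like "rock climbing":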

GET /megacorp/employee/_search

{

    "query" : {

        "match" : {

            "about" : "rock climbing"

        }

    }

}

This uses the same match query as before.

The result is:

{

   ...

   "hits": {

      "total":      2,

      "max_score":  0.16273327,

      "hits": [

         {

            ...

            "_score":         0.16273327, <1>

            "_source": {

               "first_name":  "John",

               "last_name":   "Smith",

               "age":         25,

               "about":       "I love to go rock climbing",

               "interests": [ "sports", "music" ]

            }

         },

         {

            ...

            "_score":         0.016878016, <2>

            "_source": {

               "first_name":  "Jane",

               "last_name":   "Smith",

               "age":         32,

               "about":       "I like to collect rock albums",

               "interests": [ "music" ]

            }

         }

      ]

   }

}

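<1> <2> The relevance scores of the results.

By default, Elasticsearch sorts results by relevance score, that is, by how well each document matches the query. This is why Jane's document is returned even though it contains only "rock" and not "climbing": it matches, but with a lower score than John's.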
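The ranking idea can be illustrated with a deliberately crude score: the fraction of query words that appear in the text. This is a toy measure and an assumption made for illustration; Elasticsearch's real scoring is more sophisticated, and the numbers below are not comparable to the `_score` values in the response.

```python
def toy_score(query, text):
    """Fraction of query terms that appear in the text (a crude relevance measure)."""
    query_terms = query.lower().split()
    text_terms = set(text.lower().split())
    matched = sum(1 for term in query_terms if term in text_terms)
    return matched / len(query_terms)


abouts = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
    "Douglas": "I like to build cabinets",
}

scored = [(name, toy_score("rock climbing", text)) for name, text in abouts.items()]
# Keep only documents that matched at least one term, best match first.
ranked = sorted(
    (item for item in scored if item[1] > 0),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

John matches both words and ranks first, Jane matches one and ranks second, and Douglas matches none and is dropped, which is the same ordering as in the response above.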

⑦: Phrase search

To match an exact sequence of words, that is, a phrase, simply change the match query to a match_phrase query:

GET /megacorp/employee/_search

{

    "query" : {

        "match_phrase" : {

            "about" : "rock climbing"

        }

    }

}

The result is:

{

   ...

   "hits": {

      "total":      1,

      "max_score":  0.23013961,

      "hits": [

         {

            ...

            "_score":         0.23013961,

            "_source": {

               "first_name":  "John",

               "last_name":   "Smith",

               "age":         25,

               "about":       "I love to go rock climbing",

               "interests": [ "sports", "music" ]

            }

         }

      ]

   }

}
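The difference from a plain match is that the words must appear next to each other and in order. A minimal sketch of that check, using a hypothetical `contains_phrase` helper and simple whitespace tokenization as assumptions:

```python
def contains_phrase(phrase, text):
    """True if the words of `phrase` appear consecutively and in order in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


print(contains_phrase("rock climbing", "I love to go rock climbing"))
print(contains_phrase("rock climbing", "I like to collect rock albums"))
```

Jane's text contains "rock" but not the full phrase, so she no longer appears in the result.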

⑧: Highlighting our searches

Add a highlight parameter to the previous query:

GET /megacorp/employee/_search

{

    "query" : {

        "match_phrase" : {

            "about" : "rock climbing"

        }

    },

    "highlight": {

        "fields" : {

            "about" : {}

        }

    }

}

The result is as follows, with the matched words wrapped in <em> tags:

{

   ...

   "hits": {

      "total":      1,

      "max_score":  0.23013961,

      "hits": [

         {

            ...

            "_score":         0.23013961,

            "_source": {

               "first_name":  "John",

               "last_name":   "Smith",

               "age":         25,

               "about":       "I love to go rock climbing",

               "interests": [ "sports", "music" ]

            },

            "highlight": {

               "about": [

                  "I love to go <em>rock</em> <em>climbing</em>" <1>

               ]

            }

         }

      ]

   }

}

<1> The highlighted fragment of the original text
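The effect of highlighting can be reproduced on a single string with a regular expression. This is only a sketch of the output format, with a hypothetical `highlight` helper; it says nothing about how Elasticsearch produces its fragments internally.

```python
import re


def highlight(query, text):
    """Wrap each whole-word occurrence of a query term in <em> tags."""
    terms = [re.escape(term) for term in query.split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)


print(highlight("rock climbing", "I love to go rock climbing"))
```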

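4. Aggregations

Elasticsearch has a feature called aggregations, which lets you generate sophisticated analytics over your data. It is similar to GROUP BY in SQL, but more powerful.

For example, let's find the most popular interests among the employees: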

GET /megacorp/employee/_search

{

  "aggs": {

    "all_interests": {

      "terms": { "field": "interests" }

    }

  }

}

Result:

{

   ...

   "hits": { ... },

   "aggregations": {

      "all_interests": {

         "buckets": [

            {

               "key":       "music",

               "doc_count": 2

            },

            {

               "key":       "forestry",

               "doc_count": 1

            },

            {

               "key":       "sports",

               "doc_count": 1

            }

         ]

      }

   }

}

The result shows how many employees share each interest: two list music, one lists forestry, and one lists sports.
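What the terms aggregation computes here is essentially a count per distinct value. A minimal sketch over in-memory data, with a hypothetical `terms_aggregation` helper; sorting ties by key is an assumption chosen so the output order matches the response above.

```python
from collections import Counter

employees = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]


def terms_aggregation(docs, field):
    """Count how many documents list each value of `field`, most common first."""
    # Each document lists a given interest at most once, so counting values
    # is the same as counting documents here.
    counts = Counter(value for doc in docs for value in doc[field])
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_aggregation(employees, "interests"):
    print(bucket)
```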

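If we want to add a condition, for example finding the most popular interests among employees whose last name is Smith, we just add a query: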

GET /megacorp/employee/_search

{

  "query": {

    "match": {

      "last_name": "smith"

    }

  },

  "aggs": {

    "all_interests": {

      "terms": {

        "field": "interests"

      }

    }

  }

}

Result:

...

  "all_interests": {

     "buckets": [

        {

           "key": "music",

           "doc_count": 2

        },

        {

           "key": "sports",

           "doc_count": 1

        }

     ]

  }

Aggregations also allow hierarchical rollups. For example, compute the average age of the employees in each interest bucket:

GET /megacorp/employee/_search

{

    "aggs" : {

        "all_interests" : {

            "terms" : { "field" : "interests" },

            "aggs" : {

                "avg_age" : {

                    "avg" : { "field" : "age" }

                }

            }

        }

    }

}

Result:

...

  "all_interests": {

     "buckets": [

        {

           "key": "music",

           "doc_count": 2,

           "avg_age": {

              "value": 28.5

           }

        },

        {

           "key": "forestry",

           "doc_count": 1,

           "avg_age": {

              "value": 35

           }

        },

        {

           "key": "sports",

           "doc_count": 1,

           "avg_age": {

              "value": 25

           }

        }

     ]

  }
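The nested aggregation groups employees by interest and then averages the ages within each group. A sketch of that two-level computation over in-memory data, using a hypothetical `average_age_by_interest` helper:

```python
from collections import defaultdict

employees = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def average_age_by_interest(docs):
    """Group ages by interest (outer level), then average each group (inner level)."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in doc["interests"]:
            ages[interest].append(doc["age"])
    return {interest: sum(values) / len(values) for interest, values in ages.items()}


for interest, avg_age in average_age_by_interest(employees).items():
    print(interest, avg_age)
```

The music bucket averages John (25) and Jane (32) to 28.5, matching the response above.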
