[Feature][Elasticsearch] Why do documents with identical content get different scores from a match query?

Conclusion

Because the score is computed within a single shard, not across the whole index.

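A tf/idf-style score (the explain output in step 5 shows the BM25 formula) depends on docFreq and docCount.
Both statistics are collected per shard, so documents that live on different shards will rarely see the same values, and their final scores differ as a result.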
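
As a rough illustration of why per-shard statistics matter, the sketch below evaluates the idf formula that appears in the explain output of step 5 for two shard-level (docFreq, docCount) pairs. The function name `bm25_idf` and the shard layout are assumptions made for this example; they are not taken from Elasticsearch itself.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf as printed by explain:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical layout: every document on a shard contains the term,
# so docFreq equals docCount on each shard.
shard_stats = {
    "shard holding 1 doc": (1, 1),
    "shard holding 2 docs": (2, 2),
}

for name, (doc_freq, doc_count) in shard_stats.items():
    print(f"{name}: idf = {bm25_idf(doc_freq, doc_count):.7f}")
```

The same term gets a smaller idf on the shard that holds more matching documents, even though the documents themselves are identical.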

Test steps

1. Create the index with a custom mapping

PUT luzhe
{
  "settings": {
    "index": {
      "number_of_shards": 2
    },
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "my_ngram"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "analyzer": "ngram_analyzer",
          "term_vector": "with_positions_offsets",
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 50,
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Notes:

  • index.number_of_shards = 2, so that the shards end up holding different numbers of documents as quickly as possible.

2. Insert documents (3 are generally enough)

PUT luzhe/_doc/1
{
  "title": "大学语文"
}

PUT luzhe/_doc/2
{
  "title": "大学语文"
}

PUT luzhe/_doc/3
{
  "title": "大学语文"
}

3. Check the document count of each shard

GET luzhe/_doc/_search?preference=_shards:0
GET luzhe/_doc/_search?preference=_shards:1

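Note: in my test environment, shard 0 happened to hold one document and shard 1 held two.
Any distribution works as long as both shards are non-empty and hold different numbers of documents.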

4. Test

GET luzhe/_search
{
  "query": {
    "match": {
      "title": "语文"
    }
  }
}

Output:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "大学语文"
        }
      },
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "大学语文"
        }
      },
      {
        "_index" : "luzhe",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "大学语文"
        }
      }
    ]
  }
}

5. Explain the score

GET luzhe/_doc/3/_explain
{
  "query": {
    "match": {
      "title": "语文"
    }
  }
}

Output:

{
  "_index" : "luzhe",
  "_type" : "_doc",
  "_id" : "3",
  "matched" : true,
  "explanation" : {
    "value" : 0.5753642,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 0.2876821,
        "description" : "weight(title:语 in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.2876821,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 1.0,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 0.2876821,
        "description" : "weight(title:文 in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.2876821,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 1.0,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

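Run the same explain request for the other ids and you will find that docFreq and docCount differ, depending on which shard the document lives on.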
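
The numbers in the explain output can be reproduced by hand. The sketch below is a minimal re-computation, not Elasticsearch code: the helper names are made up for this example, and the values docFreq = 2 and docCount = 2 for the shard holding documents 1 and 2 are an assumption inferred from the note in step 3 and the scores in step 4 (the explain output above only shows document 3).

```python
import math

K1 = 1.2   # "parameter k1" from the explain output
B = 0.75   # "parameter b" from the explain output


def idf(doc_freq, doc_count):
    # idf formula as printed in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length):
    # tfNorm formula as printed in the explain output
    return (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_length / avg_field_length)
    )


def match_score(n_terms, doc_freq, doc_count, freq=1.0,
                field_length=4.0, avg_field_length=4.0):
    # The match query on "语文" is analyzed into two terms (语, 文);
    # the explain output sums one idf * tfNorm product per term.
    per_term = idf(doc_freq, doc_count) * tf_norm(
        freq, field_length, avg_field_length
    )
    return n_terms * per_term


# Shard with one document (values taken from the explain output for id 3).
score_single_doc_shard = match_score(n_terms=2, doc_freq=1, doc_count=1)

# Shard with two documents (assumed values, see the lead-in).
score_two_doc_shard = match_score(n_terms=2, doc_freq=2, doc_count=2)

print(score_single_doc_shard, score_two_doc_shard)
```

Both results should land very close to the _score values in step 4; small differences in the last digits are expected because this sketch uses double precision.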
