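I would like to be able to query for text but also only retrieve the results with the maximum value of a certain integer field in the data. I have read the docs about aggregations and filters, and I don't quite see what I am looking for.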
For instance, I have some repeating data getting indexed that is the same except for an integer field. Let's call this field lastseen.
So, as an example, given this data put into elasticsearch:
# these two are the same except for the "lastseen" field
curl -XPOST localhost:9200/myindex/myobject -d '{
  "field1": "dinner carrot potato broccoli",
  "field2": "something here",
  "lastseen": 1000
}'
curl -XPOST localhost:9200/myindex/myobject -d '{
  "field1": "dinner carrot potato broccoli",
  "field2": "something here",
  "lastseen": 100
}'

# and these two are the same except for the "lastseen" field
curl -XPOST localhost:9200/myindex/myobject -d '{
  "field1": "fish chicken something",
  "field2": "dinner",
  "lastseen": 2000
}'
curl -XPOST localhost:9200/myindex/myobject -d '{
  "field1": "fish chicken something",
  "field2": "dinner",
  "lastseen": 200
}'
If I query for "dinner":
curl -XPOST localhost:9200/myindex/_search -d '{
  "query": {
    "query_string": {
      "query": "dinner"
    }
  }
}'
I get 4 results back. I would like a filter so that I only get two results: for each set of duplicates, only the item with the maximum lastseen field.
This is obviously not right, but hopefully it gives you an idea of what I am after:

{
  "query": {
    "query_string": {
      "query": "dinner"
    }
  },
  "filter": {
    "max": "lastseen"
  }
}
The results would look something like this:

"hits": [
  {
    ...
    "_source": {
      "field1": "dinner carrot potato broccoli",
      "field2": "something here",
      "lastseen": 1000
    }
  },
  {
    ...
    "_source": {
      "field1": "fish chicken something",
      "field2": "dinner",
      "lastseen": 2000
    }
  }
]
Update 1: I tried creating a mapping that excludes lastseen from being included in the index. This did not work. I still get all 4 results back.
curl -XPOST localhost:9200/myindex -d '{
  "mappings": {
    "myobject": {
      "properties": {
        "lastseen": {
          "type": "long",
          "store": "yes",
          "include_in_all": false
        }
      }
    }
  }
}'
Update 2: I tried deduplication with the agg scheme listed here, but it did not work. More importantly, I don't see a way to combine that with a keyword search.
Not ideal, but I think it gets you what you need.
Assuming field1 is the field you use to define "duplicate" documents, change its mapping like this:
PUT /lastseen
{
  "mappings": {
    "test": {
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}
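Meaning, you add a .raw subfield that is not_analyzed, which means it is indexed as is, with no analysis and no splitting into terms. This is what makes the somewhat "duplicate documents spotting" possible: two documents count as duplicates when their field1.raw values are exactly equal.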
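To see why the subfield matters, here is a small Python sketch of the difference. It is not Elasticsearch code: rough_analyze is a hypothetical stand-in that only lowercases and splits on whitespace, which is a simplification of what a real analyzer does.

```python
from collections import Counter

def rough_analyze(text):
    """Hypothetical, simplified analyzer: lowercase and split on whitespace."""
    return text.lower().split()

# The four field1 values from the question.
docs = [
    "dinner carrot potato broccoli",
    "dinner carrot potato broccoli",
    "fish chicken something",
    "fish chicken something",
]

# Bucketing on the analyzed field groups documents by individual terms.
analyzed_buckets = Counter(term for d in docs for term in set(rough_analyze(d)))

# Bucketing on the not_analyzed subfield groups by the whole original string.
raw_buckets = Counter(docs)

print(sorted(analyzed_buckets))  # one bucket per word
print(sorted(raw_buckets))       # one bucket per distinct field1 value
```

With the analyzed field you would get a bucket per word, while the raw value gives one bucket per group of duplicates, which is what the aggregation needs.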
Then, you need to use a terms aggregation on field1.raw (for the duplicates) and a top_hits sub-aggregation to get a single document for each field1 value:
GET /lastseen/test/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "dinner"
    }
  },
  "aggs": {
    "field1_unique": {
      "terms": {
        "field": "field1.raw",
        "size": 2
      },
      "aggs": {
        "first_one": {
          "top_hits": {
            "size": 1,
            "sort": [{"lastseen": {"order": "desc"}}]
          }
        }
      }
    }
  }
}
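If you build this request from code, a minimal sketch could look like the following. The function name build_dedup_search and its parameters are made up for illustration, and the body is only printed here, not sent to a cluster.

```python
import json

def build_dedup_search(query_text, group_field, sort_field, max_groups=10):
    """Build the search body: keyword query plus one top hit per group.

    Hypothetical helper; the aggregation names match the request above.
    """
    return {
        "size": 0,  # skip the regular hits, only the aggregation is needed
        "query": {"query_string": {"query": query_text}},
        "aggs": {
            "field1_unique": {
                # one bucket per distinct value of group_field
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    "first_one": {
                        # keep only the document with the highest sort_field
                        "top_hits": {
                            "size": 1,
                            "sort": [{sort_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }

body = build_dedup_search("dinner", "field1.raw", "lastseen", max_groups=2)
print(json.dumps(body, indent=2))
```

Note that the "size" of the terms aggregation caps how many groups come back, so with more than two distinct field1 values matching the query you would want a larger value than 2.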
Also, the single document returned by top_hits is the one with the highest lastseen, which is what "sort": [{"lastseen": {"order":"desc"}}] takes care of.
The results you get look like this (under aggregations, not under hits):
...
  "aggregations": {
    "field1_unique": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "dinner carrot potato broccoli",
          "doc_count": 2,
          "first_one": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "lastseen",
                  "_type": "test",
                  "_id": "AU60ZObtjKWeJgeyudI-",
                  "_score": null,
                  "_source": {
                    "field1": "dinner carrot potato broccoli",
                    "field2": "something here",
                    "lastseen": 1000
                  },
                  "sort": [
                    1000
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "fish chicken something",
          "doc_count": 2,
          "first_one": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "lastseen",
                  "_type": "test",
                  "_id": "AU60ZObtjKWeJgeyudJA",
                  "_score": null,
                  "_source": {
                    "field1": "fish chicken something",
                    "field2": "dinner",
                    "lastseen": 2000
                  },
                  "sort": [
                    2000
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
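Since the documents come back nested inside the buckets rather than in the top-level hits, you will probably want to flatten them on the client side. Here is a rough sketch of that, assuming the response has already been parsed into a Python dict. The helper name latest_per_group is made up, and sample_response is a trimmed copy of the output above.

```python
def latest_per_group(response, agg_name="field1_unique", hits_name="first_one"):
    """Collect the _source of each top hit, one per terms bucket.

    Hypothetical helper; agg_name and hits_name must match the names
    used in the search request.
    """
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    docs = []
    for bucket in buckets:
        for hit in bucket[hits_name]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs

# Trimmed copy of the response shown above (only the fields used here).
sample_response = {
    "aggregations": {
        "field1_unique": {
            "buckets": [
                {
                    "key": "dinner carrot potato broccoli",
                    "doc_count": 2,
                    "first_one": {"hits": {"hits": [{"_source": {
                        "field1": "dinner carrot potato broccoli",
                        "field2": "something here",
                        "lastseen": 1000,
                    }}]}},
                },
                {
                    "key": "fish chicken something",
                    "doc_count": 2,
                    "first_one": {"hits": {"hits": [{"_source": {
                        "field1": "fish chicken something",
                        "field2": "dinner",
                        "lastseen": 2000,
                    }}]}},
                },
            ]
        }
    }
}

latest = latest_per_group(sample_response)
for doc in latest:
    print(doc["field1"], doc["lastseen"])
```

This gives you the flat list of two documents that the question asked for, at the cost of one extra step after the search.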