I want to get the number of groups that satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.
SELECT COUNT(*) FROM ( SELECT senderResellerId, SUM(requestAmountValue) AS t_amount FROM transactions GROUP BY senderResellerId HAVING t_amount > 10000 ) AS dum;
So far I can group by senderResellerId with a terms aggregation. However, when I apply a filter, it does not work as expected.
Elasticsearch request
{ "aggregations": { "reseller_sale_sum": { "aggs": { "sales": { "aggregations": { "reseller_sale": { "sum": { "field": "requestAmountValue" } } }, "filter": { "range": { "reseller_sale": { "gte": 10000 } } } } }, "terms": { "field": "senderResellerId", "order": { "sales>reseller_sale": "desc" }, "size": 5 } } }, "ext": {}, "query": { "match_all": {} }, "size": 0 }
Actual response
{ "took" : 21, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 150824, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "reseller_sale_sum" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 149609, "buckets" : [ { "key" : "RES0000000004", "doc_count" : 8, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000005", "doc_count" : 39, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000006", "doc_count" : 57, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000007", "doc_count" : 134, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } } ] } } }
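As you can see from the response above, it returns the resellers, but the reseller_sale aggregation result is zero for every bucket.
More details are here.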
You can use one of the pipeline aggregations, namely the bucket selector aggregation. The query would look like this:
POST my_index/tdrs/_search { "aggregations": { "reseller_sale_sum": { "aggregations": { "sales": { "sum": { "field": "requestAmountValue" } }, "max_sales": { "bucket_selector": { "buckets_path": { "var1": "sales" }, "script": "params.var1 > 10000" } } }, "terms": { "field": "senderResellerId", "order": { "sales": "desc" }, "size": 5 } } }, "size": 0 }
After putting the following documents into the index:
"hits": [ { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh5F-dSw48Z0DWDys", "_score": 1, "_source": { "requestAmountValue": 7000, "senderResellerId": "ID_1" } }, { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh684dSw48Z0DWDyt", "_score": 1, "_source": { "requestAmountValue": 5000, "senderResellerId": "ID_1" } }, { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh8TBdSw48Z0DWDyu", "_score": 1, "_source": { "requestAmountValue": 1000, "senderResellerId": "ID_2" } } ]
The result of the query is:
"aggregations": { "reseller_sale_sum": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "ID_1", "doc_count": 2, "sales": { "value": 12000 } } ] } }
That is, only those senderResellerId values whose cumulative sales are >10000.
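To make the behaviour concrete, here is a minimal Python sketch that mimics, outside Elasticsearch, what the terms, sum, and bucket_selector combination does with the three sample documents. The function names are made up for illustration, and this is only a model of the logic, not how Elasticsearch implements it.

```python
from collections import defaultdict

# The three sample documents shown above.
docs = [
    {"senderResellerId": "ID_1", "requestAmountValue": 7000},
    {"senderResellerId": "ID_1", "requestAmountValue": 5000},
    {"senderResellerId": "ID_2", "requestAmountValue": 1000},
]

def sum_by_reseller(documents):
    """Mimic the terms + sum aggregations: one total per senderResellerId."""
    totals = defaultdict(int)
    for doc in documents:
        totals[doc["senderResellerId"]] += doc["requestAmountValue"]
    return dict(totals)

def select_buckets(totals, threshold):
    """Mimic bucket_selector: keep only buckets whose total exceeds threshold."""
    return {key: value for key, value in totals.items() if value > threshold}

totals = sum_by_reseller(docs)
selected = select_buckets(totals, 10000)
print(selected)
```

ID_2 has a total of only 1000, so it is dropped, which matches the single ID_1 bucket in the response above.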
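To achieve the equivalent of SELECT COUNT(*) FROM (... HAVING), you can combine a bucket script aggregation with a sum bucket aggregation. There does not seem to be a direct way to count how many buckets bucket_selector actually selected, but we can define a bucket_script that produces 0 or 1 depending on the condition, and a sum_bucket that produces the sum of those values: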
POST my_index/tdrs/_search { "aggregations": { "reseller_sale_sum": { "aggregations": { "sales": { "sum": { "field": "requestAmountValue" } }, "max_sales": { "bucket_script": { "buckets_path": { "var1": "sales" }, "script": "if (params.var1 > 10000) { 1 } else { 0 }" } } }, "terms": { "field": "senderResellerId", "order": { "sales": "desc" } } }, "max_sales_stats": { "sum_bucket": { "buckets_path": "reseller_sale_sum>max_sales" } } }, "size": 0 }
The output will be:
"aggregations": { "reseller_sale_sum": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ ... ] }, "max_sales_stats": { "value": 1 } }
The desired number of buckets is in max_sales_stats.value.
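The 0/1 trick may be easier to see in plain Python. The sketch below assumes the per-bucket totals from the sample documents and uses illustrative function names. It only models the arithmetic that bucket_script and sum_bucket perform here.

```python
# Per-bucket totals as the "sales" sum aggregation would report them
# for the sample documents (ID_1: 7000 + 5000, ID_2: 1000).
bucket_totals = {"ID_1": 12000, "ID_2": 1000}

def flag_bucket(total, threshold=10000):
    """Mimic the bucket_script: 1 if the bucket passes the HAVING condition, else 0."""
    return 1 if total > threshold else 0

def count_matching(totals, threshold=10000):
    """Mimic sum_bucket over the flags: the sum of 0/1 values is the bucket count."""
    return sum(flag_bucket(total, threshold) for total in totals.values())

matching = count_matching(bucket_totals)
print(matching)
```

Summing the flags gives the same number as counting the buckets that pass the condition, which is why max_sales_stats.value can stand in for COUNT(*).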
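I have to point out two things:
Pipeline aggregations work on the output produced by other aggregations rather than on document sets, adding information to the output tree.
This means that bucket_selector is applied after, and on the result of, the terms aggregation on senderResellerId. For example, if there are more distinct senderResellerId values than the size defined for the terms aggregation, you will not get all the ids in the collection with sum(sales) > 10000, but only those that appear in the output of the terms aggregation. Consider using sorting and/or setting a sufficiently large size parameter.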
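This also applies to the second case, SELECT COUNT(*) FROM (... HAVING), which counts only those buckets that are actually present in the aggregation output.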
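The effect of size on the count can be illustrated with a small Python sketch. The reseller ids and totals below are hypothetical, and the model ignores shard-level approximation in the terms aggregation. It only shows that buckets cut off by size never reach the pipeline step.

```python
# Hypothetical per-reseller totals; four of the five exceed 10000.
totals = {"R1": 50000, "R2": 40000, "R3": 30000, "R4": 20000, "R5": 500}

def terms_buckets(all_totals, size):
    """Mimic a terms aggregation ordered by the sum, descending, cut off at `size`."""
    ordered = sorted(all_totals.items(), key=lambda item: item[1], reverse=True)
    return ordered[:size]

def count_over(buckets, threshold=10000):
    """Count only the buckets that the terms aggregation actually returned."""
    return sum(1 for _, total in buckets if total > threshold)

undercount = count_over(terms_buckets(totals, size=2))
full_count = count_over(terms_buckets(totals, size=len(totals)))
print(undercount, full_count)
```

With size=2 only two buckets are ever seen, so the count comes out as 2 even though four resellers qualify. Raising size to cover all distinct ids gives the full count.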
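If this query is too heavy or the number of buckets is too large, consider denormalizing your data or storing this sum directly in the document, so that you can use a plain range query to achieve your goal.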
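As a rough sketch of that last option, the Python snippet below builds a search body with a plain range query. It assumes a separate index with one document per reseller and a precomputed total in a field called totalRequestAmount. Both the layout and the field name are hypothetical and do not come from the question.

```python
import json

def build_range_search(field, threshold):
    """Build a search body matching documents whose stored total exceeds threshold."""
    return {
        "size": 0,  # no hits are returned, only the total count is of interest
        "query": {"range": {field: {"gt": threshold}}},
    }

# "totalRequestAmount" is a hypothetical field holding the precomputed sum per reseller.
body = build_range_search("totalRequestAmount", 10000)
print(json.dumps(body, indent=2))
```

Under that assumed layout, the total hit count of the search would play the role of COUNT(*), with no aggregation involved.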
Hope that helps!