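I want to do fuzzy matching on emails or phone numbers in Elasticsearch. For example:
match all emails ending with @gmail.com
or
match all phone numbers starting with 136.
I know I can use a wildcard query: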
{ "query": { "wildcard" : { "email": "*gmail.com" } } }
But the performance is very poor. I also tried using regexp:
{"query": {"regexp": {"email": {"value": "*163\.com*"} } } }
But it doesn't work.
Is there a better way to do this?
curl -XGET localhost:9200/user_data
{ "user_data": { "aliases": {}, "mappings": { "user_data": { "properties": { "address": { "type": "string" }, "age": { "type": "long" }, "comment": { "type": "string" }, "created_on": { "type": "date", "format": "dateOptionalTime" }, "custom": { "properties": { "key": { "type": "string" }, "value": { "type": "string" } } }, "gender": { "type": "string" }, "name": { "type": "string" }, "qq": { "type": "string" }, "tel": { "type": "string" }, "updated_on": { "type": "date", "format": "dateOptionalTime" }, } } }, "settings": { "index": { "creation_date": "1458832279465", "uuid": "Fbmthc3lR0ya51zCnWidYg", "number_of_replicas": "1", "number_of_shards": "5", "version": { "created": "1070299" } } }, "warmers": {} } }
Mapping:
{ "settings": { "analysis": { "analyzer": { "index_phone_analyzer": { "type": "custom", "char_filter": [ "digit_only" ], "tokenizer": "digit_edge_ngram_tokenizer", "filter": [ "trim" ] }, "search_phone_analyzer": { "type": "custom", "char_filter": [ "digit_only" ], "tokenizer": "keyword", "filter": [ "trim" ] }, "index_email_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "name_ngram_filter", "trim" ] }, "search_email_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "trim" ] } }, "char_filter": { "digit_only": { "type": "pattern_replace", "pattern": "\\D+", "replacement": "" } }, "tokenizer": { "digit_edge_ngram_tokenizer": { "type": "edgeNGram", "min_gram": "3", "max_gram": "15", "token_chars": [ "digit" ] } }, "filter": { "name_ngram_filter": { "type": "ngram", "min_gram": "3", "max_gram": "20" } } } }, "mappings" : { "user_data" : { "properties" : { "name" : { "type" : "string", "analyzer" : "ik" }, "age" : { "type" : "integer" }, "gender": { "type" : "string" }, "qq" : { "type" : "string" }, "email" : { "type" : "string", "analyzer": "index_email_analyzer", "search_analyzer": "search_email_analyzer" }, "tel" : { "type" : "string", "analyzer": "index_phone_analyzer", "search_analyzer": "search_phone_analyzer" }, "address" : { "type": "string", "analyzer" : "ik" }, "comment" : { "type" : "string", "analyzer" : "ik" }, "created_on" : { "type" : "date", "format" : "dateOptionalTime" }, "updated_on" : { "type" : "date", "format" : "dateOptionalTime" }, "custom": { "type" : "nested", "properties" : { "key" : { "type" : "string" }, "value" : { "type" : "string" } } } } } } }
A simple way to do this is to create custom analyzers: one that uses an n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and one that uses an edge n-gram tokenizer for phone numbers (see index_phone_analyzer and search_phone_analyzer below).
The full index definition is given below.
PUT myindex { "settings": { "analysis": { "analyzer": { "email_url_analyzer": { "type": "custom", "tokenizer": "uax_url_email", "filter": [ "trim" ] }, "index_phone_analyzer": { "type": "custom", "char_filter": [ "digit_only" ], "tokenizer": "digit_edge_ngram_tokenizer", "filter": [ "trim" ] }, "search_phone_analyzer": { "type": "custom", "char_filter": [ "digit_only" ], "tokenizer": "keyword", "filter": [ "trim" ] }, "index_email_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "name_ngram_filter", "trim" ] }, "search_email_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "trim" ] } }, "char_filter": { "digit_only": { "type": "pattern_replace", "pattern": "\\D+", "replacement": "" } }, "tokenizer": { "digit_edge_ngram_tokenizer": { "type": "edgeNGram", "min_gram": "1", "max_gram": "15", "token_chars": [ "digit" ] } }, "filter": { "name_ngram_filter": { "type": "ngram", "min_gram": "1", "max_gram": "20" } } } }, "mappings": { "your_type": { "properties": { "email": { "type": "string", "analyzer": "index_email_analyzer", "search_analyzer": "search_email_analyzer" }, "phone": { "type": "string", "analyzer": "index_phone_analyzer", "search_analyzer": "search_phone_analyzer" } } } } }
Now, let's dissect it piece by piece.
For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge n-gram tokenizer to index every prefix of the phone number. So if your phone number is 1362435647, the following tokens are produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.
Then, at search time, we use a different analyzer, search_phone_analyzer, which simply takes the input number (e.g. 136) and matches it against the phone field using a simple match or term query:
POST myindex/_search { "query": { "term": { "phone": "136" } } }
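To make the prefix idea concrete, here is a minimal Python sketch that imitates what index_phone_analyzer and search_phone_analyzer do to a value. It is only an approximation written for illustration, not Elasticsearch code: the function names index_phone_tokens and search_phone_token are hypothetical, and the defaults mirror the min_gram and max_gram settings from the index definition above.

```python
import re


def index_phone_tokens(raw, min_gram=1, max_gram=15):
    """Imitate index_phone_analyzer: drop non-digits, then emit every prefix."""
    # The digit_only char filter removes everything that is not a digit.
    digits = re.sub(r"\D+", "", raw).strip()
    # The edge n-gram tokenizer emits prefixes of length min_gram..max_gram.
    longest = min(max_gram, len(digits))
    return [digits[:size] for size in range(min_gram, longest + 1)]


def search_phone_token(raw):
    """Imitate search_phone_analyzer: drop non-digits, keep one whole token."""
    return re.sub(r"\D+", "", raw).strip()


if __name__ == "__main__":
    tokens = index_phone_tokens("136-243-5647")
    print(tokens)
    # A search for "136" matches because "136" is one of the indexed prefixes.
    print(search_phone_token("136") in tokens)
```

The point of the design is that the expensive work (generating all prefixes) happens once at index time, so the search itself is a plain lookup of a single token rather than a wildcard scan.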
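For the email field, we proceed in a similar way: email values are indexed with index_email_analyzer, which uses an ngram token filter that produces every substring between 1 and 20 characters long from each token of the email value. For example, john@gmail.com is first split by the standard tokenizer into john and gmail.com, which are then expanded into j, jo, joh, ..., gmail.com, and so on.
Then, at search time, we use another analyzer called search_email_analyzer, which takes the input and tries to match it against the indexed tokens.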
POST myindex/_search { "query": { "match": { "email": "@gmail.com" } } }
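The email side can be sketched in Python in the same spirit. This is a rough illustration under stated assumptions, not the real analysis chain: rough_standard_tokenize is a hypothetical stand-in that only splits on @ and whitespace (the real standard tokenizer follows Unicode word-boundary rules), and the other function names are made up for this example.

```python
import re


def ngrams(token, min_gram=1, max_gram=20):
    """Return every substring of token whose length is min_gram..max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])
    return grams


def rough_standard_tokenize(text):
    """ASSUMPTION: crude stand-in for the standard tokenizer (split on @ and spaces)."""
    return [part for part in re.split(r"[@\s]+", text) if part]


def index_email_tokens(email):
    """Imitate index_email_analyzer: tokenize, lowercase, then n-gram each token."""
    tokens = set()
    for word in rough_standard_tokenize(email.lower()):
        tokens.update(ngrams(word))
    return tokens


def search_email_tokens(text):
    """Imitate search_email_analyzer: tokenize and lowercase, no n-grams."""
    return rough_standard_tokenize(text.lower())


if __name__ == "__main__":
    indexed = index_email_tokens("John@gmail.com")
    wanted = search_email_tokens("@gmail.com")
    print(wanted)
    # The search succeeds when an analyzed search token is among the indexed n-grams.
    print(any(token in indexed for token in wanted))
```

Under these assumptions the search input @gmail.com is reduced to the single token gmail.com, which is one of the n-grams stored for the address. Note that this kind of matching finds gmail.com anywhere inside a token, so it behaves as a substring match rather than a strict suffix match.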
The email_url_analyzer is not used in this example, but I have included it in case you need to match the exact email value.