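I want to use ElasticSearch to search file names (not the contents of the files). So I need to find a part of the file name (exact match, no fuzzy search).

Example: I have files with the following names:

    My_first_file_created_at_2012.01.13.doc
    My_second_file_created_at_2012.01.13.pdf
    Another file.txt
    And_again_another_file.docx
    foo.bar.txt

Now I want to search for 2012.01.13 to get the first two files. A search for file or ile should return all file names except the last one.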
How can I do that with ElasticSearch?

This is what I have tested, but it always returns zero results:
    curl -X DELETE localhost:9200/files
    curl -X PUT localhost:9200/files -d '
    {
      "settings" : {
        "index" : {
          "analysis" : {
            "analyzer" : {
              "filename_analyzer" : {
                "type" : "custom",
                "tokenizer" : "lowercase",
                "filter" : ["filename_stop", "filename_ngram"]
              }
            },
            "filter" : {
              "filename_stop" : {
                "type" : "stop",
                "stopwords" : ["doc", "pdf", "docx"]
              },
              "filename_ngram" : {
                "type" : "nGram",
                "min_gram" : 3,
                "max_gram" : 255
              }
            }
          }
        }
      },
      "mappings": {
        "files": {
          "properties": {
            "filename": {
              "type": "string",
              "analyzer": "filename_analyzer"
            }
          }
        }
      }
    }
    '

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
    curl -X POST "http://localhost:9200/files/_refresh"

    FILES='
        http://localhost:9200/files/_search?q=filename:2012.01.13
    '

    for file in ${FILES}
    do
      echo; echo; echo ">>> ${file}"
      curl "${file}&pretty=true"
    done
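There are several problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you check the mapping, you will see that immediately: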
    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string",
    #                "analyzer" : "filename_analyzer"
    #             }
    #          }
    #       },
    #       "file" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string"
    #             }
    #          }
    #       }
    #    }
    # }
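2) Incorrect analyzer definition

You have specified the lowercase tokenizer, but that removes anything that isn't a letter (see the docs), so your numbers are completely removed.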
You can check this with the analyze API:
    curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase'

    # {
    #    "tokens" : [
    #       {
    #          "end_offset" : 2,
    #          "position" : 1,
    #          "start_offset" : 0,
    #          "type" : "word",
    #          "token" : "my"
    #       },
    #       {
    #          "end_offset" : 7,
    #          "position" : 2,
    #          "start_offset" : 3,
    #          "type" : "word",
    #          "token" : "file"
    #       },
    #       {
    #          "end_offset" : 22,
    #          "position" : 3,
    #          "start_offset" : 19,
    #          "type" : "word",
    #          "token" : "doc"
    #       }
    #    ]
    # }
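If you don't have a node handy, the effect of the lowercase tokenizer can be approximated offline. The following is only a rough Python sketch of the behaviour described above (split on anything that isn't a letter, then lowercase). The function name `lowercase_tokenize` is made up for illustration and this is not ElasticSearch code:

```python
import re


def lowercase_tokenize(text):
    """Rough stand-in for the lowercase tokenizer (an approximation only).

    Keeps runs of letters and lowercases them; digits, dots and
    underscores act as separators and are dropped.
    """
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # i.e. a letter.
    return re.findall(r"[^\W\d_]+", text.lower())


print(lowercase_tokenize("My_file_2012.01.13.doc"))
# ['my', 'file', 'doc']
```

The date never makes it into the token stream, which is why a search for 2012.01.13 cannot match anything.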
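3) Ngrams at search time

You've included the ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want to index the ngrams. But when searching, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens: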
    a b c d ab bc cd abc bcd abcd

But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:
    d c b a dc cb ba dcb cba dcba

So a, b, c and d will match!
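The overlap is easy to reproduce outside ElasticSearch. This is a small Python sketch that only enumerates substrings; the helper `ngrams` is a hypothetical name, and it says nothing about how ElasticSearch stores or scores the tokens:

```python
def ngrams(text, min_gram, max_gram):
    """Return the set of all substrings of length min_gram..max_gram."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


indexed = ngrams("abcd", 1, 4)   # what the index analyzer would produce
searched = ngrams("dcba", 1, 4)  # what an ngram search analyzer would produce

# Tokens present on both sides are what makes the wrong document match.
print(sorted(indexed & searched))
# ['a', 'b', 'c', 'd']
```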
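Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which anchor the ngram to the start (or end) of each word.

Also, why exclude docx and so on? Surely a user may well want to search on the file type?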
So let's break up each file name into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):

    My_first_file_2012.01.13.doc
    => my first file 2012 01 13 doc
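To get a feel for that split before touching the index, here is a rough Python approximation. It assumes that `[\W_]+` in Python's `re` module is close enough to "anything that isn't a letter or a number" for these file names; `filename_tokenize` is a made-up helper, not part of ElasticSearch:

```python
import re


def filename_tokenize(name):
    """Split a file name on runs of non-alphanumeric characters and lowercase it.

    Approximation only: Python's \\W treats "_" as a word character,
    so it is added to the separator class explicitly.
    """
    return [tok for tok in re.split(r"[\W_]+", name.lower()) if tok]


print(filename_tokenize("My_first_file_2012.01.13.doc"))
# ['my', 'first', 'file', '2012', '01', '13', 'doc']
```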
Then, for the index analyzer, we also apply edge ngrams to each of the tokens:

    my     => m my
    first  => f fi fir firs first
    file   => f fi fil file
    2012   => 2 20 201 2012
    01     => 0 01
    13     => 1 13
    doc    => d do doc
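The expansion above can be sketched in a few lines of Python. This is illustrative only: `edge_ngrams` is a hypothetical helper, and its defaults are assumptions that mirror a front-anchored filter with min_gram 1 and max_gram 20:

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with length min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


for token in ["my", "first", "file", "2012", "01", "13", "doc"]:
    print(token, "=>", " ".join(edge_ngrams(token)))

# Only prefixes are produced, so a search for "ile" finds nothing to match.
print("ile" in edge_ngrams("file"))
# False
```

That last line is the trade-off of edge ngrams: fil finds file, but ile no longer does.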
We create the index as follows:

    curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1' -d '
    {
       "settings" : {
          "analysis" : {
             "analyzer" : {
                "filename_search" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase"]
                },
                "filename_index" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase","edge_ngram"]
                }
             },
             "tokenizer" : {
                "filename" : {
                   "pattern" : "[^\\p{L}\\d]+",
                   "type" : "pattern"
                }
             },
             "filter" : {
                "edge_ngram" : {
                   "side" : "front",
                   "max_gram" : 20,
                   "min_gram" : 1,
                   "type" : "edgeNGram"
                }
             }
          }
       },
       "mappings" : {
          "file" : {
             "properties" : {
                "filename" : {
                   "type" : "string",
                   "search_analyzer" : "filename_search",
                   "index_analyzer" : "filename_index"
                }
             }
          }
       }
    }
    '
Now, test that our analyzers are working correctly:

filename_search:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search'

    [results snipped]
    "token" : "my"
    "token" : "first"
    "token" : "file"
    "token" : "2012"
    "token" : "01"
    "token" : "13"
    "token" : "doc"

filename_index:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index'

    "token" : "m"
    "token" : "my"
    "token" : "f"
    "token" : "fi"
    "token" : "fir"
    "token" : "firs"
    "token" : "first"
    "token" : "f"
    "token" : "fi"
    "token" : "fil"
    "token" : "file"
    "token" : "2"
    "token" : "20"
    "token" : "201"
    "token" : "2012"
    "token" : "0"
    "token" : "01"
    "token" : "1"
    "token" : "13"
    "token" : "d"
    "token" : "do"
    "token" : "doc"
OK, that seems to be working correctly. So let's add some docs:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
    curl -X POST "http://localhost:9200/files/_refresh"
And try a search:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
    {
       "query" : {
          "text" : {
             "filename" : "2012.01"
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.06780553,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }
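Success!

#### UPDATE ####

I realized that a search for 2012.01 would match both 2012.01.12 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as that of the start of the word).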
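The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query, which tries to match ANY token. A text_phrase query, however, tries to match ALL of the tokens, and in the correct order.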
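The difference between "any token" and "all tokens in order" can be illustrated with a deliberately simplified Python model. It works on whole tokens only and ignores edge ngrams, the position increments mentioned above, and scoring, so treat it as a sketch of the idea rather than of what ElasticSearch does internally. The function names and token lists are made up for the example:

```python
def matches_any(doc_tokens, query_tokens):
    """True if at least one query token occurs anywhere in the document."""
    return any(tok in doc_tokens for tok in query_tokens)


def matches_phrase(doc_tokens, query_tokens):
    """True if the query tokens occur consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


query = ["2012", "01"]  # "2012.01" after tokenizing
jan_13 = ["my", "first", "file", "created", "at", "2012", "01", "13", "doc"]
dec_01 = ["my", "third", "file", "created", "at", "2012", "12", "01", "doc"]

print(matches_any(jan_13, query), matches_any(dec_01, query))
# True True
print(matches_phrase(jan_13, query), matches_phrase(dec_01, query))
# True False
```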
To demonstrate the issue, index another doc with a different date:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
    curl -X POST "http://localhost:9200/files/_refresh"
And do the same search as above:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
    {
       "query" : {
          "text" : {
             "filename" : {
                "query" : "2012.01"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.22097087,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.22097087,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 5
    # }
The first result has the date 2012.12.01, which isn't the best match for 2012.01. So, to match only that exact phrase, we can do:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
    {
       "query" : {
          "text_phrase" : {
             "filename" : {
                "query" : "2012.01",
                "analyzer" : "filename_index"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.55737644,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 7
    # }
Or, if you still want to match all 3 files (because the user might remember some of the words in the file name, but in the wrong order), you can run both queries but boost the importance of the file name that has the tokens in the correct order:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
    {
       "query" : {
          "bool" : {
             "should" : [
                {
                   "text_phrase" : {
                      "filename" : {
                         "boost" : 2,
                         "query" : "2012.01",
                         "analyzer" : "filename_index"
                      }
                   }
                },
                {
                   "text" : {
                      "filename" : "2012.01"
                   }
                }
             ]
          }
       }
    }
    '

    # [Fri Feb 24 16:31:02 2012] Response:
    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.012931341,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.56892186,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }
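If you build this request from application code rather than by hand, the body could be assembled along these lines. This is only a sketch: `build_filename_query` and its `phrase_boost` parameter are hypothetical names, and the function just reproduces the JSON structure of the query above for an arbitrary search term without sending anything to ElasticSearch:

```python
import json


def build_filename_query(term, phrase_boost=2):
    """Build the bool/should body shown above for a given search term.

    The phrase clause is boosted so that file names containing the
    tokens in the right order rank above loose matches.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "text_phrase": {
                            "filename": {
                                "boost": phrase_boost,
                                "query": term,
                                "analyzer": "filename_index",
                            }
                        }
                    },
                    {"text": {"filename": term}},
                ]
            }
        }
    }


# Serialize the body so it can be passed to curl -d or an HTTP client.
print(json.dumps(build_filename_query("2012.01"), indent=2))
```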