I am using NLTK's ne_chunk to extract named entities from a text:
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nltk.ne_chunk(my_sent, binary=True)
But I can't figure out how to save these entities to a list. For example:
print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')
Thanks.
nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the named entities.
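To make that structure concrete, here is a minimal sketch that walks the top-level children of a chunk tree. The tree is built by hand as a stand-in for ne_chunk output (it is illustrative, not real tagger output), and the names toy_tree and top_level_entities are made up for this example:

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (illustrative, not real tagger output).
toy_tree = Tree("S", [
    ("weather", "NN"),
    ("in", "IN"),
    Tree("NE", [("New", "NNP"), ("York", "NNP")]),
    ("and", "CC"),
    Tree("NE", [("Brooklyn", "NNP")]),
])

def top_level_entities(tree):
    """Return the text of every entity subtree directly under the root."""
    entities = []
    for child in tree:
        # Entity chunks are Tree nodes; ordinary tokens are (word, tag) tuples.
        if isinstance(child, Tree):
            entities.append(" ".join(word for word, tag in child.leaves()))
    return entities

print(top_level_entities(toy_tree))  # ['New York', 'Brooklyn']
```

Building the tree by hand keeps the sketch runnable without downloading the tokenizer, tagger, and chunker models.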
Take a look at the related question "Named Entity Recognition with Regular Expression: NLTK":
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>>
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         # Entity chunks are Tree objects; plain tokens are (word, tag) tuples.
...         if type(i) == Tree:
...             current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...         if current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             # Reset even for a repeated entity so it cannot leak into the next one.
...             current_chunk = []
...         else:
...             continue
...     return continuous_chunk
...
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']
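If you also want the entity type, one possible variation is to use Tree.subtrees() and keep each subtree's label. With binary=True every entity is labelled NE; without it, ne_chunk uses labels such as PERSON, GPE, or ORGANIZATION. The sketch below is not part of the answer above: the helper name entities_with_labels is hypothetical, the tree is hand-built for illustration, and it assumes the root node is labelled "S":

```python
from nltk.tree import Tree

def entities_with_labels(chunk_tree):
    """Collect (entity text, label) pairs from a chunk tree, without duplicates."""
    found = []
    # subtrees() also yields the root, so skip the sentence-level "S" node
    # (assumption: the root of the chunk tree is labelled "S").
    for subtree in chunk_tree.subtrees(filter=lambda t: t.label() != "S"):
        pair = (" ".join(word for word, tag in subtree.leaves()), subtree.label())
        if pair not in found:
            found.append(pair)
    return found

# Hand-built stand-in for ne_chunk output (illustrative, not real tagger output).
toy_tree = Tree("S", [
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    Tree("PERSON", [("Loretta", "NNP"), ("E.", "NNP"), ("Lynch", "NNP")]),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

print(entities_with_labels(toy_tree))
# [('New York', 'GPE'), ('Loretta E. Lynch', 'PERSON')]
```

On real text you would pass in ne_chunk(pos_tag(word_tokenize(text))) instead of the toy tree.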