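I am trying to use Python to compare one word against a list of other words and retrieve a list of the most similar ones. To do that I am using the difflib.get_close_matches function. I am on a relatively new and powerful Windows 7 laptop, running Python 2.6.5.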
    from multiprocessing import Pool
    import random, time, difflib

    # constants
    wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(1000000)]
    mainword = "hello"

    # comparison function
    def findclosematch(subwordlist):
        matches = difflib.get_close_matches(mainword,subwordlist,len(subwordlist),0.7)
        if matches <> []:
            return matches

    # pool
    print "pool method"
    if __name__ == '__main__':
        pool = Pool(processes=3)
        t=time.time()
        result = pool.map_async(findclosematch, wordlist, chunksize=100)
        #do something with result
        for r in result.get():
            pass
        print time.time()-t

    # normal
    print "normal method"
    t=time.time()
    # run function
    result = findclosematch(wordlist)
    # do something with results
    for r in result:
        pass
    print time.time()-t
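The word to find is "hello", and the list to search for close matches is one million words long, each word made of 5 randomly joined characters (for illustration purposes only). I use 3 processor cores and the pool's map_async function with a chunksize of 100 (the number of list items handed to a worker at a time, I think?). I also tried chunksizes of 1000 and 10000, but it made no real difference. Note that in both methods I start the timer right before calling the function and stop it right after looping through the results. As you can see below, the timing results clearly favor the original non-Pool method: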
    >>>
    pool method
    37.1690001488 seconds
    normal method
    10.5329999924 seconds
    >>>
The Pool method is roughly 3.5 times slower than the original method. Is there something I am missing here, or am I misunderstanding how pooling/multiprocessing works? I suspect part of the problem is that my function returns None whenever it finds no match, and map keeps one result per input, so thousands of unnecessary items end up in the results list even though I only want actual matches returned. From what I understand, that is just how map works. I have heard of other functions such as filter that only collect non-False results, but I don't think multiprocessing/Pool supports a filter method. Are there any functions besides map/imap in the multiprocessing module that would give me back only what my function actually returns? As I understand it, the apply function is more for passing multiple arguments.
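One thing that may be worth checking here: Pool.map and map_async call the function once per element of the iterable, and chunksize only controls how many elements are sent to a worker in one batch. So with a flat word list, each call receives a single 5-letter string rather than a sub-list. Below is a minimal sketch of splitting the list into real sub-lists first and dropping the None placeholders afterwards. It is written for Python 3, not the 2.6.5 used above, and the helper names split_list and drop_empty are made up for illustration.

```python
import difflib
from multiprocessing import Pool

MAIN_WORD = "hello"  # same target word as in the question


def find_close_match(sub_word_list):
    """Return close matches to MAIN_WORD in sub_word_list, or None if there are none."""
    matches = difflib.get_close_matches(MAIN_WORD, sub_word_list, len(sub_word_list), 0.7)
    return matches or None


def split_list(items, chunk_size):
    """Cut items into consecutive slices of at most chunk_size elements (hypothetical helper)."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def drop_empty(results):
    """Flatten per-chunk results, skipping the None placeholders (hypothetical helper)."""
    return [word for chunk in results if chunk is not None for word in chunk]


if __name__ == "__main__":
    words = ["hallo", "jello", "world", "xxxxx", "help", "hellp"]
    chunks = split_list(words, 2)
    with Pool(processes=2) as pool:
        # One call per chunk: every worker call now receives a real list of words.
        per_chunk = pool.map(find_close_match, chunks)
    print(per_chunk)              # still one entry per chunk, None where nothing matched
    print(drop_empty(per_chunk))  # only the actual matches
```

As far as I know Pool has no filter method, so filtering in the parent after map returns is the simple option. With large sub-lists the number of None entries is small anyway.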
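It seems the slowdown was related to the slow startup time of the additional processes. I couldn't get the .Pool() function to be fast enough. My final solution for making it faster was to split the workload list manually, use multiple .Process() instead of .Pool(), and return the results in a Queue. But I wonder whether the most crucial change was splitting the workload by the main words to look up rather than by the words to compare against, perhaps because the difflib search function is already so fast. Here is the new code, which runs 5 processes at the same time and turned out roughly 9 times faster than the simple code (6 seconds vs 55 seconds). On top of how fast difflib already is, this is very useful for fast fuzzy lookups.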
    from multiprocessing import Process, Queue
    import difflib, random, time

    def f2(wordlist, mainwordlist, q):
        # worker: look up each of its main words in the full comparison list
        for mainword in mainwordlist:
            matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
            q.put(matches)

    if __name__ == '__main__':

        # constants (for 50 input words, find closest match in list of 100 000 comparison words)
        q = Queue()
        wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(100000)]
        mainword = "hello"
        mainwordlist = [mainword for each in xrange(50)]

        # normal approach (results kept in a list so they do not mix with the workers' queue)
        t = time.time()
        normalresults = []
        for mainword in mainwordlist:
            matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
            normalresults.append(matches)
        print time.time()-t

        # split work into 5 or 10 processes
        processes = 5
        def splitlist(inlist, chunksize):
            return [inlist[x:x+chunksize] for x in xrange(0, len(inlist), chunksize)]
        print len(mainwordlist)/processes
        mainwordlistsplitted = splitlist(mainwordlist, len(mainwordlist)/processes)
        print "list ready"

        t = time.time()
        workers = []
        for submainwordlist in mainwordlistsplitted:
            print "sub"
            p = Process(target=f2, args=(wordlist,submainwordlist,q,))
            p.daemon = True  # the attribute is lowercase; "p.Daemon" has no effect
            p.start()
            workers.append(p)
        # fetch one result per main word before joining, so no worker blocks on a full queue
        results = [q.get() for each in mainwordlist]
        for p in workers:  # join every worker, not just the last one started
            p.join()
        print time.time()-t
        for matches in results:
            print matches
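So it follows that a better approach might be to spin off n processes, each of which is responsible for loading/generating a 1/n segment of the list and checking whether the word is in that part of the list.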
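A rough sketch of that idea, assuming Python 3 and random words standing in for a real data source. Each worker receives only a small task tuple, builds its own segment, and searches it, so the big list never has to be sent between processes. The names make_segment, search_segment and build_tasks are hypothetical, and using the segment index as the random seed is just a convenience for the example.

```python
import difflib
import random
from multiprocessing import Pool

ALPHABET = "abcdefghijklmnopqersty"  # alphabet string copied from the code in this thread


def make_segment(seed, size, word_length=5):
    """Generate one segment of random words; stands in for loading 1/n of a real list."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(word_length)) for _ in range(size)]


def search_segment(task):
    """Build this worker's own segment, then return its close matches to the target."""
    target, seed, size, cutoff = task
    segment = make_segment(seed, size)
    if not segment:  # get_close_matches rejects n=0, so handle empty segments here
        return []
    return difflib.get_close_matches(target, segment, n=len(segment), cutoff=cutoff)


def build_tasks(target, total_words, n_processes, cutoff=0.7):
    """Describe n segments of roughly total_words / n words each."""
    base, extra = divmod(total_words, n_processes)
    return [(target, seed, base + (1 if seed < extra else 0), cutoff)
            for seed in range(n_processes)]


if __name__ == "__main__":
    tasks = build_tasks("hello", total_words=100000, n_processes=4)
    with Pool(processes=4) as pool:
        per_segment = pool.map(search_segment, tasks)
    # Each part is sorted by similarity on its own; the merged list is not re-sorted.
    matches = [word for part in per_segment for word in part]
    print(len(matches), "close matches found")
```

Whether this beats the single-process version will depend on how expensive generating or loading a segment is compared with starting the processes, so it is worth timing both on the actual data.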