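I have a pandas DataFrame that I want to filter based on whether certain conditions are met. I ran a loop and an .apply() and timed both with %%timeit. The dataset has about 45,000 rows. The code for the loop is: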
    %%timeit
    qualified_actions = []
    for row in all_actions.index:
        if all_actions.ix[row, 'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row, 'Upper']:
            qualified_actions.append(True)
        else:
            qualified_actions.append(False)
1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the .apply() version is:
    %%timeit
    qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)
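6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought .apply() was supposed to be faster than looping over a pandas DataFrame. Can someone explain why it is slower in this case?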
apply uses loops under the hood, so if you need better performance, the best and fastest option is a vectorized alternative.
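To make the "loops under the hood" point concrete, here is a rough sketch comparing apply with axis=1 to an explicit Python loop over iterrows. It is an illustration of the idea only, not pandas' actual internals. The names df and check, and the 1000-row size, are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same column names as in the question
rng = np.random.RandomState(2017)
df = pd.DataFrame(rng.randint(50, size=(1000, 3)),
                  columns=['Lower', 'Mid', 'Upper'])

def check(row):
    # Called once per row; `row` is a Series built for that row
    return row['Lower'] <= row['Mid'] <= row['Upper']

# Row-wise apply: one Python-level function call per row
via_apply = df.apply(check, axis=1)

# Explicit loop doing comparable work: iterrows also yields one Series per row
via_loop = pd.Series([check(row) for _, row in df.iterrows()], index=df.index)

# Both approaches produce the same boolean values
same_result = via_apply.tolist() == via_loop.tolist()
print(same_result)
```

In both cases the comparison runs in Python once per row, which is why neither gets close to the speed of a vectorized expression.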
No loops are needed; just chain two conditions in a vectorized solution:

    m1 = all_actions['Lower'] <= all_actions['Mid']
    m2 = all_actions['Mid'] <= all_actions['Upper']
    qualified_actions = m1 & m2
Thanks to Jon Clements for another solution:

    all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
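As a quick sanity check, the sketch below builds random data with the same column names and confirms that the chained masks and between give the same result, then uses the mask to filter the frame. The random data and the names chained, between_mask and filtered are assumptions for illustration. It relies on between being inclusive on both ends by default.

```python
import numpy as np
import pandas as pd

# Random stand-in data; the real all_actions is not shown in the question
np.random.seed(2017)
all_actions = pd.DataFrame(np.random.randint(50, size=(45000, 3)),
                           columns=['Lower', 'Mid', 'Upper'])

# Solution 1: two vectorized comparisons combined with &
m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
chained = m1 & m2

# Solution 2: Series.between, inclusive on both ends by default
between_mask = all_actions['Mid'].between(all_actions['Lower'],
                                          all_actions['Upper'])

# The two boolean masks should be identical
masks_match = chained.equals(between_mask)
print(masks_match)

# Boolean indexing keeps only the rows where the mask is True
filtered = all_actions[chained]
print(len(filtered), "of", len(all_actions), "rows qualify")
```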
Timings:

    import numpy as np
    import pandas as pd

    np.random.seed(2017)
    N = 45000
    all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)),
                               columns=['Lower', 'Mid', 'Upper'])
    # print(all_actions)
    In [85]: %%timeit
        ...: qualified_actions = []
        ...: for row in all_actions.index:
        ...:     if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
        ...:         qualified_actions.append(True)
        ...:     else:
        ...:         qualified_actions.append(False)
        ...:
    __main__:259: DeprecationWarning:
    .ix is deprecated. Please use
    .loc for label based indexing or
    .iloc for positional indexing

    See the documentation here:
    http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
    1 loop, best of 3: 579 ms per loop

    In [86]: %%timeit
        ...: (all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1))
        ...:
    1 loop, best of 3: 1.17 s per loop

    In [87]: %%timeit
        ...: ((all_actions['Lower'] <= all_actions['Mid']) & (all_actions['Mid'] <= all_actions['Upper']))
        ...:
    1000 loops, best of 3: 509 µs per loop

    In [90]: %%timeit
        ...: (all_actions.Mid.between(all_actions.Lower, all_actions.Upper))
        ...:
    1000 loops, best of 3: 520 µs per loop
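The session above relies on IPython's %%timeit magic and on .ix, which was deprecated and later removed from pandas. For anyone who wants to rerun the comparison as a plain script, here is a minimal sketch that uses the standard-library timeit module and swaps .ix for .loc. The function names, the smaller N, and the repeat count are choices made for this example, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2017)
N = 2000  # assumption: smaller than the 45000 rows above so the script runs quickly
all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)),
                           columns=['Lower', 'Mid', 'Upper'])

def with_loop():
    # Explicit Python loop; .loc replaces the removed .ix indexer
    return [
        all_actions.loc[row, 'Lower'] <= all_actions.loc[row, 'Mid'] <= all_actions.loc[row, 'Upper']
        for row in all_actions.index
    ]

def with_apply():
    # Row-wise apply: still one Python call per row
    return all_actions.apply(
        lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)

def with_masks():
    # Vectorized: two column comparisons combined with &
    return ((all_actions['Lower'] <= all_actions['Mid'])
            & (all_actions['Mid'] <= all_actions['Upper']))

def with_between():
    # Vectorized: Series.between with column bounds
    return all_actions['Mid'].between(all_actions['Lower'], all_actions['Upper'])

REPEATS = 3
timings = {}
for func in (with_loop, with_apply, with_masks, with_between):
    # Average wall-clock time per call over REPEATS calls
    timings[func.__name__] = timeit.timeit(func, number=REPEATS) / REPEATS

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```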