我曾经在之后rosetta.parallel.pandas_easy进行并行化,例如:apply``groupby
rosetta.parallel.pandas_easy
apply``groupby
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]},index= ['g1', 'g1', 'g2']) groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)
但是,有没有人想出如何并行化返回DataFrame的函数?rosetta如预期,此代码对于失败。
rosetta
def tmpFunc(df): df['c'] = df.a + df.b return df df.groupby(df.index).apply(tmpFunc) groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)
尽管确实应该将其内置到熊猫中,但这似乎可行
import pandas as pd from joblib import Parallel, delayed import multiprocessing def tmpFunc(df): df['c'] = df.a + df.b return df def applyParallel(dfGrouped, func): retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped) return pd.concat(retLst) if __name__ == '__main__': df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]},index= ['g1', 'g1', 'g2']) print 'parallel version: ' print applyParallel(df.groupby(df.index), tmpFunc) print 'regular version: ' print df.groupby(df.index).apply(tmpFunc) print 'ideal version (does not work): ' print df.groupby(df.index).applyParallel(tmpFunc)