How to efficiently iterate over consecutive chunks of a pandas DataFrame

一尘不染

python

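I have a large DataFrame (several million rows).

I want to be able to do a groupby operation on it, but grouping only by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go into.

The use case: I want to apply a function to each row via a parallel map in IPython. It doesn't matter which rows go to which back-end engine, because the function computes its result from one row at a time. (Conceptually, at least; in practice it is vectorized.)

I've come up with something like this: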

# Generate a number from 0-9 for each row, indicating which tenth of the DF it belongs to
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)

# Use this value to perform a groupby, yielding 10 consecutive chunks
groups = [g[1] for g in dataframe.groupby(tenths)]

# Process chunks in parallel
results = dview.map_sync(my_function, groups)

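But this seems very long-winded, and it doesn't guarantee equal-sized chunks, especially if the index is sparse or non-integer or the like.

Any suggestions for a better way?

Thanks!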

2020-12-20

1 answer

一尘不染

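In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case the only equal-sized chunks you could get are of size 1 or N. For that reason, real-world chunking typically uses a fixed size and allows a smaller chunk at the end. I tend to pass an array to groupby. Starting from: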

>>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
>>> df[0] = range(15)
>>> df
    0         1         2         3         4
0   0  0.746300  0.346277  0.220362  0.172680
0   1  0.657324  0.687169  0.384196  0.214118
0   2  0.016062  0.858784  0.236364  0.963389
[...]
0  13  0.510273  0.051608  0.230402  0.756921
0  14  0.950544  0.576539  0.642602  0.907850

[15 rows x 5 columns]

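where I've deliberately made the index uninformative by setting it to 0, we simply decide on a chunk size (here 10) and integer-divide an array of row positions by it: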

>>> df.groupby(np.arange(len(df))//10)
<pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
>>> for k,g in df.groupby(np.arange(len(df))//10):
...     print(k,g)
...     
0    0         1         2         3         4
0  0  0.746300  0.346277  0.220362  0.172680
0  1  0.657324  0.687169  0.384196  0.214118
0  2  0.016062  0.858784  0.236364  0.963389
[...]
0  8  0.241049  0.246149  0.241935  0.563428
0  9  0.493819  0.918858  0.193236  0.266257

[10 rows x 5 columns]
1     0         1         2         3         4
0  10  0.037693  0.370789  0.369117  0.401041
0  11  0.721843  0.862295  0.671733  0.605006
[...]
0  14  0.950544  0.576539  0.642602  0.907850

[5 rows x 5 columns]
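
As a minimal sketch, the same idea can be wrapped in a small generator. The name `chunk_by_position` is hypothetical (it is not part of pandas); the only pandas behaviour relied on is that `groupby` accepts an array of labels with one entry per row.

```python
import numpy as np
import pandas as pd


def chunk_by_position(df, chunk_size):
    """Yield consecutive chunks of at most chunk_size rows, ignoring the index.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Row positions 0..len(df)-1, integer-divided by chunk_size,
    # give one group label per chunk: 0, 0, ..., 1, 1, ...
    keys = np.arange(len(df)) // chunk_size
    for _, chunk in df.groupby(keys):
        yield chunk


# Same setup as above: 15 rows with a deliberately meaningless index.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
sizes = [len(chunk) for chunk in chunk_by_position(df, 10)]
print(sizes)
```

With 15 rows and a chunk size of 10, this yields one chunk of 10 rows followed by one of 5. The resulting list of chunks could then be handed to a parallel map such as the `dview.map_sync` call in the question.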

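Methods based on slicing the DataFrame by index labels can fail when the index isn't compatible with that kind of slicing, although you can always use .iloc[a:b] to ignore the index values and access the data by position.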
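
A positional version along those lines might look like the sketch below. Both helper names (`iloc_chunks`, `split_into_n`) are hypothetical. The second helper is a swapped-in variant, not something described in the answer: it applies `np.array_split` to the row positions to get a fixed number of chunks whose sizes differ by at most one row.

```python
import numpy as np
import pandas as pd


def iloc_chunks(df, chunk_size):
    """Yield fixed-size positional slices; the last one may be smaller."""
    for start in range(0, len(df), chunk_size):
        # .iloc slices by position, so the index values are irrelevant.
        yield df.iloc[start:start + chunk_size]


def split_into_n(df, n_chunks):
    """Yield n_chunks consecutive pieces whose sizes differ by at most one."""
    # np.array_split tolerates lengths that are not divisible by n_chunks.
    for positions in np.array_split(np.arange(len(df)), n_chunks):
        yield df.iloc[positions]


# 15 rows with a deliberately meaningless index, as in the example above.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
fixed_sizes = [len(c) for c in iloc_chunks(df, 10)]
even_sizes = [len(c) for c in split_into_n(df, 4)]
print(fixed_sizes, even_sizes)
```

The first helper matches the fixed-size approach described above; the second trades a fixed chunk size for a fixed chunk count, which may suit the case where the number of parallel engines is known in advance.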

2020-12-20