Binning a DataFrame with pandas in Python

一尘不染

python

Given the following pandas DataFrame:

import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

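where id is an ID for each point consisting of an a and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/mean of a and b in each bin)?
df might have NaN values for a or b (or both) in any given row. Thanks.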

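Here is a better example that uses Joe Kington's solution with a more realistic df. What I'm unsure about is how to access the df.b elements for each of the df.a groups below: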

import numpy as np
import pandas

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bin edges for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of every column in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...

2020-12-20

1 answer

一尘不染

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
# (here: the mean of only those values in each column that exceed 0.5)
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
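
As a side note on how the group labels come about, here is a minimal sketch of how np.digitize numbers its bins. The values and the choice of 5 edges are made up for illustration and are not from the original post. With the default right=False, a value equal to the last edge falls past the final interval, so the maximum ends up in a group of its own.

```python
import numpy as np

# Hypothetical sample values, chosen so the result is easy to check by hand.
values = np.array([0.0, 0.3, 0.5, 1.0])

# 5 evenly spaced edges (0, 0.25, 0.5, 0.75, 1.0) define 4 intervals.
edges = np.linspace(values.min(), values.max(), 5)

# Index i means edges[i-1] <= value < edges[i]; values >= edges[-1] get len(edges).
idx = np.digitize(values, edges)
print(idx)
```

Here 1.0 is assigned index 5 even though there are only 4 intervals, which is why grouping on np.digitize with edges running from the minimum to the maximum produces one extra group that holds just the maximum.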

Edit: Since the OP was asking specifically for just the means of b binned by the values in a, just do

groups.mean().b
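
An alternative, sketched here under assumed toy data (the values and bin edges below are hypothetical), is to select column b from the grouped object before aggregating, so that only that column is computed. The same pattern gives the mean and the median in one call.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (hypothetical values) so the result can be checked by hand.
df = pd.DataFrame({"a": [0.05, 0.15, 0.25, 0.35, 0.45, 0.55],
                   "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Bin edges chosen for illustration only.
bins = np.array([0.0, 0.2, 0.4, 0.6])

# 1-based index of the bin each value of "a" falls into.
bin_index = np.digitize(df["a"].to_numpy(), bins)

# Select "b" first, then aggregate only that column.
b_stats = df.groupby(bin_index)["b"].agg(["mean", "median"])
print(b_stats)
```

The result has one row per bin index and one column per statistic, so b_stats["mean"] should match what groups.mean().b gives for the same grouping.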

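Additionally, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)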

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
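
The question also mentions that a or b may contain NaN. The following is a minimal sketch of how pandas.cut and groupby treat such rows, using made-up data and bin edges. The include_lowest=True and observed=False arguments are choices made for this illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in each column.
df = pd.DataFrame({"a": [0.0, 0.1, np.nan, 0.6, 0.9, 1.0],
                   "b": [10.0, np.nan, 12.0, 13.0, 14.0, 15.0]})

# Two intervals; include_lowest=True keeps the value equal to the first edge.
bins = [0.0, 0.5, 1.0]
a_binned = pd.cut(df["a"], bins, include_lowest=True)

# Rows whose "a" is NaN get a NaN bin label and are left out of the groups;
# mean() then skips NaN values of "b" inside each remaining group.
means = df.groupby(a_binned, observed=False)["b"].mean()
print(means)
```

Without include_lowest=True, the intervals are open on the left, so the row holding the minimum of a would receive no bin label and would be dropped from the grouping as well.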

2020-12-20