Given the following pandas DataFrame:
    import numpy as np
    import pandas

    df = pandas.DataFrame({"a": np.random.random(100),
                           "b": np.random.random(100),
                           "id": np.arange(100)})
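where id is an identifier for each point, and each point consists of an a value and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/mean of a and b within each bin)? For any given row, df may have NaN in a or b (or both). Thanks.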
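Here is a better example that uses Joe Kington's solution with a more realistic df. What I am unsure about is how to access the df.b elements for each of the df.a groups below: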
    import numpy as np
    import pandas

    a = np.random.random(20)
    df = pandas.DataFrame({"a": a, "b": a + 10})

    # bins for df.a
    bins = np.linspace(0, 1, 10)

    # bin df according to a
    groups = df.groupby(np.digitize(df.a, bins))

    # Get the mean of a in each group
    print(groups.mean())

    ## But how to get the mean of b for each group of a?
    # ...
There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I would do it:
    import numpy as np
    import pandas

    df = pandas.DataFrame({"a": np.random.random(100),
                           "b": np.random.random(100),
                           "id": np.arange(100)})

    # Bin the data frame by "a" with 10 bins...
    bins = np.linspace(df.a.min(), df.a.max(), 10)
    groups = df.groupby(np.digitize(df.a, bins))

    # Get the mean of each bin:
    print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

    # Similarly, the median:
    print(groups.median())

    # Apply some arbitrary function to aggregate binned data
    print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
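To make the grouping key concrete, here is a small sketch of what np.digitize returns. The values and bin edges are made up for illustration. With increasing edges and the default right=False, each value x gets the index i for which bins[i-1] <= x < bins[i], so a value equal to the last edge lands in an extra group of its own. With edges running from df.a.min() to df.a.max() as above, the row holding the maximum of a therefore ends up in group 10 rather than in group 9.

```python
import numpy as np

# Hypothetical edges and values, chosen only to show the indexing rule.
bins = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
values = np.array([0.05, 0.5, 0.95, 1.0])

# Index i means bins[i-1] <= value < bins[i]; the last edge itself maps to len(bins).
indices = np.digitize(values, bins)
print(indices)  # [1 3 4 5]
```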
Edit: Since the OP was asking specifically for just the means of b binned by the values in a, just do

    groups.mean().b
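If several statistics of b are wanted at once, one option is to select the column before aggregating, which avoids computing the statistic for every column. This is a minimal sketch that assumes the same digitize-based grouping as in the question's example. The seeded generator and the name b_stats are only there for illustration.

```python
import numpy as np
import pandas

# Seeded generator so the example is reproducible (an assumption of this sketch).
rng = np.random.default_rng(0)
a = rng.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})

# Group rows by the bin index of "a".
bins = np.linspace(0, 1, 10)
groups = df.groupby(np.digitize(df.a, bins))

# Select "b" first, then compute several statistics per group of "a".
b_stats = groups["b"].agg(["mean", "median"])
print(b_stats)
```

The mean column here matches groups.mean().b. The median column is the per-bin median of b.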
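Also, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)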
    import numpy as np
    import pandas

    df = pandas.DataFrame({"a": np.random.random(100),
                           "b": np.random.random(100) + 10})

    # Bin the data frame by "a" with 10 bins...
    bins = np.linspace(df.a.min(), df.a.max(), 10)
    # Intervals are right-closed, so the row holding df.a.min() falls outside
    # the first interval unless include_lowest=True is passed to pandas.cut.
    groups = df.groupby(pandas.cut(df.a, bins))

    # Get the mean of b, binned by the values in a
    print(groups.mean().b)
This results in:
    a
    (0.00186, 0.111]    10.421839
    (0.111, 0.22]       10.427540
    (0.22, 0.33]        10.538932
    (0.33, 0.439]       10.445085
    (0.439, 0.548]      10.313612
    (0.548, 0.658]      10.319387
    (0.658, 0.767]      10.367444
    (0.767, 0.876]      10.469655
    (0.876, 0.986]      10.571008
    Name: b
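The question also mentions NaN values in a or b. The following sketch shows how the pandas.cut approach behaves in that case, using a tiny hand-made frame and arbitrary bin edges (both are assumptions for illustration, not data from the original post). pandas.cut returns NaN for a row whose a is NaN, groupby drops rows with a NaN key by default, and mean/median skip NaN values of b inside each bin.

```python
import numpy as np
import pandas

# Hand-made data with NaN in "a", in "b", and in neither.
df = pandas.DataFrame({
    "a": [0.1, 0.2, np.nan, 0.7, 0.8, 0.9],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
})

bins = [0.0, 0.5, 1.0]
binned = pandas.cut(df["a"], bins)  # NaN for the row where "a" is NaN

# observed=False keeps empty bins in the result and makes the choice explicit.
groups = df.groupby(binned, observed=False)

# "count" shows how many non-NaN b values contributed to each bin.
result = groups["b"].agg(["mean", "median", "count"])
print(result)
```

The row with a missing a does not appear in any bin, and the count column makes it visible when a bin's statistics rest on fewer rows than expected.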