I have a dataframe that looks like this:
    Out[14]:
        impwealth  indweight
    16     180000     34.200
    21     384000     37.800
    26     342000     39.715
    30    1154000     44.375
    31     421300     44.375
    32    1210000     45.295
    33    1062500     45.295
    34    1878000     46.653
    35     876000     46.653
    36     925000     53.476
I want to calculate the weighted median of the column `impwealth`, using the frequency weights in `indweight`. My pseudocode looks like this:
    # Sort `impwealth` in ascending order
    df.sort('impwealth', 'inplace'=True)

    # Find the 50th percentile weight, P
    P = df['indweight'].sum() * (.5)

    # Search for the first occurrence of `impweight` that is greater than P
    i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

    # The value of `impwealth` associated with this index will be the weighted median
    w_median = df.ix[i, 'impwealth']
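This method seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the pandas reference. What is the best way to find the weighted median?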
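If you want to do this in pure pandas, here's one way. It does not interpolate either. (@svenkatesh, your pseudocode is missing the cumulative sum: the cutoff has to be compared against the running total of `indweight`, not against the individual weights.)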
    df.sort_values('impwealth', inplace=True)
    cumsum = df.indweight.cumsum()
    cutoff = df.indweight.sum() / 2.0
    median = df.impwealth[cumsum >= cutoff].iloc[0]
For the data above, this gives a median of 925000.
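For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same cumulative-sum approach, wrapped in a helper. The function name `weighted_median` and its parameters are made up for illustration and are not part of pandas. It sorts a copy rather than sorting in place, so the original frame is left as it was.

```python
import pandas as pd


def weighted_median(df, value_col, weight_col):
    """Hypothetical helper: smallest value whose cumulative weight
    reaches half of the total weight (no interpolation)."""
    # sort_values returns a sorted copy; the caller's frame is untouched.
    ordered = df.sort_values(value_col)
    cumulative = ordered[weight_col].cumsum()
    cutoff = ordered[weight_col].sum() / 2.0
    # First row, in sorted order, where the running weight reaches the cutoff.
    return ordered.loc[cumulative >= cutoff, value_col].iloc[0]


# The data from the question, with its original index labels.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

result = weighted_median(df, "impwealth", "indweight")
print(result)
```

With the question's data the total weight is 437.837, so the cutoff is 218.9185, and the running weight first reaches it at the row where `impwealth` is 925000.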