一尘不染

Python: weighted median algorithm with pandas

algorithm

I have a DataFrame that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column `impwealth`, using the frequency weights in `indweight`. My pseudocode looks like this:

# Sort `impwealth` in ascending order 
df.sort('impwealth', 'inplace'=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `impweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']

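This approach seems clunky, and I'm not sure it is correct. I couldn't find a built-in method for this in the pandas reference. What is the best way to find a weighted median?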


2020-07-28

1 answer

一尘不染

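If you want to do this in pure pandas, here is one way. It does not interpolate either. (@svenkatesh, your pseudocode was missing the cumulative sum: the cutoff has to be compared against the running total of the weights, not against each individual weight.)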

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

For the sample data above, the median is 925000.
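
To try this end to end, here is a minimal self-contained sketch that wraps the same steps in a function and rebuilds the sample frame from the question. The function name `weighted_median` and its parameters are illustrative and not part of pandas. Unlike the snippet above, it sorts a copy rather than modifying `df` in place.

```python
import pandas as pd


def weighted_median(df, value_col, weight_col):
    """Return the first value whose cumulative weight reaches half the total.

    Hypothetical helper (not a pandas API); no interpolation is performed.
    """
    ordered = df.sort_values(value_col)        # sorted copy; df is left untouched
    cumsum = ordered[weight_col].cumsum()      # running total of the weights
    cutoff = ordered[weight_col].sum() / 2.0   # half of the total weight
    return ordered.loc[cumsum >= cutoff, value_col].iloc[0]


# Sample data copied from the question
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

w_median = weighted_median(df, "impwealth", "indweight")
print(w_median)
```

Half of the total weight here is about 218.9, and the running total first reaches it at the row with `impwealth` 925000, which matches the result above.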

2020-07-28