I work with a DataFrame that has two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+
I would like to obtain two lists, one containing the mvv values and one containing the count values. Something like:
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code. The first line should return a Python list of Row objects. I wanted to see the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But I get an error message on the second line:
AttributeError: getInt
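Let's look at why the way you are doing it does not work. First, you are trying to get an integer from a Row type. The output of collect looks like this:
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)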
If you access the field by name instead, like this:
>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
you will get the mvv value. If you want all the mvv values as a list, you can do this:
>>> mvv_array = [int(row.mvv) for row in mvv_count_df.select('mvv').collect()]
>>> mvv_array
Out: [1,2,3,4]
But if you try the same thing for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
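This happens because count is a built-in method of Row (a Row is a tuple, and tuples have a count method), and the column has the same name, count. So row.count returns the method rather than the column value. One workaround is to rename the column from count to _count: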
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
But this workaround is not needed, because you can access the column using dictionary syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
And it will finally work!
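If you want to see the name clash without starting Spark, here is a minimal sketch in plain Python. DemoRow is a hypothetical stand-in written for this illustration, not the real pyspark.sql.Row class. It only assumes what the answer relies on: that a row behaves like a tuple with named fields, and that attribute lookup falls back to the field names only when normal lookup finds nothing.

```python
class DemoRow(tuple):
    """Hypothetical stand-in for a Spark Row: a tuple with named fields."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields.keys())
        return row

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so names that already exist on tuple (such as count) never get here.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[self._names.index(name)]
        except ValueError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # Dictionary-style access looks the field up by name directly.
        if isinstance(key, str):
            return super().__getitem__(self._names.index(key))
        return super().__getitem__(key)


row = DemoRow(mvv=1, count=5)
print(row.mvv)              # 1
print(callable(row.count))  # True: this is tuple.count, not the column
print(row["count"])         # 5
```

Attribute access works for mvv because tuple has no attribute with that name, while count is found on tuple first. Dictionary-style access never goes through attribute lookup, which is why it is not affected by the clash.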