
Finding only the top-level posts in Disqus using R / Selenium

selenium

First of all, I apologize for the length of this post; I wanted to give enough detail.

I am trying to refine a scraping application I have written in R to get Disqus comments. So far I can get all the comments on a particular page using various RSelenium functions. What I would like to do now is derive some kind of tree structure from the posted comments, i.e. get the top-level comments first and then check whether any of them have children. The particular page I am using as an example has 34 comments in total, but only 18 of them are top-level. The rest are children, or children of children.

What I am doing is opening a page in Chrome and creating a web driver, and I use Selector Gadget to find the right selectors, as follows:

1. elem <- remDr$findElement(using = "id", value = "posts")
2. elem.posts <- elem$findChildElements(using = "id", value = "post-list")
3. elem.posts <- elem$findElements(using = 'css selector', value = '.post~ .post+ .post')

In the code above, line 1 finds the posts section. If I then use line 2, I get all the posts on the page, and I then use the line below to find all the messages, so if there are 34 comments on the page I get all of them.

elem.msgs <- elem.posts[[1]]$findChildElements(using = 'css selector', '.post-message')
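For instance, something along these lines gives me the text of every message (a rough sketch of what I do with `elem.msgs`, the list of message elements found above):

```r
# getElementText() returns a one-item list, hence the [[1]]
msg.text <- sapply(elem.msgs, function(m) m$getElementText()[[1]])
length(msg.text)  # 34 on the example page
```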

I have now realized that the "tree" structure of the comments may matter for my data project, and I am trying to get the top-level comments first and then look inside each top-level comment for any children. The example web page is here. To get the comments I used lines 1 and 3 above, and the result is a list of 16; if I use elem.posts[[1]]$getElementAttribute("id") I get the post id, which I can later use to look up each top comment.
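Collecting the id of every element in `elem.posts` looks roughly like this (just a sketch of what I mean, not a full solution):

```r
# getElementAttribute() also returns a one-item list, hence the [[1]]
post.ids <- sapply(elem.posts, function(p) p$getElementAttribute("id")[[1]])
length(post.ids)  # 16 here, where I would expect 18
```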

This list of 16 should be 18, and I do not understand why the first two comments are not captured in the list. The same thing happens on other pages, where a number of the top-level comments are not captured in the list either.

My questions are: what can I try so that I get all the top-level comments on a page without any of them being missed? And is there a better way to get the top-level comments without jumping through hoops I have little experience with?
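For instance, I wonder whether a direct-child selector would be more reliable than the sibling combinators in line 3 above (an untested guess on my part, assuming the top-level posts are the direct `li.post` children of `ul#post-list`):

```r
# untested guess: take only the li.post elements that sit directly under ul#post-list
elem.top <- remDr$findElements(using = "css selector", value = "#post-list > li.post")
```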

Thanks for any help or guidance.



1 Answer


You can use a recursive function to walk down the posts. You only need RSelenium to get the page source:

library(xml2)
library(RSelenium)
library(jsonlite)
selServ <- startServer()
appURL <- "http://disqus.com/embed/comments/?base=default&version=90aeb3a56d1f2d3db731af14996f11cf&f=malta-today&t_i=article_67726&t_u=http%3A%2F%2Fwww.maltatoday.com.mt%2Fnews%2Fnational%2F67726%2Fair_malta_pilots_demands_30_basic_salary_increase&t_d=Air%20Malta%20pilots%E2%80%99%20demands%3A%2030%25%20basic%20salary%20increase%2C%20increased%20duty%20payments%2C%20double%20%E2%80%98denied%20leave%E2%80%99%20payment&t_t=Air%20Malta%20pilots%E2%80%99%20demands%3A%2030%25%20basic%20salary%20increase%2C%20increased%20duty%20payments%2C%20double%20%E2%80%98denied%20leave%E2%80%99%20payment&s_o=default"
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
pgSource <- remDr$getPageSource()[[1]]
remDr$close()
selServ$stop()
doc <- read_html(pgSource)
appNodes <- xml_find_all(doc, "//ul[@id='post-list']/li[@class='post']")
# recursive function to extract a post's content and recurse into its children
content_fun <- function(x){
  main <- xml_find_all(x, "./div[@data-role]/.//div[@class='post-body']")
  main <- list(
    poster = xml_text(xml_find_all(main, ".//span[@class = 'post-byline']")),
    posted = xml_text(xml_find_all(main, ".//span[@class = 'post-meta']")),
    date = xml_attr(xml_find_all(main, ".//a[@class = 'time-ago']"), "title"),
    message = xml_text(xml_find_all(main, ".//div[@data-role = 'message']"))
  )
  # check for children
  children <- xml_find_all(x, "./ul[@class='children']/li[@class='post']")
  if(length(children) > 0){
    main$children <- lapply(children, content_fun)
  }
  main
}

postData <- lapply(appNodes, content_fun)

As an example, here is the third post:

> prettify(toJSON(postData[[3]]))
{
    "poster": [
        "\nMary Attard\n\n"
    ],
    "posted": [
        "\n•\n\n\na month ago\n\n"
    ],
    "date": [
        "Thursday, July 21, 2016 6:12 AM"
    ],
    "message": [
        "\nI give up. Air Malta should be closed down.\n"
    ],
    "children": [
        {
            "poster": [
                "\nJoseph Lawrence\n\n Mary Attard\n"
            ],
            "posted": [
                "\n•\n\n\na month ago\n\n"
            ],
            "date": [
                "Thursday, July 21, 2016 7:43 AM"
            ],
            "message": [
                "\nAir Malta should have been privatized or sold out right a long time ago. It is costing the TAX PAYER millions, it has for a long, long time.\n"
            ]
        },
        {
            "poster": [
                "\nJ.Borg\n\n Mary Attard\n"
            ],
            "posted": [
                "\n•\n\n\na month ago\n\n"
            ],
            "date": [
                "Thursday, July 21, 2016 5:23 PM"
            ],
            "message": [
                "\nYes - at this stage we taxpayers will be better off without Air Malta. We closed Malta Dry Docks and we survived. We can close Air Malta and we'll survive even better. After all, we have many more airlines serving us.\n"
            ]
        }
    ]
}

From there you can clean up and extract whatever content you need.
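For example, one possible cleanup step (an illustrative sketch, not part of the scrape above) is to flatten the nested list into a data frame, keeping track of how deep each reply sits in the thread:

```r
flatten_posts <- function(posts, depth = 0) {
  do.call(rbind, lapply(posts, function(p) {
    row <- data.frame(
      poster  = trimws(p$poster),
      date    = p$date,
      message = trimws(p$message),
      depth   = depth,   # 0 = top-level comment, 1 = reply, 2 = reply to a reply, ...
      stringsAsFactors = FALSE
    )
    if (!is.null(p$children)) {
      row <- rbind(row, flatten_posts(p$children, depth + 1))
    }
    row
  }))
}

comments_df <- flatten_posts(postData)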
