Scraping Douban Movies with rvest and Analyzing the Ratings

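I'll write two posts today so I don't have to sort things out again in a few days. I also reviewed a lot of web scraping today, and R feels much nicer to use for this than Python.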

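Analyzing the relationship between Douban movie rank and release year

Let's start with the simplest case!

Start scraping!

The question here: is a movie's rank on the Douban Top 250 related to its release year?

Without further ado, here is the code: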

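library(rvest)
library(ggplot2)  # needed for the plots at the end

# The Top 250 list is split into pages of 25 movies: start = 0, 25, ..., 225
htmllist = paste("https://movie.douban.com/top250?start=",seq(0,225,25),"&filter=", sep = "")
count = 1  # rank of the first movie on the current page
movie <- data.frame()
for(i in htmllist){
  # read_html() is the current replacement for the deprecated html()
  web = read_html(i,encoding="UTF-8")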
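  # Rating shown for each movie on the page
  score =  web %>% html_nodes(".rating_num") %>% html_text()

  # Info paragraph per movie: "导演: ... 主演: ..." (director, starring), then year / country / genre
  year =  web %>% html_nodes(".bd p:nth-child(1)") %>% html_text()

  # Match positions of the 4-digit year and of the 导演 / 主演 labels
  gy = gregexpr('[0-9]{4}',year)
  gd = gregexpr("导演",year)
  gz = gregexpr("主演",year)

  # substr() uses only the first match position of each element.
  # The first element is dropped after each step, presumably because the
  # selector also matches one node that is not a movie entry.
  time = sapply(seq_along(gy),function(k) substr(year[k],gy[[k]],gy[[k]]+attr(gy[[k]],'match.length')-1))
  time = time[2:length(time)]
  # Director: text between the 导演 label and the 主演 label (offsets skip label and separators)
  direct = sapply(seq_along(gy),function(k) substr(year[k],gd[[k]] + 4,gz[[k]] - 4))
  direct = direct[2:length(direct)]
  # Actors: text between the 主演 label and the year, then cut at the first line break
  actor = sapply(seq_along(gy),function(k) substr(year[k],gz[[k]] + 4,gy[[k]] - 4))
  actor = actor[2:length(actor)]
  ga = gregexpr('\n',actor)
  actor = sapply(1:25,function(k) substr(actor[k],1,ga[[k]] - 4))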

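  # Primary title of each movie
  names =   web %>% html_nodes(".title:nth-child(1)") %>% html_text()

  # Number of ratings: keep the spans containing "人评价" and strip that suffix
  rates =  html_text(web %>% html_nodes(".rating_num~ span"))
  rates = rates[grep("人评价",rates)]
  rates = gsub("人评价","",rates)

  # as.Date() with only %Y fills in the current month and day,
  # so only the year part of `time` is meaningful
  time <- as.Date(time, "%Y")
  score <- as.numeric(score)
  rates <- as.numeric(rates)
  # Append this page; ranks run from count to count + 24
  movie <- rbind(movie, data.frame(names, time, score, rates, actor, direct, rank = count:(count+24)))
  count <- count+25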
}
# Rank vs. release year: default smoother, then a linear fit
ggplot(movie, aes(time, rank)) + geom_point() + geom_smooth() + theme_bw(base_family = "Times")
ggplot(movie, aes(time, rank)) + geom_point() + geom_smooth(method = 'lm') + theme_bw(base_family = "Times")
# Score vs. release year, rank vs. score, number of ratings vs. rank
ggplot(movie, aes(time, score)) + geom_point() + geom_smooth(method = "lm") + theme_bw(base_family = "Times")
ggplot(movie, aes(score, rank)) + geom_point() + geom_smooth(method = "lm") + theme_bw(base_family = "Times")
ggplot(movie, aes(rank, rates)) + geom_point() + geom_smooth(method = "lm") + theme_bw(base_family = "Times")
head(movie)
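
The year extraction in the script relies on match positions and substr(). As a rough sketch of the same idea in a more compact form, base R's regexpr() and regmatches() can pull out the first 4-digit run directly. The strings in `info` are made-up samples in the shape of the page text, not data fetched from Douban, and the variable names are only for illustration.

```r
# Hypothetical sample strings shaped like one movie's info paragraph
# (illustrative only, not scraped from the site)
info <- c(
  "导演: A Director   主演: An Actor / Another Actor\n 1994 / 美国 / 犯罪 剧情",
  "导演: B Director   主演: Some Actor\n 2010 / 英国 / 科幻"
)

# regexpr() locates the first 4-digit run in each string;
# regmatches() returns the matched substrings
year <- as.integer(regmatches(info, regexpr("[0-9]{4}", info)))
year
```

One caveat with this form: regmatches() silently drops elements that have no match, so the result can be shorter than the input if some entry has no year.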

This crawls 10 pages (start offsets 0 to 225, 25 movies per page), which covers the full Top 250.
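
To check how the page offsets map to ranks, here is a small sketch that rebuilds the same seq(0, 225, 25) sequence used for the URLs and lists the rank range each page should hold. The data frame `pages` and its column names are illustrative and not part of the original script.

```r
# Start offsets used in the page URLs
starts <- seq(0, 225, 25)

# Each page holds 25 movies, so page k covers ranks start + 1 .. start + 25
pages <- data.frame(
  start      = starts,
  first_rank = starts + 1,
  last_rank  = starts + 25
)

nrow(pages)           # number of pages requested
max(pages$last_rank)  # highest rank covered
```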