Web Crawler Study Notes (1): A First Scrape of the Douban Movie Chart

2023-12-18 17:07:40

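I. Tasks

1. Scrape the detailed information for the movie ranked first on the Douban chart.

2. Scrape the detailed information for every movie on Douban's recent hot list.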

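II. Task Description

1. URL: https://maoyan.com/board

2. Use the urlopen function from the request module of the urllib library to send the request and fetch the page. Once the page is retrieved, use the BeautifulSoup library to locate the HTML tags that hold the needed information (tags are located with BeautifulSoup's find and findAll functions).

3. Handle exceptions.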
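
The fetch-and-handle-errors steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script from this post: the helper name fetch_html, the DEFAULT_HEADERS value, and the timeout parameter are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

# Assumed header: a generic browser-like User-Agent; adjust as needed.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Return the raw response body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper name used only in this sketch.
    """
    request = Request(url, headers=DEFAULT_HEADERS)
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status (e.g. 403 or 404).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code)
    except URLError as e:
        # The server could not be reached (DNS failure, refused connection, ...).
        print("The server could not be found:", e.reason)
    return None
```

Calling fetch_html("https://maoyan.com/board") would then return the page bytes on success, which can be passed straight to BeautifulSoup, or None if either exception was raised.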

III. Libraries and Modules Used

1. The request module of the urllib library

2. The find and findAll functions of the BeautifulSoup library
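
To make the difference between the two lookup functions concrete: find returns only the first matching tag, while find_all (findAll is the older spelling of the same method) returns every match. The sketch below runs on a small HTML fragment made up for illustration. It reuses the class names from the script in this post, but the dd wrapper, the sample values, and the helper name parse_movies are assumptions, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example only.
SAMPLE_HTML = """
<dl>
  <dd>
    <p class="name"><a>Movie A</a></p>
    <p class="star">Actor One</p>
    <p class="releasetime">2021-01-01</p>
    <i class="integer">9.</i><i class="fraction">5</i>
  </dd>
  <dd>
    <p class="name"><a>Movie B</a></p>
    <p class="star">Actor Two</p>
    <p class="releasetime">2022-02-02</p>
    <i class="integer">8.</i><i class="fraction">7</i>
  </dd>
</dl>
"""


def text_of(tag, name, css_class):
    """Return the stripped text of the first matching child tag."""
    return tag.find(name, {"class": css_class}).get_text(strip=True)


def parse_movies(html):
    """Collect one dict per <dd> entry (assumed to wrap a single movie)."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find_all("dd"):
        movies.append({
            "name": text_of(item, "p", "name"),
            "star": text_of(item, "p", "star"),
            "releasetime": text_of(item, "p", "releasetime"),
            "score": text_of(item, "i", "integer") + text_of(item, "i", "fraction"),
        })
    return movies


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# find: first match only.
first_name = soup.find("p", {"class": "name"}).get_text(strip=True)
# find_all: every match, in document order.
all_names = [p.get_text(strip=True) for p in soup.find_all("p", {"class": "name"})]
```

The first task only needs find, since the top-ranked movie is the first match on the page. The second task is a natural fit for find_all, looping over each entry and calling find inside it.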

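IV. Results and Notes

1. Note: the output is the detailed information for the first movie on the list, 《1950他们正年轻》.

2. Note: the output covers the movies on the recent hot list.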

V. Source Code

1.

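from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

url = "https://maoyan.com/board"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
}

try:
    # Send the request with a browser-like User-Agent and parse the response
    resp = urlopen(Request(url, headers=headers))
    soup = BeautifulSoup(resp.read(), 'html.parser')

    # Extract each field as a plain string; find returns the first match,
    # which is the top-ranked movie
    # Movie title
    name1 = soup.find('p', {'class': 'name'}).get_text(strip=True)
    # Starring
    stars = soup.find('p', {'class': 'star'}).get_text(strip=True)
    # Release date
    releasetime1 = soup.find('p', {'class': 'releasetime'}).get_text(strip=True)
    # Score: integer part plus fractional part
    score = (soup.find('i', {'class': 'integer'}).get_text(strip=True)
             + soup.find('i', {'class': 'fraction'}).get_text(strip=True))

    print("Movie: " + name1)
    print("Starring: " + stars)
    print("Release date: " + releasetime1)
    print("Score: " + score)
except HTTPError as e:
    # The server responded with an error status code
    print(e)
except URLError as e:
    # The server could not be reached
    print('The server could not be found')
except AttributeError:
    # find returned None: an expected tag is missing from the page
    print('An expected tag was not found')
else:
    print('It Worked!')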

2.