Crawling Web Page Data

2023-09-23 13:50:45

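Learning Objectives

Understand what the urllib library is and use it to fetch a web page quickly
Know how to URL-encode data and transfer it with both GET and POST requests
Know why a crawler disguises itself as a browser, and send requests that carry specific headers
Know how to build a custom opener and configure a proxy server
Understand server timeouts and set how long to wait for a server response
Be familiar with common network exceptions, and catch and handle them appropriately
Be able to use the requests library and appreciate how much more user-friendly it is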

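Overview of the urllib Library

urllib is Python's built-in HTTP request library. It can be thought of as a collection of components for working with URLs.
The urllib library consists of four modules:

urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module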
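
To make the division of labour between the modules more concrete, here is a minimal offline sketch that touches urllib.parse and urllib.robotparser without sending any request. The robots.txt rules and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components
parts = urlparse('http://www.baidu.com/s?wd=python')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.robotparser: evaluate robots.txt rules
# (hypothetical rules, fed in as lines instead of being downloaded)
rules = RobotFileParser()
rules.parse([
    'User-agent: *',
    'Disallow: /private/',
])
allowed_public = rules.can_fetch('*', 'http://example.com/index.html')
allowed_private = rules.can_fetch('*', 'http://example.com/private/page.html')
print(allowed_public, allowed_private)
```

urllib.request and urllib.error are the two modules that actually involve network traffic: one sends the request, the other describes what went wrong.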

Crawling Web Pages with urllib

Fetching a Web Page Quickly

import urllib.request

# Call urlopen() from urllib.request, passing in a URL
response = urllib.request.urlopen('http://www.baidu.com')
# Read the fetched page content with read() and decode it as UTF-8
html = response.read().decode('utf-8')

print(html)

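Using the HTTPResponse Object

The HTTPResponse class belongs to the http.client module. It provides methods for retrieving the URL, the status code, the response content, and more.
Common methods:

geturl(): returns the URL of the response; useful for checking whether the HTTP request was redirected
info(): returns the page's meta-information (the response headers)
getcode(): returns the HTTP status code of the response

import urllib.request

response = urllib.request.urlopen('http://python.org')
# Get the URL the response came from
print(response.geturl())
# Get the status code
print(response.getcode())
# Get the page's meta-information
print(response.info())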
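
The redirect check that geturl() enables can be tried without touching the public internet. The sketch below is only an illustration under stated assumptions: DemoHandler, the /old and /new paths, and the local server are all hypothetical, and the opener with an empty ProxyHandler is there solely so that the local request is not routed through a system proxy.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old redirects to /new."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'hello'
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

# Empty ProxyHandler: connect directly, ignoring any system proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(base + '/old', timeout=5) as response:
    final_url = response.geturl()      # URL after the redirect was followed
    status = response.getcode()        # status of the final response
    content_type = response.info().get('Content-Type')

server.shutdown()
server.server_close()
print(final_url, status, content_type)
```

Because the redirect is followed automatically, comparing geturl() with the URL that was requested is how the redirect becomes visible.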

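Constructing a Request Object

Example 1:

import urllib.request

# Pass the URL to Request() to construct and return a Request object
request = urllib.request.Request('http://www.baidu.com')
# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib.request.urlopen(request)
# Read the fetched page content with read()
html = response.read().decode('utf-8')

print(html)

Example 2:

import urllib.request
import urllib.parse

url = 'http://www.itcast.com'
# The Host header is left out: urllib derives it from the URL, and a Host
# value that does not match the URL (the original used 'httpbin.org') can misroute the request
header = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
}
dict_demo = {'name': 'itcast'}
# The request body must be bytes: URL-encode the dict, then encode it as UTF-8
data = urllib.parse.urlencode(dict_demo).encode('utf-8')
# Pass the URL, data, and headers to Request(); supplying data makes this a POST request
request = urllib.request.Request(url=url, data=data, headers=header)
# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib.request.urlopen(request)
# Read the fetched page content with read()
html = response.read().decode('utf-8')

print(html)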
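
A Request object can also be inspected before anything is sent, which makes the effect of the data and headers arguments easier to see. The following is a minimal sketch; the httpbin.org/post URL is only a placeholder and no request is actually made.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'  # placeholder endpoint; nothing is sent below
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'}
data = urllib.parse.urlencode({'name': 'itcast'}).encode('utf-8')

get_request = urllib.request.Request(url, headers=headers)
post_request = urllib.request.Request(url, data=data, headers=headers)

# Without data the request is a GET; supplying data turns it into a POST
print(get_request.get_method())
print(post_request.get_method())
# Request stores header names capitalized, so the lookup key is 'User-agent'
print(post_request.get_header('User-agent'))
# The body is the URL-encoded form as bytes
print(post_request.data)
```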

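Transferring Data with urllib

URL Encoding with urllib

import urllib.parse
data = {
    'a':'你好',
    'b':'中国'
}
# urlencode() turns the dict into a percent-encoded query string
result = urllib.parse.urlencode(data)
print(result)

Conversely, unquote() decodes a percent-encoded string:

import urllib.parse

result1 = urllib.parse.unquote('a=%E4%BD%A0%E5%A5%BD')
result2 = urllib.parse.unquote('b=%E4%B8%AD%E5%9B%BD')
print(result1)
print(result2)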
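
The encoding and decoding functions in urllib.parse pair up, and a round trip shows how they relate. This is a small sketch using the same sample values as above; quote() and parse_qs() do not appear in the original text and are introduced here only for comparison.

```python
from urllib.parse import urlencode, quote, unquote, parse_qs

params = {'a': '你好', 'b': '中国'}

# urlencode() handles a whole dict of parameters
encoded = urlencode(params)
print(encoded)

# quote() percent-encodes a single string; unquote() reverses it
single = quote('你好')
print(single, unquote(single))

# parse_qs() decodes a whole query string back into a dict of lists
decoded = parse_qs(encoded)
print(decoded)
```

parse_qs() returns lists because a key may legitimately appear more than once in a query string.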

Handling GET Requests

import urllib.request
import urllib.parse

# Baidu's search endpoint; the query string is appended below
url = 'http://www.baidu.com/s'
word = {'wd':'传智播客'}
# Convert the parameters to URL-encoded form
word = urllib.parse.urlencode(word)
new_url = url+'?'+word
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
request = urllib.request.Request(new_url,headers=headers)
# Send the Request object (the original passed the string 'UTF-8' here by mistake)
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
print(html)
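
Concatenating with '?' works as long as the base URL has no query string yet. A slightly more defensive way to build the GET URL is sketched below; add_query is a hypothetical helper written for this illustration, not part of urllib, and the ie=utf-8 parameter is just an example of a pre-existing parameter.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_query(url, params):
    """Return url with params merged into its query string (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep whatever parameters the URL already carries
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))

plain = add_query('http://www.baidu.com/s', {'wd': '传智播客'})
existing = add_query('http://www.baidu.com/s?ie=utf-8', {'wd': '传智播客'})
print(plain)
print(existing)
```

The result can be passed to urllib.request.Request() exactly like the hand-built new_url.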

Handling POST Requests

import urllib.request
import urllib.parse

# Target URL of the POST request
url = "https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
# Open Fiddler's request panel and click the WebForms tab to view the form data
formdata = {
    'i': 'hello,china!',
    "from": "AUTO",
    "to": "AUTO",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME"
}
# URL-encode the form and convert it to bytes, as required for the request body
data = urllib.parse.urlencode(formdata).encode('utf-8')
# Supplying data makes urllib send the request as a POST
request = urllib.request.Request(url,data=data,headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
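
To see what the server actually receives from such a POST, the form can be sent to a throwaway local server that echoes it back. This is a sketch under stated assumptions: EchoHandler, the /translate path, and the local server are hypothetical stand-ins, only two of the form fields are used, and the empty ProxyHandler is there just to keep the request off any system proxy.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical server that decodes a POSTed form and returns it as JSON."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode('utf-8'))
        body = json.dumps(form).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/translate' % server.server_port

formdata = {'i': 'hello,china!', 'doctype': 'json'}
data = urllib.parse.urlencode(formdata).encode('utf-8')
request = urllib.request.Request(url, data=data)

# Empty ProxyHandler: connect directly, ignoring any system proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(request, timeout=5) as response:
    echoed = json.loads(response.read().decode('utf-8'))

server.shutdown()
server.server_close()
print(echoed)
```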

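Proxy Servers

A Simple Custom Opener

An opener is an instance of the urllib.request.OpenerDirector class. The urlopen() function used so far relies on a default opener that the module builds for you, but it offers no way to plug in a specific proxy, cookie handling, or other advanced HTTP/HTTPS features.

Creating a custom opener takes the following steps:

Create handler objects for the features you need, using the corresponding Handler classes.
Pass those handler objects to urllib.request.build_opener() to create a custom opener object.
Call the open() method of the custom opener to send the request.

import urllib.request
# Build an HTTPHandler object, which supports HTTP requests
http_handler = urllib.request.HTTPHandler()
# Call urllib.request.build_opener() to create an opener that uses this handler
opener = urllib.request.build_opener(http_handler)
# Build the Request
request = urllib.request.Request('http://www.baidu.com')
# Call the custom opener's open() method to send the request
response = opener.open(request)
print(response.read())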
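
The same three steps apply to other handlers. As an illustration, the sketch below builds an opener that keeps cookies between requests; HTTPCookieProcessor and http.cookiejar are part of the standard library but are not covered in the original text, and the User-Agent string is made up. No request is sent, the opener is only inspected.

```python
import http.cookiejar
import urllib.request

# Step 1: create handler objects for the features you need
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# Step 2: build an opener from those handlers
opener = urllib.request.build_opener(cookie_handler)
# Default headers sent with every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (demo crawler)')]

# Step 3 would be opener.open(url); here we only look at what was assembled
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(handler_names)
print(len(cookie_jar))  # no cookies yet, since nothing has been fetched
```

build_opener() adds the default handlers (such as HTTPHandler) automatically, so only the extra ones need to be passed in.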

Setting a Proxy Server

import urllib.request
# Build two proxy handlers: one with a proxy IP and one without
# The proxy address uses the form host:port
httpproxy_handler = urllib.request.ProxyHandler({"http":"124.88.67.81:80"})
nullproxy_handler = urllib.request.ProxyHandler({})
proxy_switch = True
# Use urllib.request.build_opener() to create a custom opener from a proxy handler
# Choose the proxy mode depending on whether the proxy switch is on
if proxy_switch:
    opener = urllib.request.build_opener(httpproxy_handler)
else:
    opener = urllib.request.build_opener(nullproxy_handler)
request = urllib.request.Request("http://www.baidu.com")
response = opener.open(request)
print(response.read())
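
Free proxies tend to be short-lived, so crawlers often pick one from a list rather than hard-coding a single address. The following is a minimal sketch of that idea; PROXY_POOL, build_proxy_opener, and the 192.0.2.x addresses (a range reserved for documentation) are all hypothetical, and nothing is sent over the network.

```python
import random
import urllib.request

# Hypothetical proxy addresses in host:port form (documentation-only IP range)
PROXY_POOL = ['192.0.2.10:8080', '192.0.2.11:3128']

def build_proxy_opener(proxy_address=None):
    """Build an opener that uses proxy_address, or no proxy when it is None."""
    if proxy_address is None:
        handler = urllib.request.ProxyHandler({})
    else:
        handler = urllib.request.ProxyHandler({
            'http': 'http://' + proxy_address,
            'https': 'http://' + proxy_address,
        })
    return urllib.request.build_opener(handler), handler

chosen = random.choice(PROXY_POOL)
proxy_opener, proxy_handler = build_proxy_opener(chosen)
direct_opener, direct_handler = build_proxy_opener()

# The mapping each handler will apply, keyed by URL scheme
print(proxy_handler.proxies)
print(direct_handler.proxies)
```

Calling proxy_opener.open(url) would then route the request through the chosen proxy, while direct_opener connects without one.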

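Setting a Timeout

import urllib.request
try:
    url = 'http://218.56.132.157:8080'
    # Wait at most 1 second for the server to respond
    response = urllib.request.urlopen(url,timeout=1)
    result = response.read()
    print(result)
except Exception as error:
    # A timeout, like any other failure, ends up here
    print(error)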
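
The effect of the timeout argument can be reproduced locally. In the sketch below, a socket that accepts connections but never answers stands in for an unresponsive server; the listener, the 0.5 second limit, and the direct-connection opener are assumptions made for this illustration.

```python
import socket
import time
import urllib.error
import urllib.request

# A local socket that completes the TCP connection but never replies,
# standing in for a server that is too slow to respond
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
url = 'http://127.0.0.1:%d/' % listener.getsockname()[1]

# Empty ProxyHandler: connect directly, ignoring any system proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

timed_out = False
started = time.monotonic()
try:
    opener.open(url, timeout=0.5)
except (socket.timeout, urllib.error.URLError) as error:
    # Waiting for the response typically raises socket.timeout directly,
    # while a timeout during connection setup arrives wrapped in URLError
    timed_out = True
    print('gave up:', error)
finally:
    listener.close()

elapsed = time.monotonic() - started
print('waited about', round(elapsed, 1), 'seconds')
```

Catching these specific exception types instead of a bare Exception keeps unrelated bugs from being mistaken for timeouts.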

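Common Network Exceptions

Catching URLError

Causes

No network connection
The connection to the server failed
The specified server could not be found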

import urllib.request
import urllib.error
request = urllib.request.Request("http://www.ajkfhaajfgj.com")
try:
    urllib.request.urlopen(request,timeout=5)
except urllib.error.URLError as err:
    print(err)
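
A URLError carries the underlying cause in its reason attribute, which helps tell the failure modes apart. The sketch below provokes a "connection failed" case locally by requesting a port where nothing is listening; the port-probing trick and the direct-connection opener are assumptions of this illustration.

```python
import socket
import urllib.error
import urllib.request

# Find a local port with nothing listening on it: bind, note the port, release it
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(('127.0.0.1', 0))
closed_port = probe.getsockname()[1]
probe.close()

# Empty ProxyHandler: connect directly, ignoring any system proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

failure_reason = None
try:
    opener.open('http://127.0.0.1:%d/' % closed_port, timeout=5)
except urllib.error.URLError as err:
    # err.reason holds the underlying cause, usually an OSError subclass
    failure_reason = err.reason
    print(type(err.reason).__name__, err.reason)
```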

Catching HTTPError

import urllib.request
import urllib.error
request = urllib.request.Request("http://www.itcast.cn/net")
try:
    urllib.request.urlopen(request)
# Catch HTTPError specifically: a plain URLError is not guaranteed to have a code attribute
except urllib.error.HTTPError as e:
    print(e.code)
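
HTTPError is a subclass of URLError, so when both are handled the HTTPError clause has to come first. The sketch below shows that ordering against a throwaway local server; NotFoundHandler, the /net path on localhost, and the direct-connection opener are hypothetical pieces added for the illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/net' % server.server_port

# Empty ProxyHandler: connect directly, ignoring any system proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

status = None
try:
    opener.open(url, timeout=5)
except urllib.error.HTTPError as e:
    # The server answered, but with an error status
    status = e.code
    print(e.code, e.reason)
except urllib.error.URLError as e:
    # No HTTP response at all (DNS failure, refused connection, ...)
    print('no HTTP response:', e.reason)
finally:
    server.shutdown()
    server.server_close()
```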

The More User-Friendly requests Library

Crawling a Web Page with a GET Request Using urllib

import urllib.request
import urllib.parse
# The request URL path and query parameters
url="http://www.baidu.com/s"
word={"wd":"传智播客"}
# Convert the parameters to URL-encoded form
word=urllib.parse.urlencode(word)
# Concatenate the full URL
new_url=url+"?"+word
# Request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
# Build the request from the URL and headers
request=urllib.request.Request(new_url,headers=headers)
# Send the request and receive the file-like object returned by the server
response=urllib.request.urlopen(request)
# Read the fetched page content with read() and decode it as UTF-8
html=response.read().decode('UTF-8')
print(html)

Crawling a Web Page with a GET Request Using requests

# Import the requests library
import requests
# The request URL path and query parameters
url="http://www.baidu.com/s"
param={"wd":"传智播客"}
# Request headers
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

# Send a GET request; a response object is returned
response=requests.get(url,params=param,headers=headers)
print(response.text)

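Conveniences of the requests Library

There is no need to URL-encode the query parameters and concatenate the full URL by hand
There is no need to repeatedly convert encodings for Chinese text
The name of the function that sends the request makes the HTTP method obvious at a glance
urlopen() returns a file-like object whose content must be read with read(), whereas get() returns a response object whose text attribute exposes the response body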
