A simple image crawler built with Python's requests library

When I first touched Python I heard you could write code to scrape images, and I wanted to give it a try, so I wrote a simple demo that grabs images from Baidu Images, using the requests library.

Why not use the built-in urllib? Honestly, I don't know which of urllib and requests is better, or whether they can even be compared, since they suit different situations and different people have different preferences. I simply ran into a requests tutorial first, went and read its documentation, and ended up using it.
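
Just for reference, here is the same GET request written with both libraries (a throwaway sketch; httpbin.org is only a stand-in endpoint):

import urllib.request
import requests

# Standard library: assemble the query string yourself and decode the raw bytes
with urllib.request.urlopen('http://httpbin.org/get?word=cat') as resp:
    text1 = resp.read().decode('utf-8')

# requests: pass a params dict and get decoded text back
text2 = requests.get('http://httpbin.org/get', params={'word': 'cat'}).text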

The basic steps are as follows:

  1. Send a GET request with a data payload to image.baidu.com, including the search keyword
  2. After receiving the response, filter out the image URLs with a regular expression and store them in a list
  3. Iterate over the image URL list, request each image URL, and write each response to a file in binary mode

The code is as follows:

import requests as rq
import re
import os

# Get response
keyword = input('Input keyword to search: ')
print('Busy...')
url = 'http://image.baidu.com/search/index'
data = {  # dict of query parameters sent with the GET request
    'tn': 'baiduimage',
    'ct': 201326592,
    'cl': 2,
    'lm': -1,
    'pv': '',
    'word': keyword,
    'z': 0,
    'ie': 'utf-8'
}
response = rq.get(url, data)  # the second positional argument of rq.get is params
response.encoding = 'utf-8'

# Match URLs
match_pattern = r'"objURL":"([a-zA-Z0-9:_/.-]*)"'
matcher = re.compile(match_pattern)
objURL_list = matcher.findall(response.text)
print('---------------------------------')
print('Find %d pics!' % len(objURL_list))

# Save pics
print('Busy...')
os.makedirs('./catch_pics/', exist_ok=True)  # the target folder must exist before writing
img_counter = 0
for each in objURL_list:
    img_page = rq.get(each)
    with open('./catch_pics/' + str(img_counter) + '.jpg', 'wb') as img:
        img.write(img_page.content)
    print(str(img_counter) + '.jpg is done!')
    img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)

For now this demo can only grab 30 images per run, because a Baidu image search loads 30 images by default. To grab more, we need to implement "paging".

Search for any keyword on Baidu Images (e.g. cat), open the developer tools once the results load, and inspect the request data under the Network tab. Every time a new batch of 30 images starts loading, an XHR request appears; inspecting it reveals a key field, pn. My guess:

  • pn: the number of images already shown, i.e. a page offset

If the guess is right, all we have to do is set pn to the desired offset when sending the request, and everything else stays the same (see the sketch below).
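
A minimal sketch of the idea (the field name pn comes straight from the observed XHR; everything else is unchanged from the demo above):

import requests as rq

url = 'http://image.baidu.com/search/index'
data = {
    'tn': 'baiduimage',
    'word': 'cat',
    'ie': 'utf-8',
    'pn': 30  # hypothesis: skip the first 30 results, i.e. ask for the second batch
}
response = rq.get(url, data)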

I'll verify the guess and keep improving this demo later.


2018.8.6
Tidied up the code. It now supports scraping across multiple pages, and you can choose how many pages to fetch. I changed the request URL to the old-version Baidu Images address, because I figured the old version was sure to accept pn in the payload (though it seems the new version does too; it has nothing to do with page layout... the server handles it either way).

The updated code is as follows:

import requests as rq
import re
import os

# Get response
keyword = input('Input keyword to search: ')
pages = int(input('How many pages do you want to search?(About 60p per-page): '))
print('Busy...')
url = 'http://image.baidu.com/search/flip'
page_index = 1
objURL_list = []
while page_index <= pages:
    data = {
        'tn': 'baiduimage',
        'word': keyword,
        'ie': 'utf-8',
        'pn': (page_index - 1) * 20  # offset of results already shown; the parentheses matter
    }
    response = rq.get(url, data)
    response.encoding = 'utf-8'

    # Match URLs
    match_pattern = r'"objURL":"([a-zA-Z0-9:_/.-]*)"'
    matcher = re.compile(match_pattern)
    objURL_list += matcher.findall(response.text)
    page_index += 1

print('---------------------------------')
print('Totally find %d pics!' % len(objURL_list))
print('---------------------------------')

# Save pics
print('Busy...')
os.makedirs('./catch_pics/', exist_ok=True)  # the target folder must exist before writing
img_counter = 0
for each in objURL_list:
    img_page = rq.get(each)
    with open('./catch_pics/' + str(img_counter) + '.jpg', 'wb') as img:
        img.write(img_page.content)
    print(str(img_counter + 1) + ' of ' + str(len(objURL_list)) + ' pics are done!')
    img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)

2018.8.8
Refactored a large amount of the code, since I found quite a few mistakes in the earlier versions. Building on the previous version, I wrote a crawler that scrapes images from safebooru, and added some refinements such as header spoofing and a lower crawl rate.

The code is as follows:

import requests as rq
import re
import time
import os
'''TODO: consider adding multi-keyword filtering later'''

keyword = input('Safebooru Search: ')
pages = int(input('How many pages do you want to search?(About 40p per-page): '))
url = 'http://safebooru.org/index.php'
headers = {  # spoof a regular browser
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Host': 'safebooru.org',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36'
}
page_index = 1
img_url = []
while page_index <= pages:
    # Query parameters for this page
    data = {
        'page': 'post',
        's': 'list',
        'tags': keyword,
        'pid': (page_index - 1) * 40  # offset of posts already listed
    }
    try:
        # Get response
        res = rq.get(url, data, headers=headers, timeout=30)
        res.encoding = 'utf-8'
        # Match images
        match_pattern = r'src="//safebooru.org/thumbnails/([a-zA-Z0-9_/:.?-]*)"'
        matcher = re.compile(match_pattern)
        img_url += matcher.findall(res.text)
        print('---------------------------------')
        print('Get page %d of %d!' % (page_index, pages))
        if page_index < pages:
            print('Next page starts in 3 secs...')
            time.sleep(3)  # throttle the crawl a little
    except rq.exceptions.ConnectionError as e:
        print('Error type: %s! Jump to next page!' % str(e))
    finally:
        page_index += 1

print('Totally find %d pics!' % len(img_url))
print('---------------------------------')

# Preprocessing of pic URLs: thumbnails live under /thumbnails/, samples under /samples/
new_img_url = []
for each in img_url:
    new_each = each.replace('thumbnail', 'sample')
    new_img_url.append('http://safebooru.org//samples/' + new_each)

# Create image folder
if not os.path.exists('./pic/'):
    os.mkdir('./pic/')

# Download pics
img_counter = 0
for each in new_img_url:
    try:
        img_page = rq.get(each, timeout=30)
        if img_page.status_code == 404:  # no sample version exists: fall back to the original image
            new_url = each.replace('sample_', '').replace('samples', 'images')
            img_page = rq.get(new_url, timeout=30)
        with open('./pic/' + str(img_counter) + '.jpg', 'wb') as img:
            img.write(img_page.content)
        print(str(img_counter + 1) + ' of ' + str(len(new_img_url)) + ' pics are done!')
    except rq.exceptions.ConnectionError as e:
        print('Error type: %s! Jump to next image URL!' % str(e))
    finally:
        img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)
os.system('pause')  # Windows-only: keep the console window open

2018.8.17

After a few days of on-and-off tinkering, I finished the code that scrapes original images from Pixiv. While I'm at it, thanks to the Pixiv team for finally supporting popularity numbers in search; it makes filtering images a lot more convenient (laughs).
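
I'll skip pasting the whole script, but one detail is worth writing down: Pixiv's image CDN (i.pximg.net) refuses direct downloads that lack a Referer header. A minimal sketch of just that part, with a made-up placeholder URL (a real one has to be extracted from the artwork page first):

import requests

# Placeholder URL: substitute one scraped from an artwork page
img_url = 'https://i.pximg.net/img-original/img/0000/00/00/00/00/00/00000000_p0.jpg'
headers = {
    'Referer': 'https://www.pixiv.net/',  # without this, the CDN returns 403
    'User-Agent': 'Mozilla/5.0'
}
img_page = requests.get(img_url, headers=headers, timeout=30)
with open('pixiv_demo.jpg', 'wb') as img:
    img.write(img_page.content)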

And with that, this little image-scraping project comes to a close~