A simple image crawler built with Python's requests library

When I first touched Python I heard you could write code to scrape images, and I wanted to give it a try, so I wrote a simple demo that grabs images from Baidu Images, using the requests library.

Why not use the built-in urllib? Honestly, I don't know which of urllib and requests is better, or whether they can even be compared, since they suit different situations and different people have different preferences. I simply ran into a requests tutorial first, went and read its documentation, and ended up using it.
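
Just for reference, here is the same GET request written with both libraries (a throwaway sketch; httpbin.org is only a stand-in endpoint):

import urllib.request
import requests

# Standard library: assemble the query string yourself and decode the raw bytes
with urllib.request.urlopen('http://httpbin.org/get?word=cat') as resp:
    text1 = resp.read().decode('utf-8')

# requests: pass a params dict and get decoded text back
text2 = requests.get('http://httpbin.org/get', params={'word': 'cat'}).text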

The basic steps are as follows:

  1. Send a GET request with a data payload to image.baidu.com, including the search keyword
  2. After receiving the response, filter out the image URLs with a regular expression and store them in a list
  3. Iterate over the image URL list, request each image URL, and write each response to a file in binary mode

The code is as follows:

import requests as rq
import re
import os

# Get response
keyword = input('Input keyword to search: ')
print('Busy...')
url = 'http://image.baidu.com/search/index'
data = {  # dict of query parameters sent with the GET request
    'tn': 'baiduimage',
    'ct': 201326592,
    'cl': 2,
    'lm': -1,
    'pv': '',
    'word': keyword,
    'z': 0,
    'ie': 'utf-8'
}
response = rq.get(url, data)  # the second positional argument of rq.get is params
response.encoding = 'utf-8'

# Match URLs
match_pattern = r'"objURL":"([a-zA-Z0-9:_/.-]*)"'
matcher = re.compile(match_pattern)
objURL_list = matcher.findall(response.text)
print('---------------------------------')
print('Find %d pics!' % len(objURL_list))

# Save pics
print('Busy...')
os.makedirs('./catch_pics/', exist_ok=True)  # the target folder must exist before writing
img_counter = 0
for each in objURL_list:
    img_page = rq.get(each)
    with open('./catch_pics/' + str(img_counter) + '.jpg', 'wb') as img:
        img.write(img_page.content)
    print(str(img_counter) + '.jpg is done!')
    img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)

For now this demo can only grab 30 images per run, because a Baidu image search loads 30 images by default. To grab more, we need to implement "paging".

Search for any keyword on Baidu Images (e.g. cat), open the developer tools once the results load, and inspect the request data under the Network tab. Every time a new batch of 30 images starts loading, an XHR request appears; inspecting it reveals a key field, pn. My guess:

  • pn: the number of images already shown, i.e. a page offset

If the guess is right, all we have to do is set pn to the desired offset when sending the request, and everything else stays the same (see the sketch below).
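
A minimal sketch of the idea (the field name pn comes straight from the observed XHR; everything else is unchanged from the demo above):

import requests as rq

url = 'http://image.baidu.com/search/index'
data = {
    'tn': 'baiduimage',
    'word': 'cat',
    'ie': 'utf-8',
    'pn': 30  # hypothesis: skip the first 30 results, i.e. ask for the second batch
}
response = rq.get(url, data)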

I'll verify the guess and keep improving this demo later.


2018.8.6
Tidied up the code. It now supports scraping across multiple pages, and you can choose how many pages to fetch. I changed the request URL to the old-version Baidu Images address, because I figured the old version was sure to accept pn in the payload (though it seems the new version does too; it has nothing to do with page layout... the server handles it either way).

The updated code is as follows:

import requests as rq
import re
import os

# Get response
keyword = input('Input keyword to search: ')
pages = int(input('How many pages do you want to search?(About 60p per-page): '))
print('Busy...')
url = 'http://image.baidu.com/search/flip'
page_index = 1
objURL_list = []
while page_index <= pages:
    data = {
        'tn': 'baiduimage',
        'word': keyword,
        'ie': 'utf-8',
        'pn': (page_index - 1) * 20  # offset of results already shown; the parentheses matter
    }
    response = rq.get(url, data)
    response.encoding = 'utf-8'

    # Match URLs
    match_pattern = r'"objURL":"([a-zA-Z0-9:_/.-]*)"'
    matcher = re.compile(match_pattern)
    objURL_list += matcher.findall(response.text)
    page_index += 1

print('---------------------------------')
print('Totally find %d pics!' % len(objURL_list))
print('---------------------------------')

# Save pics
print('Busy...')
os.makedirs('./catch_pics/', exist_ok=True)  # the target folder must exist before writing
img_counter = 0
for each in objURL_list:
    img_page = rq.get(each)
    with open('./catch_pics/' + str(img_counter) + '.jpg', 'wb') as img:
        img.write(img_page.content)
    print(str(img_counter + 1) + ' of ' + str(len(objURL_list)) + ' pics are done!')
    img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)

2018.8.8
Refactored a large amount of the code, since I found quite a few mistakes in the earlier versions. Building on the previous version, I wrote a crawler that scrapes images from safebooru, and added some refinements such as header spoofing and a lower crawl rate.

The code is as follows:

import requests as rq
import re
import time
import os
'''TODO: consider adding multi-keyword filtering later'''

keyword = input('Safebooru Search: ')
pages = int(input('How many pages do you want to search?(About 40p per-page): '))
url = 'http://safebooru.org/index.php'
headers = {  # spoof a regular browser
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Host': 'safebooru.org',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36'
}
page_index = 1
img_url = []
while page_index <= pages:
    # Query parameters for this page
    data = {
        'page': 'post',
        's': 'list',
        'tags': keyword,
        'pid': (page_index - 1) * 40  # offset of posts already listed
    }
    try:
        # Get response
        res = rq.get(url, data, headers=headers, timeout=30)
        res.encoding = 'utf-8'
        # Match images
        match_pattern = r'src="//safebooru.org/thumbnails/([a-zA-Z0-9_/:.?-]*)"'
        matcher = re.compile(match_pattern)
        img_url += matcher.findall(res.text)
        print('---------------------------------')
        print('Get page %d of %d!' % (page_index, pages))
        if page_index < pages:
            print('Next page starts in 3 secs...')
            time.sleep(3)  # throttle the crawl a little
    except rq.exceptions.ConnectionError as e:
        print('Error type: %s! Jump to next page!' % str(e))
    finally:
        page_index += 1

print('Totally find %d pics!' % len(img_url))
print('---------------------------------')

# Preprocessing of pic URLs: thumbnails live under /thumbnails/, samples under /samples/
new_img_url = []
for each in img_url:
    new_each = each.replace('thumbnail', 'sample')
    new_img_url.append('http://safebooru.org//samples/' + new_each)

# Create image folder
if not os.path.exists('./pic/'):
    os.mkdir('./pic/')

# Download pics
img_counter = 0
for each in new_img_url:
    try:
        img_page = rq.get(each, timeout=30)
        if img_page.status_code == 404:  # no sample version exists: fall back to the original image
            new_url = each.replace('sample_', '').replace('samples', 'images')
            img_page = rq.get(new_url, timeout=30)
        with open('./pic/' + str(img_counter) + '.jpg', 'wb') as img:
            img.write(img_page.content)
        print(str(img_counter + 1) + ' of ' + str(len(new_img_url)) + ' pics are done!')
    except rq.exceptions.ConnectionError as e:
        print('Error type: %s! Jump to next image URL!' % str(e))
    finally:
        img_counter += 1
print('---------------------------------')
print('Done with %d pics!' % img_counter)
os.system('pause')  # Windows-only: keep the console window open

2018.8.17

After a few days of on-and-off tinkering, I finished the code that scrapes original images from Pixiv. While I'm at it, thanks to the Pixiv team for finally supporting popularity numbers in search; it makes filtering images a lot more convenient (laughs).
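
I'll skip pasting the whole script, but one detail is worth writing down: Pixiv's image CDN (i.pximg.net) refuses direct downloads that lack a Referer header. A minimal sketch of just that part, with a made-up placeholder URL (a real one has to be extracted from the artwork page first):

import requests

# Placeholder URL: substitute one scraped from an artwork page
img_url = 'https://i.pximg.net/img-original/img/0000/00/00/00/00/00/00000000_p0.jpg'
headers = {
    'Referer': 'https://www.pixiv.net/',  # without this, the CDN returns 403
    'User-Agent': 'Mozilla/5.0'
}
img_page = requests.get(img_url, headers=headers, timeout=30)
with open('pixiv_demo.jpg', 'wb') as img:
    img.write(img_page.content)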

And with that, this little image-scraping project comes to a close~