Python Crawlers for Beginners (40): Scrapy Crawler Framework Basics (7): Integrating Selenium in Practice

极客挖掘机 2020-01-15

Life is short, I use Python.

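Previous posts in this series:

Python Crawlers for Beginners (1): Opening

Python Crawlers for Beginners (2): Preparation (1): Installing the Basic Libraries

Python Crawlers for Beginners (3): Preparation (2): Linux Basics

Python Crawlers for Beginners (4): Preparation (3): Docker Basics

Python Crawlers for Beginners (5): Preparation (4): Database Basics

Python Crawlers for Beginners (6): Preparation (5): Installing Crawler Frameworks

Python Crawlers for Beginners (7): HTTP Basics

Python Crawlers for Beginners (8): Web Page Basics

Python Crawlers for Beginners (9): Crawler Basics

Python Crawlers for Beginners (10): Session and Cookies

Python Crawlers for Beginners (11): urllib Basic Usage (1)

Python Crawlers for Beginners (12): urllib Basic Usage (2)

Python Crawlers for Beginners (13): urllib Basic Usage (3)

Python Crawlers for Beginners (14): urllib Basic Usage (4)

Python Crawlers for Beginners (15): urllib Basic Usage (5)

Python Crawlers for Beginners (16): urllib in Practice: Crawling Meizitu

Python Crawlers for Beginners (17): Requests Basic Usage

Python Crawlers for Beginners (18): Requests Advanced Operations

Python Crawlers for Beginners (19): XPath Basics

Python Crawlers for Beginners (20): XPath Advanced

Python Crawlers for Beginners (21): The Beautiful Soup Parsing Library (Part 1)

Python Crawlers for Beginners (22): The Beautiful Soup Parsing Library (Part 2)

Python Crawlers for Beginners (23): Getting Started with the pyquery Parsing Library

Python Crawlers for Beginners (24): 2019 Douban Movie Rankings

Python Crawlers for Beginners (25): Crawling Stock Information

Python Crawlers for Beginners (26): Why You Can't Even Afford a Second-Hand Home in Shanghai

Python Crawlers for Beginners (27): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 1)

Python Crawlers for Beginners (28): The Selenium Test Automation Framework, from Getting Started to Giving Up (Part 2)

Python Crawlers for Beginners (29): Using Selenium to Get Product Information from a Large E-commerce Site

Python Crawlers for Beginners (30): Proxy Basics

Python Crawlers for Beginners (31): Building a Simple Proxy Pool Yourself

Python Crawlers for Beginners (32): Getting Started with the Asynchronous Request Library AIOHTTP

Python Crawlers for Beginners (33): Scrapy Crawler Framework Basics (1)

Python Crawlers for Beginners (34): Scrapy Crawler Framework Basics (2)

Python Crawlers for Beginners (35): Scrapy Crawler Framework Basics (3): Selector

Python Crawlers for Beginners (36): Scrapy Crawler Framework Basics (4): Downloader Middleware

Python Crawlers for Beginners (39): Getting Started with Scrapy-Splash, a JavaScript Rendering Service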

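Introduction

Scrapy fetches pages the same way the Requests library does: it issues plain HTTP requests directly. That leaves it rather helpless against pages that are rendered dynamically by JavaScript.

Earlier in this series we crawled JavaScript-rendered pages by using Selenium to drive a real browser. Scrapy can be hooked up to Selenium in the same way.

With this approach we do not need to care which requests a page sends while it loads, nor how the page is rendered. We simply grab the final result: whatever is visible can be crawled.

Example

A small goal

Let's set a small goal first. In an earlier article we used Selenium to collect product information from JD. This article uses the same site again, with thanks to JD for providing the material.

Preparation

Please make sure that Scrapy, Selenium, and the browser driver that Selenium needs are all correctly installed locally. If they are not installed yet, see the earlier articles in this series.

Creating the project

For this article we again create a new Scrapy project, named scrapy_selenium_demo, with the following command: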

scrapy startproject scrapy_selenium_demo

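Pick whichever directory you like, ideally one whose path contains only English (ASCII) characters.

Then create a Spider (run the command from inside the project directory):

scrapy genspider jd www.jd.com

Remember to edit settings.py as well and set ROBOTSTXT_OBEY to False. Otherwise we cannot fetch the product data, because JD's robots.txt does not allow product data to be crawled. The change is: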

ROBOTSTXT_OBEY = False

Defining the data structure

As usual, the first step is to define the structure of the data we want to crawl as an Item:

import scrapy


class ProductItem(scrapy.Item):
    collection = 'products'
    image = scrapy.Field()
    price = scrapy.Field()
    name = scrapy.Field()
    commit = scrapy.Field()
    shop = scrapy.Field()
    icons = scrapy.Field()

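Here we define 6 fields, exactly the same as in the earlier example, plus a collection attribute that holds the name of the collection (table) the data will be saved to.

Spider

Next comes the Spider definition. We start with a first version of the start_requests() method, which will be revised later: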

# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class JdSpider(Spider):
    name = 'jd'
    # Use the parent domain so that requests to search.jd.com are not treated as offsite
    allowed_domains = ['jd.com']
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        base_url = 'https://search.jd.com/Search?keyword=iPhone&ev=exbrand_Apple'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
            'referer': 'https://www.jd.com/'
        }
        # MAX_PAGE is read from settings.py; the step of 2 follows the site's page numbering
        for page in range(1, self.settings.getint('MAX_PAGE') + 1, 2):
            url = base_url + '&page=' + str(page)
            yield Request(url=url, callback=self.parse, headers=headers)

The maximum page number is given by MAX_PAGE. This setting also has to be added to settings.py:

MAX_PAGE = 3

In start_requests() we build the URL of every page we need to visit by string concatenation. Because of how JD's product listing pages are numbered, the loop uses a step of 2.
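The page numbering can be tried out without Scrapy. The following is a minimal sketch, assuming only what is stated above (the `page` parameter advances in steps of 2). `build_page_urls` is a hypothetical helper for illustration and is not part of the project.

```python
from urllib.parse import urlencode

BASE_URL = 'https://search.jd.com/Search'


def build_page_urls(max_page, keyword='iPhone'):
    """Return the search URLs for page = 1, 3, 5, ... up to max_page."""
    urls = []
    for page in range(1, max_page + 1, 2):
        # urlencode escapes the values and joins them with '&'
        query = urlencode({'keyword': keyword, 'ev': 'exbrand_Apple', 'page': page})
        urls.append(BASE_URL + '?' + query)
    return urls


urls = build_page_urls(3)
for url in urls:
    print(url)
```

With MAX_PAGE = 3 this yields two URLs, one for page=1 and one for page=3, which matches the number of requests the Spider issues.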

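Integrating Selenium

Next we need to actually fetch the data for these requests, and we do that by hooking Scrapy up to Selenium.

The integration is implemented as a Downloader Middleware. Sample code: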

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):
    def __init__(self, timeout=None, service_args=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        # Run Chrome in headless (windowless) mode
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')

        # Assumes Selenium 3.x, where webdriver.Chrome accepts service_args directly
        self.driver = webdriver.Chrome(service_args=service_args or [], options=chrome_options)
        self.driver.set_window_size(1400, 700)
        self.driver.implicitly_wait(self.timeout)
        self.driver.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.driver, self.timeout)

    def __del__(self):
        # quit() closes all windows and also stops the driver process
        self.driver.quit()

    def process_request(self, request, spider):
        self.logger.debug('Chrome is Starting')
        try:
            # 'page' is taken from request.meta and defaults to 1 when it is not set
            page = request.meta.get('page', 1)
            self.driver.get(request.url)
            if page > 1:
                # Type the page number into the pager at the bottom and confirm
                page_input = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, '//*[@id="J_bottomPage"]/span[2]/input')))
                button = self.wait.until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="J_bottomPage"]/span[2]/a')))
                page_input.clear()
                page_input.send_keys(page)
                button.click()
            # Hand the rendered page source back to Scrapy as the response
            return HtmlResponse(url=request.url, body=self.driver.page_source, request=request, encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('CHROME_SERVICE_ARGS'))
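The middleware works because of one Scrapy rule: when process_request() returns a Response, Scrapy skips its own download and uses that response. The following is a toy model of that rule only, written without Scrapy or Selenium. Every class and function in it (`FakeResponse`, `RenderingMiddleware`, `PassThroughMiddleware`, `default_download`, `fetch`) is a hypothetical stand-in, not a Scrapy API.

```python
class FakeResponse:
    """Stand-in for a response object; not Scrapy's HtmlResponse."""

    def __init__(self, url, body, source):
        self.url = url
        self.body = body
        self.source = source  # records who produced the response


class RenderingMiddleware:
    """Returns a response itself, the way SeleniumMiddleware does."""

    def process_request(self, url):
        return FakeResponse(url, '<html>rendered</html>', source='middleware')


class PassThroughMiddleware:
    """Returns None, which means 'carry on with the next step'."""

    def process_request(self, url):
        return None


def default_download(url):
    # Placeholder for the plain HTTP download that would normally happen
    return FakeResponse(url, '<html>raw</html>', source='downloader')


def fetch(url, middlewares):
    for middleware in middlewares:
        response = middleware.process_request(url)
        if response is not None:
            return response  # short-circuit: the default download is skipped
    return default_download(url)


rendered = fetch('https://example.com', [PassThroughMiddleware(), RenderingMiddleware()])
plain = fetch('https://example.com', [PassThroughMiddleware()])
print(rendered.source, plain.source)
```

In the real project the role of `RenderingMiddleware` is played by SeleniumMiddleware, which returns an HtmlResponse built from the browser's page source.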

Once the Downloader Middleware is written, register it in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'scrapy_selenium_demo.middlewares.SeleniumMiddleware': 543,
}
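The middleware's from_crawler() also reads SELENIUM_TIMEOUT and CHROME_SERVICE_ARGS from the settings, and the article does not list values for them. A sketch of what the entries in settings.py might look like follows; both values are assumptions chosen for illustration.

```python
# settings.py (sketch): the values below are illustrative assumptions
SELENIUM_TIMEOUT = 20     # seconds; the middleware uses it for all of its waits
CHROME_SERVICE_ARGS = []  # extra arguments for the ChromeDriver service, if any
```

SELENIUM_TIMEOUT should be set to a number, since the middleware passes it straight to the driver's wait and timeout calls.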

Parsing the page

The Downloader Middleware hands us an HtmlResponse, which now has to be parsed in the Spider:

# Method of JdSpider. It needs this import at the top of the spider module:
# from scrapy_selenium_demo.items import ProductItem
def parse(self, response):
    # Each matched node is one product card in the result list
    products = response.css('#J_goodsList .gl-item .gl-i-wrap')
    for product in products:
        item = ProductItem()
        item['image'] = product.css('.p-img a img::attr("src")').extract_first()
        item['price'] = product.css('.p-price i::text').extract_first()
        item['name'] = product.css('.p-name em::text').extract_first()
        item['commit'] = product.css('.p-commit a::text').extract_first()
        item['shop'] = product.css('.p-shop a::text').extract_first()
        item['icons'] = product.css('.p-icons .goods-icons::text').extract_first()
        yield item
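Note that extract_first() returns None when a selector matches nothing, and matched text often carries surrounding whitespace. A minimal sketch of a clean-up step that could be applied to the fields before saving them follows. `clean_text` is a hypothetical helper, and the sample values are made up for illustration, not real crawl results.

```python
def clean_text(value, default=''):
    """Collapse runs of whitespace in an extracted string; map None to a default."""
    if value is None:
        return default
    return ' '.join(value.split())


# Made-up sample values standing in for what the selectors might return
raw_fields = {
    'name': '\n  Apple iPhone 11  \t',
    'price': '5499.00',
    'shop': None,  # what extract_first() yields when nothing matched
}
cleaned = {key: clean_text(value) for key, value in raw_fields.items()}
print(cleaned)
```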

Saving to MongoDB

We add an item pipeline, MongoPipeline, that saves the data to MongoDB:

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB are read from settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        # Open the connection once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Save into the collection named on the Item ('products' for ProductItem)
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

Add the corresponding configuration to settings.py:

ITEM_PIPELINES = {
   'scrapy_selenium_demo.pipelines.MongoPipeline': 300,
}
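The pipeline also reads MONGO_URI and MONGO_DB from the settings, and the article does not show values for them. A sketch of the two entries follows, assuming a MongoDB instance on the local machine at the default port; the database name is hypothetical.

```python
# settings.py (sketch): connection values are illustrative assumptions
MONGO_URI = 'mongodb://localhost:27017'  # assumes a local MongoDB on the default port
MONGO_DB = 'jd'                          # hypothetical database name
```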

That completes the main program. The crawler can be run with the following command:

scrapy crawl jd

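I won't paste the results here. The code has been uploaded to the code repository, and interested readers can get it there.

Sample code

All the code for this series is kept in repositories on Github and Gitee for everyone's convenience.

Sample code on Github

Sample code on Gitee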