1. Scrapy Overview

1. Why learn the Scrapy framework?

  • It is a must-know technology for crawler work, and interview questions often touch on it.
  • It makes our crawlers faster and more powerful (it supports asynchronous crawling).

2. What is Scrapy?


  • An asynchronous crawling framework: Scrapy is a Python-based crawling framework for crawling websites and extracting structured data from their pages. It is currently the most popular crawler framework in the Python ecosystem; its architecture is clean and highly extensible, so it can handle all kinds of crawling needs flexibly and efficiently.
    (Program state transition diagram omitted.)

3. How to learn Scrapy?

4. Scrapy workflow

(Scrapy architecture and workflow diagrams omitted.)

Division of responsibilities

Component                      | Description                                                                                          | Requirement
Scrapy Engine (engine)         | The commander: passes data and signals between the different components                             | Already implemented by Scrapy
Scheduler (scheduler)          | A queue that holds the requests sent over by the engine                                             | Already implemented by Scrapy
Downloader (downloader)        | Downloads the requests sent by the engine (i.e. fetches the response) and returns it to the engine  | Already implemented by Scrapy
Spider (spider)                | Processes the responses sent by the engine, extracts data and URLs, and hands them back to the engine | Written by hand
Item Pipeline (pipeline)       | Processes the data handed over by the engine, e.g. saves it                                          | Written by hand
Downloader Middlewares         | Customizable download extensions, e.g. setting a proxy                                               | Usually not written by hand
Spider Middlewares             | Customizable request handling and response filtering                                                 | Usually not written by hand

2. Scrapy quick start (a small case study)

1. Installation

pip install scrapy
pip install scrapy==2.5.1   # install a specific version (2.5.1)

Run the "scrapy" command in a terminal to verify the installation; if it prints the Scrapy version and the list of available commands, the installation succeeded.
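You can also check the installed version from Python itself (a quick sanity check, not required by the tutorial):

python -c "import scrapy; print(scrapy.__version__)"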

2. Create a project

  • Open a cmd/terminal in the directory where you want the project to live.
# scrapy startproject <project_name>
scrapy startproject my_Scrapy


3. Project structure


  • my_Scrapy
    • my_Scrapy
      • spiders
        • __init__.py
      • __init__.py
      • items.py
      • middlewares.py
      • pipelines.py
      • settings.py
    • scrapy.cfg

What each file is for:

  • scrapy.cfg: the Scrapy project configuration file; it records the path to the settings module and deployment information. (Usually does not need changing.)
  • items.py: defines the Item data structures; all Item definitions can live here. (Defines which fields you want to scrape.)
  • pipelines.py: defines the Item Pipeline implementations.
  • settings.py: defines the project-wide configuration.
  • middlewares.py: the middleware file; defines the Spider Middlewares and Downloader Middlewares implementations.
  • spiders: contains the individual spiders; each spider gets its own .py file.

4. Create a Spider

# first change into the project directory:
cd my_Scrapy
# scrapy genspider <spider_name> <domain_to_crawl>
scrapy genspider spider1 www.baidu.com


  • Edit the generated spider1.py file:
import scrapy


class Spider1Spider(scrapy.Spider):
    # the spider's name; remember it, because the crawl is started with this name:
    name = 'spider1'
    # domains the spider is allowed to crawl (keeps it from wandering off to other sites)
    # note: this should be a bare domain such as 'quotes.toscrape.com', not a full URL (Scrapy logs a warning otherwise, as shown in the run output below)
    allowed_domains = ['http://quotes.toscrape.com/']
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        print(response.text)

Reference site used in this example: http://quotes.toscrape.com/

5. Create the Item

  • An Item is the container that holds the scraped data; it defines the structure of what you scrape.
    Edit the project's items.py file as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MyScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # target fields to collect: the quote text, the author, and the tags
    # quote text:
    text = scrapy.Field()
    # author:
    author = scrapy.Field()
    # tags:
    tags = scrapy.Field()
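Items behave like dictionaries; a quick illustration of how the fields defined above are filled and read (the same pattern is used inside parse() later):

item = MyScrapyItem()
item['text'] = '“Some quote…”'
item['author'] = 'Somebody'
item['tags'] = ['tag1', 'tag2']
print(item['author'])     # field access works like a dict
print(dict(item))         # an Item can be converted to a plain dict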

6. Parsing the Response

1. Scraping only the first page
  • Edit the parse() method in spider1.py; this method extracts the target content from the page source.
import scrapy
from lxml import etree


class Spider1Spider(scrapy.Spider):
    # the spider's name; the crawl is started with this name:
    name = 'spider1'
    # domains the spider is allowed to crawl (keeps it from wandering off to other sites)
    allowed_domains = ['http://quotes.toscrape.com/']
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data and/or build further requests to process.
        :param response:
        :return:
        """
        # Method one: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list (SelectorList) of quote blocks
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is still a Selector object)
            # Old-style API:
            # extract_first()  returns the first match (a string)
            # extract()  returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() pulls the string out of the Selector
            # author = quote.css('small.author::text')  # author (still a Selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (Selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New-style API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method two: parse with XPath (here via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            print(text, tags, '    ------', author)

Output of running start.py:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 19:24:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 19:24:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet Password: e2250e171a87ebd6
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 19:24:47 [scrapy.core.engine] INFO: Spider opened
2022-04-03 19:24:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 19:24:47 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 19:24:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2582,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.264597,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 11, 24, 48, 658256),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 11, 24, 47, 393659)}
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Spider closed (finished)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.['change', 'deep-thoughts', 'thinking', 'world']     ------ Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.['abilities', 'choices']     ------ J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.['inspirational', 'life', 'live', 'miracle', 'miracles']     ------ Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.['aliteracy', 'books', 'classic', 'humor']     ------ Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” ['be-yourself', 'inspirational']     ------ Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.['adulthood', 'success', 'value']     ------ Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.['life', 'love']     ------ André Gide
“I have not failed. I've just found 10,000 ways that won't work.['edison', 'failure', 'inspirational', 'paraphrased']     ------ Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” ['misattributed-eleanor-roosevelt']     ------ Eleanor Roosevelt
“A day without sunshine is like, you know, night.['humor', 'obvious', 'simile']     ------ Steve Martin

Process finished with exit code 0
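A side note on the parse() method above: re-parsing response.text with lxml is not strictly necessary, because Scrapy responses already expose .xpath() and .css() selectors. A minimal equivalent sketch of the same extraction using the built-in selectors (same XPath expressions, no lxml import; not part of the original tutorial code):

    def parse(self, response):
        # the same extraction as above, using Scrapy's built-in selectors
        for quote_div in response.xpath('//div[@class="quote"]'):
            text = quote_div.xpath('./span[1]/text()').get()
            author = quote_div.xpath('./span[2]/small/text()').get()
            tags = quote_div.xpath('./div[@class="tags"]/a/text()').getall()
            print(text, tags, '    ------', author)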

2. Crawling across pages
  • The main changes are, again, in spider1.py.
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider1Spider(scrapy.Spider):
    # the spider's name; the crawl is started with this name:
    name = 'spider1'
    # # domains the spider is allowed to crawl (keeps the spider from wandering off to other sites)
    # allowed_domains = ['http://quotes.toscrape.com/']   # with no restriction the spider can keep following "next page" links
    # initial request(s):
    start_urls = ['http://quotes.toscrape.com/']

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data and/or build further requests to process.
        :param response:
        :return:
        """
        # Method one: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list (SelectorList) of quote blocks
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is still a Selector object)
            # Old-style API:
            # extract_first()  returns the first match (a string)
            # extract()  returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() pulls the string out of the Selector
            # author = quote.css('small.author::text')  # author (still a Selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (Selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New-style API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method two: parse with XPath (here via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)

            # yield each record so Scrapy hands it to the Item Pipeline
            yield item

        # pagination
        next_page = response.css('ul.pager li.next a::attr("href")').get()   # the href of the "Next" link
        print(next_page)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option one)
        # print(url)
        url = response.url    # get the URL currently being crawled (option two)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next_page)
        print(url)
        # hand the new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by this same parse() method
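For reference, response.urljoin() resolves the relative href against the URL of the current response; with the values seen in this run (response.url is http://quotes.toscrape.com/ and the extracted href is /page/2/) the join produces the full next-page URL:

response.urljoin('/page/2/')    # -> 'http://quotes.toscrape.com/page/2/'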

A portion of the run output (screenshot omitted).

7. Saving the data

1. Saving data by running a scrapy command
1. Option one: run the command in a terminal
# scrapy crawl <spider_name> -o <output_file>
scrapy crawl spider1 -o demo.csv
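The export format is inferred from the file extension; besides CSV, Scrapy's feed exports can also write JSON, JSON lines and XML, for example:

scrapy crawl spider1 -o demo.json
scrapy crawl spider1 -o demo.jl    # JSON lines
scrapy crawl spider1 -o demo.xml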


2. Option two: change the command string in the start.py launcher
# A crawler built with Scrapy is a project: you cannot right-click and run the spider file directly; it has to be started from a terminal with "scrapy crawl <spider_name>".
# If you would rather not type that command in a terminal every time, create this start.py file.
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # run the terminal command from Python
cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())

# The red text is not an error: it is the initialization output printed by the Scrapy framework itself. The white text is what the print() statements output.


2. Saving the data yourself (by editing pipelines.py)
  1. Edit pipelines.py as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyScrapyPipeline:
    def process_item(self, item, spider):
        with open('demo.txt', 'a', encoding="utf-8") as f:
            f.write(item['text'] + '           ——' + item['author'] + "\n")
        return item
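The pipeline above reopens demo.txt for every single item, which works but is wasteful. A slightly more idiomatic sketch (same output file, same item fields, not part of the original tutorial) opens the file once when the spider starts and closes it when the spider finishes:

class MyScrapyPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('demo.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # called once per item: write one line per quote
        self.file.write(item['text'] + '           ——' + item['author'] + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider closes: release the file handle
        self.file.close()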

  2. Uncomment the ITEM_PIPELINES block in settings.py (otherwise the pipeline is never enabled and nothing gets written to the txt file):
# Scrapy settings for my_Scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'my_Scrapy'

SPIDER_MODULES = ['my_Scrapy.spiders']
NEWSPIDER_MODULE = 'my_Scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'my_Scrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'my_Scrapy.pipelines.MyScrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
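The value 300 next to MyScrapyPipeline is its priority: if you later enable several pipelines, items pass through them in ascending order of this number. A hypothetical example (MySQLPipeline is an assumed name, not part of this project):

ITEM_PIPELINES = {
   'my_Scrapy.pipelines.MyScrapyPipeline': 300,
   # 'my_Scrapy.pipelines.MySQLPipeline': 400,   # hypothetical second pipeline, would run after the first
}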

  3. Run result (screenshot omitted).

8. Running the project

1. Running from the terminal
# scrapy crawl <spider_name>
scrapy crawl spider1

The output begins with Scrapy's startup messages, the middle part is the page source printed by the spider, and it ends with the spider-shutdown messages (screenshots omitted).

2. Running from PyCharm

Create a launcher file named start.py in the project folder:

# A crawler built with Scrapy is a project: you cannot right-click and run the spider file directly; it has to be started from a terminal with "scrapy crawl <spider_name>".
# If you would rather not type that command in a terminal every time, create this start.py file.
from scrapy import cmdline

cmdline.execute('scrapy crawl spider1'.split())   # run the terminal command from Python

# The red text is not an error: it is the crawler's initialization output. The white text is what the print() statements output.

Output:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 16:34:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 16:34:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet Password: b9d4a8fccbb5b978
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 16:34:35 [scrapy.core.engine] INFO: Spider opened
2022-04-03 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 16:34:35 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > 
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
        <span>by <small class="author" itemprop="author">J.K. Rowling</small>
        <a href="/author/J-K-Rowling">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > 
            
            <a class="tag" href="/tag/abilities/page/1/">abilities</a>
            
            <a class="tag" href="/tag/choices/page/1/">choices</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > 
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/live/page/1/">live</a>
            
            <a class="tag" href="/tag/miracle/page/1/">miracle</a>
            
            <a class="tag" href="/tag/miracles/page/1/">miracles</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.</span>
        <span>by <small class="author" itemprop="author">Jane Austen</small>
        <a href="/author/Jane-Austen">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > 
            
            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
            
            <a class="tag" href="/tag/books/page/1/">books</a>
            
            <a class="tag" href="/tag/classic/page/1/">classic</a>
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.”</span>
        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
        <a href="/author/Marilyn-Monroe">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > 
            
            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > 
            
            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>
            
            <a class="tag" href="/tag/success/page/1/">success</a>
            
            <a class="tag" href="/tag/value/page/1/">value</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.</span>
        <span>by <small class="author" itemprop="author">André Gide</small>
        <a href="/author/Andre-Gide">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="life,love" /    > 
            
            <a class="tag" href="/tag/life/page/1/">life</a>
            
            <a class="tag" href="/tag/love/page/1/">love</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span>
        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
        <a href="/author/Thomas-A-Edison">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > 
            
            <a class="tag" href="/tag/edison/page/1/">edison</a>
            
            <a class="tag" href="/tag/failure/page/1/">failure</a>
            
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            
            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.”</span>
        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
        <a href="/author/Eleanor-Roosevelt">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > 
            
            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>
            
        </div>
    </div>

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“A day without sunshine is like, you know, night.</span>
        <span>by <small class="author" itemprop="author">Steve Martin</small>
        <a href="/author/Steve-Martin">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > 
            
            <a class="tag" href="/tag/humor/page/1/">humor</a>
            
            <a class="tag" href="/tag/obvious/page/1/">obvious</a>
            
            <a class="tag" href="/tag/simile/page/1/">simile</a>
            
        </div>
    </div>

    <nav>
        <ul class="pager">
            
            
            <li class="next">
                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
            </li>
            
        </ul>
    </nav>
    </div>
    <div class="col-md-4 tags-box">
        
            <h2>Top Ten tags</h2>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>
            </span>
            
            <span class="tag-item">
            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
            </span>
            
        
    </div>
</div>

    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>
</body>
</html>
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2578,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.29309,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 8, 34, 36, 608493),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 8, 34, 35, 315403)}
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

3. Using scrapy shell

1. Using the scrapy shell command in a terminal to test extraction against a single request

Target URL: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Enter the following command in the terminal:

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Microsoft Windows [Version 10.0.19042.1586]
(c) Microsoft Corporation. All rights reserved.
(base) C:\Users\吕成鑫\Desktop\scrapy框架的学习\my_Scrapy>scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-04 20:04:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-04 20:04:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet Password: 358ca5f9dee7f2d7
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled item pipelines:
['my_Scrapy.pipelines.MyScrapyPipeline']
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-04 20:04:11 [scrapy.core.engine] INFO: Spider opened
2022-04-04 20:04:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2022-04-04 20:04:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000229978B6E80>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000229978B6A20>
[s]   spider     <DefaultSpider 'default' at 0x22997d98898>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response
Out[1]: <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>

In [2]: response.text
Out[2]: "<html>\n <head>\n  <base href='http://example.com/' />\n  <title>Example website</title>\n </head>\n <body>\n  <div id='images'>\n   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n  </div>\n </body>\n</html>\n\n"

In [3]: response.xpath('//a')
Out[3]: 
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [4]: response.xpath('//a').xpath('./img')
Out[4]: 
[<Selector xpath='./img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image5_thumb.jpg">'>]

In [5]: response.xpath('//a').xpath('./img')[0]
Out[5]: <Selector xpath='./img' data='<img src="image1_thumb.jpg">'>

In [6]: response.xpath('//a').xpath('./img').getall()
Out[6]: 
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [7]: response.xpath('//a').xpath('./img').get()
Out[7]: '<img src="image1_thumb.jpg">'

In [8]: result = response.xpath('//a')

In [9]: result
Out[9]: 
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

In [10]: result.xpath('./img').getall()
Out[10]: 
['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

In [11]: response.xpath("//img")
Out[11]: 
[<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

In [12]: response.css('a')
Out[12]: 
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image ...'>]

In [13]: response.css('div#images')
Out[13]: [<Selector xpath="descendant-or-self::div[@id = 'images']" data='<div id="images">\n   <a href="image1....'>]

In [14]: response.css('div#images').get()
Out[14]: '<div id="images">\n   <a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>\n   <a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>\n   <a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>\n   <a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>\n   <a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>\n  </div>'

In [15]: response.xpath('//a/text()').re('Name:\s(.*)')
Out[15]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

In [16]: response.re('.*')      # re() cannot be called on the response directly; it has to be applied to a selector obtained from .css()/.xpath()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-a22dedc07090> in <module>()
----> 1 response.re('.*')

AttributeError: 'HtmlResponse' object has no attribute 're'

In [17]: 
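As the AttributeError above shows, re() lives on selectors rather than on the response itself; the usual pattern is to apply it to the result of .css() or .xpath(). Two follow-up lines you could try in the same shell (they should return the five 'My image N ' strings and the five relative hrefs, respectively):

response.css('a::text').re(r'Name:\s(.*)')     # regex applied to a SelectorList
response.css('a::attr(href)').getall()         # the href attribute of every link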

4. Implementing pagination

How do we move to the next page?

  • Recall:

    • How did we request the next page when using the requests module?
      • 1. Find the URL of the next page
      • 2. Then call requests.get(url)
  • Approach here:

    • 1. Find the URL of the next page
    • 2. Build a request for that next-page URL and hand it to the scheduler

1. Pagination by joining the next-page URL at the end of parse() and registering a callback

import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider2Spider(scrapy.Spider):
    # the spider's name; the crawl is started with this name:
    name = 'spider2'
    # # domains the spider is allowed to crawl (keeps the spider from wandering off to other sites)
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction the spider can keep following "next page" links
    # initial request(s):
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data and/or build further requests to process.
        :param response:
        :return:
        """
        # Method one: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list (SelectorList) of quote blocks
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is still a Selector object)
            # Old-style API:
            # extract_first()  returns the first match (a string)
            # extract()  returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() pulls the string out of the Selector
            # author = quote.css('small.author::text')  # author (still a Selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (Selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New-style API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method two: parse with XPath (here via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # yield each record so Scrapy hands it to the Item Pipeline
            yield item

        self.page += 1
        # note: the pagination has to stop somewhere
        if self.page < 11:
            # build the next request (option one):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # build the next request (option two):
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        # The pagination logic defined earlier:
        next_page = response.css('ul.pager li.next a::attr("href")').get()   # the href of the "Next" link
        print(next_page)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option one)
        # print(url)
        url = response.url    # get the URL currently being crawled (option two)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next_page)
        print(url)
        # hand the new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by this same parse() method
        """

2. Pagination by overriding the start_requests() method

import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider3Spider(scrapy.Spider):
    # the spider's name; the crawl is started with this name:
    name = 'spider3'
    # # domains the spider is allowed to crawl (keeps the spider from wandering off to other sites)
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction the spider can keep following "next page" links
    # initial request(s):
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # pagination built by overriding the method that issues the initial requests:
    def start_requests(self):   # runs when the spider starts; once this is overridden, start_urls is no longer used
        for page in range(1, 11):
            url = self.base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    # parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data and/or build further requests to process.
        :param response:
        :return:
        """
        # Method one: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list (SelectorList) of quote blocks
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the text inside the tag (note: this is still a Selector object)
            # Old-style API:
            # extract_first()  returns the first match (a string)
            # extract()  returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # ::text selects the inner text; extract_first() pulls the string out of the Selector
            # author = quote.css('small.author::text')  # author (still a Selector object)
            author = quote.css('small.author::text').extract_first()  # author (as text)
            # tags = quote.css('div.tags a.tag::text')   # tags (Selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New-style API:
            # get()     returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method two: parse with XPath (here via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # put the data into the Item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # yield each record so Scrapy hands it to the Item Pipeline
            yield item

        """
        self.page += 1
        # note: the pagination has to stop somewhere
        if self.page < 11:
            # build the next request (option one):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # build the next request (option two):
            # follow_all() was added in Scrapy 2.0: it joins the relative links and registers the callback
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        """
        # The pagination logic defined earlier:
        next_page = response.css('ul.pager li.next a::attr("href")').get()   # the href of the "Next" link
        print(next_page)  # /page/2/
        url = self.start_urls[0]   # get the URL currently being crawled (option one)
        # print(url)
        url = response.url    # get the URL currently being crawled (option two)
        # print(url)
        # join it with the relative link to build the next page's URL
        url = response.urljoin(next_page)
        print(url)
        # hand the new request for the next page back to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new request is parsed by this same parse() method
        """

3. Editing start.py to run the spider and save the data

# A crawler built with Scrapy is a project: you cannot right-click and run the spider file directly; it has to be started from a terminal with "scrapy crawl <spider_name>".
# If you would rather not type that command in a terminal every time, create this start.py file.
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # run the terminal command from Python
# cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())
# cmdline.execute('scrapy crawl spider2'.split())   # run the terminal command from Python
cmdline.execute('scrapy crawl spider3'.split())   # run the terminal command from Python

# The red text is not an error: it is the crawler's initialization output. The white text is the printed output.

5. Scrapy framework: case study 2

1. Analyzing the site

  1. Target site: the Tencent recruitment site
  2. Goals:
    1. Scrape the job posting information
    2. Handle pagination
      The page URL shown in the browser (which does not itself return the data): https://talent.antgroup.com/off-campus
  3. How the data is loaded: a mix of dynamic and static
    Data URLs found by inspecting the network traffic:
    Page 1:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    Page 2:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
    Detail page:
    url: https://careers.tencent.com/jobdesc.html?postId=1310124481703845888
    data-url: https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId=1310124481703845888&language=zh-cn
  4. Crawling approach:
    1. Request the page-1 list URL
    2. Parse the postId of every job on the list page
    3. Build the detail-page URL from each postId
2. Implementation steps

  1. Create the project
scrapy startproject tencent
  2. Create the spider
cd tencent
scrapy genspider spider1 tencent.com

Output:

C:\Users\lv\Desktop\scrapy框架的学习>scrapy startproject tencent
New Scrapy project 'tencent', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in: C:\Users\lv\Desktop\scrapy框架的学习\tencent

You can start your first spider with:
    cd tencent
    scrapy genspider example example.com

C:\Users\lv\Desktop\scrapy框架的学习>cd tencent

C:\Users\lv\Desktop\scrapy框架的学习\tencent>scrapy genspider spider1 tencent.com
Created spider 'spider1' using template 'basic' in module:
  tencent.spiders.spider1

C:\Users\lv\Desktop\scrapy框架的学习\tencent>
  3. Open the tencent project in PyCharm (screenshot omitted).
  4. Generate a spider1.py file with the following command (as shown above):
scrapy genspider spider1 tencent.com
  5. Edit spider1.py as follows:
import scrapy
import json
from tencent.items import TencentItem


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['tencent.com']
    # URL of one page of data (10 postings): change the pageIndex value to move through the pages
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # URL of a single posting's detail data: change the postId value to fetch a different posting
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId={}&language=zh-cn"

    start_urls = [one_url.format(1)]

    # parse the list page
    def parse(self, response):
        # parse the data (what comes back here is not HTML but a JSON payload, so load it as a dict)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # build the detail-page URL
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # issue the detail-page request; meta passes the partly-filled item along to the callback:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

        # pagination
        for page in range(2, 5):
            url = self.one_url.format(page)
            yield scrapy.Request(url, callback=self.parse)   # the following pages are list pages, so they go to parse(), not parse_detail()


    # parse the detail-page data
    def parse_detail(self, response):
        item = response.meta.get('item')
        # print(item)
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Requirement']

        yield item
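The spider imports TencentItem, but the tutorial never shows tencent/items.py. A minimal sketch consistent with the two fields the spider actually fills (job_name and job_duty) would be:

import scrapy


class TencentItem(scrapy.Item):
    # job title, taken from RecruitPostName on the list page
    job_name = scrapy.Field()
    # job requirements/duties, taken from Requirement on the detail page
    job_duty = scrapy.Field()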



  6. Uncomment the ITEM_PIPELINES setting (the pipelines block) in settings.py (screenshot omitted).
  7. Run the start.py file you wrote:
from scrapy import cmdline

# cmdline.execute("scrapy crawl spider1".split())
cmdline.execute("scrapy crawl spider1 -o demo.csv".split())
# cmdline.execute("scrapy crawl spider2".split())

Running it generates a demo.csv file (result screenshot omitted).

Supplement 1: Using the Spider class

1. What a Spider does
  1. Defines the logic for crawling a site
  2. Parses the pages that are crawled
2. Key attributes and methods of the Spider class
  • name: the spider's name.
  • allowed_domains: the domains the spider may visit, to stop it from wandering onto other sites.
  • start_urls: the list of URLs to request first.
  • custom_settings: a dict of settings specific to this spider; it overrides the project-wide configuration and must be defined as a class attribute (see the sketch after this list).
  • crawler: set by the from_crawler() method; the Crawler object this spider belongs to. It can be used to read the project's settings.
  • closed: called when the spider is closed, to release resources.
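A minimal sketch of a per-spider setting override (the spider name and the setting values here are illustrative, not from the tutorial):

import scrapy


class PoliteSpider(scrapy.Spider):          # hypothetical spider, for illustration only
    name = 'polite'
    start_urls = ['http://quotes.toscrape.com/']

    # class attribute: these settings apply only to this spider and override settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 8,
    }

    def parse(self, response):
        pass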

Supplement 2: The Request object

1. Introduction
  • A Request object is the Scrapy object you construct whenever you want to issue a new request.
    For example:
yield scrapy.Request(url=detail_url, callback=self.parse_detail)
2. Parameters
  • url: the URL of the new request. It is placed in the scheduler's queue.
  • callback: the function that will parse the response.
  • priority: the request's priority (controls which queued request is fetched first). The default is 0; the scheduler uses it when scheduling requests, and larger values are scheduled earlier.
  • method: the HTTP method, "GET" by default.
  • dont_filter: if True, the request bypasses the duplicate filter; the default is False (duplicate requests are filtered out).
  • errback: a method to call when the request fails, default None. (Rarely used.)
    For example:
    def parse(self, response):
    	...
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, errback=self.func)

    def func(self):
        print("method executed when the request fails")
  • body: the request body.
  • headers: the request headers.
  • cookies: cookies to send with the request.
  • meta: extra data attached to the request and carried over to the response; useful for passing values to the callback.
    For example:
    def parse(self, response):
        # parse the data (what comes back here is not HTML but a JSON payload, so load it as a dict)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # build the detail-page URL
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # issue the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

    # parse the detail-page data
    def parse_detail(self, response):
        item = response.meta.get('item')
        print(item)
  • encoding: the encoding of the request, "utf-8" by default.
  • cb_kwargs: extra keyword arguments to pass to the callback, given as a dict.
    For example:
	def parse(self, response):
			...
            # issue the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={"num": 1})

    # parse the detail-page data
    def parse_detail(self, response, num):
        print(num)
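The method, body, headers and cookies parameters mostly come into play together when a site expects a POST. A hedged sketch (the endpoint and payload below are placeholders, not part of this project):

import json
import scrapy


class PostDemoSpider(scrapy.Spider):        # hypothetical spider, for illustration only
    name = 'post_demo'

    def start_requests(self):
        # placeholder API endpoint and payload; swap in whatever the target site actually expects
        yield scrapy.Request(
            url='https://example.com/api/search',
            method='POST',
            body=json.dumps({'keyword': 'python', 'page': 1}),
            headers={'Content-Type': 'application/json'},
            cookies={'sessionid': 'xxx'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)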

Supplement 3: CSS selectors


"""
Parsing tools:
    1. Regular expressions                 fastest        hardest syntax to remember
    2. XPath                               medium speed   medium-difficulty syntax
    3. BS4 (bs syntax and CSS selectors)   slowest        simplest syntax
"""
from bs4 import BeautifulSoup
# a third-party library worth recommending: parsel
import parsel   # bundles all three selector styles: regex, xpath and css


html = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/titllie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# 1) Using CSS selectors through the BeautifulSoup module:
# parse the document
# lxml is a third-party parser and is much faster than the default html.parser
soup = BeautifulSoup(html, features="lxml")   # BeautifulSoup automatically completes unfinished HTML (e.g. it adds <body>, </html>, etc.)
# print(soup)

# 1. look up by tag name
a_tags = soup.select('a')
print(a_tags)

# 2. look up by class name
sister_class = soup.select('.sister')
print(sister_class)

# 3. look up by id
link1_id = soup.select("#link1")
print(link1_id)

# 4. combined selectors
a_link2 = soup.select("p #link2")
print(a_link2)
a_link2 = soup.select("p > #link2")  # > means a direct child
print(a_link2)
p_sister_class = soup.select("p > .sister")
print(p_sister_class)
# an id and a class of the same tag cannot be combined here
# p_sister_class_id = soup.select("p > .sister#link1")
# print(p_sister_class_id)

# 5. look up by attribute
a_href = soup.select('a[href="http://example.com/elsie"]')
print(a_href)

# 6. get the text inside a tag
text1 = soup.select('title')[0].get_text()
print(text1)

# 7. get the value of a tag attribute (e.g. the href attribute)
href = soup.select('a#link1')[0]['href']
print(href)
print("---"*20)


# 2) Using CSS selectors through the parsel module:
selector = parsel.Selector(html)   # create the Selector object
# selector.re()
# selector.xpath()
# selector.css()

# 1. look up by tag name
object_list = selector.css("a")
print(object_list.getall())   # getall() returns every match
# for item in object_list:
#     print(item.get())

# 2. look up by class name
print(selector.css('.sister').get())  # get() returns the first match
print(selector.css('.sister').getall())

# 3. look up by id
print(selector.css('#link1').getall())

# 4. combined selectors
print(selector.css('p.story a#link2').getall())


# 5. look up by attribute
print(selector.css('.story').get())

# 6. get the text inside a tag
print(selector.css('p > #link1::text').get())

# 7. get the value of a tag attribute (e.g. the href attribute)
print(selector.css('p > #link1::attr(href)').get())

# 8. pseudo-class selectors
print(selector.css('a').getall()[1])
print(selector.css('a:nth-child(1)').getall())  # select the nth element
