Using the Scrapy Crawler Framework

落寞书生 · Published 2024-12-13 15:08:00

1. Installing Scrapy

Install Python 3 and PyCharm Community Edition first.

Then install Scrapy with:

pip install scrapy

2. Creating the project

Run:

scrapy startproject test_spider

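As shown in the figure:

3. Opening the project in PyCharm and setting up a pipenv virtual environment

A virtual environment isolates the project's dependencies. Open the project, as shown in the figure:

Open Settings, as shown in the figure:

Click Add Interpreter and choose the Pipenv environment, as shown in the figure:

Click OK to finish the setup, as shown in the figure: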

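4. Scraping data

In the project directory, run:

scrapy genspider getjobsinfo <target domain>

For example, for the Zhaopin (智联招聘) job site, whose start URL in the code below is https://www.zhaopin.com/:

scrapy genspider getjobsinfo zhaopin.com

As shown in the figure:

This creates a Python file named getjobsinfo in the spiders package. That file is the spider we just generated.

Spider code: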

import scrapy
from scrapy import Request, Selector


class GetjobsinfoSpider(scrapy.Spider):
    name = 'getjobsinfo'
    allowed_domains = ['zhaopin.com']

    # start_urls = ['https://www.zhaopin.com/']
    def start_requests(self):
        # Hand the start URL to the engine to begin crawling
        yield Request(url='https://www.zhaopin.com/')

    def parse(self, response, **kwargs):
        # Parse the response body with XPath
        jobs_list = Selector(text=response.text).xpath(
            "//a[@class='zp-jobNavigater__pop--href']/text()").extract()
        citys_list = Selector(text=response.text).xpath(
            "//div[@class='footerFuncCity clearfix']/ul/li/strong/a/text()").extract()
        print(jobs_list)
        print(citys_list)

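5. Reinstalling Scrapy

The virtual environment is isolated, so the code cannot find the Scrapy library that was installed earlier. Scrapy has to be installed again inside the virtual environment, as shown in the figure:

Install it from the command line (pip install scrapy), or use the IDE's quick install, as shown in the figure: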
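
To confirm which interpreter is active and whether it can see Scrapy, a small check along these lines may help. It is only a sketch using the standard library; the helper names in_virtualenv and is_installed are made up for illustration.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def is_installed(package: str) -> bool:
    """True when `package` can be imported by the running interpreter."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment:", in_virtualenv())
    print("scrapy installed:", is_installed("scrapy"))
```

Run it with the interpreter PyCharm has selected for the project. If the last line prints False, Scrapy still needs to be installed in that environment.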

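6. Integrating Selenium

Selenium is a browser automation tool. It drives a real browser to reproduce user behavior, which helps when scraping sites that use anti-crawler measures.
Open middlewares.py and edit the TestSpiderDownloaderMiddleware class as follows.

Approach: once the response arrives, extract the page text with BeautifulSoup4. If the text is shorter than 200 characters, fetch the page again with Selenium.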
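
The 200-character check can be tried on its own before wiring it into the middleware. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, it skips script and style contents (a choice made here, not something the article does), and the names visible_text and needs_browser_render are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text content of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def needs_browser_render(html: str, threshold: int = 200) -> bool:
    """True when the page carries so little text that it was probably not rendered."""
    return len(visible_text(html)) < threshold
```

A page that is only an empty container plus a script would be flagged for Selenium, while a page that already holds a few hundred characters of text would be passed through unchanged.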

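First install BeautifulSoup4 and Selenium in the virtual environment, and place the Chrome driver in the Python root directory of the virtual environment, as shown in the figure:

The driver version must match the installed browser version.
Check the driver version:

Driver download link:
CNPM Binaries Mirror
My browser version is fairly new and no driver had been officially released for it yet, so I used this temporary address:
https://googlechromelabs.github.io/chrome-for-testing/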
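
One way to sanity-check a driver against the browser is to compare version strings. The sketch below assumes that agreeing on the major version (the first number) is the part that matters, which is an assumption rather than something stated in this article. The function names and the version strings are placeholders.

```python
def major_version(version: str) -> int:
    """Return the first dotted component of a version string as an integer."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(driver_version: str, browser_version: str) -> bool:
    """Assumption: driver and browser are compatible when their major versions agree."""
    return major_version(driver_version) == major_version(browser_version)


if __name__ == "__main__":
    # Placeholder version strings, not taken from the article
    print(driver_matches_browser("131.0.6778.85", "131.0.6778.204"))
    print(driver_matches_browser("130.0.6723.116", "131.0.6778.204"))
```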

Install the dependencies:

pip install BeautifulSoup4
pip install selenium

Edit the TestSpiderDownloaderMiddleware class. Import the dependencies:

from bs4 import BeautifulSoup
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common import TimeoutException

Add a constructor and a destructor, and modify process_response:

    def __init__(self):
        # Create the Chrome instance once, when the middleware is initialized
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run without a visible window
        self.webdriver = webdriver.Chrome(options=options)

    def __del__(self):
        self.webdriver.close()  # close the window
        self.webdriver.quit()  # shut down the browser

    def process_response(self, request, response, spider):
        try:
            # Extract the plain text of the response body
            pure_text = BeautifulSoup(response.body, 'html.parser').get_text()
            if len(pure_text) < 200:
                print('Chrome driver begin...')
                self.webdriver.get(url=response.url)
                # wait = WebDriverWait(self.webdriver, timeout=20)
                # Return the HTML as rendered by Selenium
                return HtmlResponse(url=response.url, body=self.webdriver.page_source,
                                    encoding='utf-8')
            else:
                return response
        except TimeoutException:
            return HtmlResponse(url=response.url, encoding='utf-8', status=500)
        finally:
            print('Chrome driver end...')

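As shown in the figure:

After editing the middleware, enable it in settings.py. The DOWNLOADER_MIDDLEWARES setting is already present but commented out, so uncommenting it is enough. TestSpiderDownloaderMiddleware is the middleware's class name.
As shown in the figure: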
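
For reference, the uncommented setting would look roughly like this. The module path is assumed from the project name test_spider used above, and 543 is the priority value found in the generated project template; a different number changes the order relative to other middlewares.

```python
# settings.py (excerpt)
# Maps the middleware's import path to its priority.
DOWNLOADER_MIDDLEWARES = {
    "test_spider.middlewares.TestSpiderDownloaderMiddleware": 543,
}
```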

7. Receiving data with an Item

Scraped data should be collected in an Item so that later stages can process it. Add an Item to items.py.

class JobInfo(scrapy.Item):
    job_name = scrapy.Field() 
    job_salary = scrapy.Field()  
    job_place = scrapy.Field()  
    job_experience = scrapy.Field()  
    job_education = scrapy.Field()  
    job_tag = scrapy.Field() 
    company_name = scrapy.Field()  
    company_type = scrapy.Field() 
    company_scale = scrapy.Field()  
    link = scrapy.Field()

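8. Using a callback

There is a problem here: by default, every downloaded page is handled by the same parse method, which is not what we want for the job listing pages. The Request therefore needs a callback, so a second method has to be written to parse the next level of data.

That callback also extracts the fields and stores them in the Item.

Update getjobsinfo.py: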

import scrapy
from scrapy import Request, Selector
from test_spider.items import JobInfo
class GetjobsinfoSpider(scrapy.Spider):
    name = 'getjobsinfo'
    allowed_domains = ['zhaopin.com']
    # start_urls = ['https://www.zhaopin.com/']
    def start_requests(self):
        yield Request(url='https://www.zhaopin.com/')
    def parse(self, response, **kwargs):
        jobs_list = Selector(text=response.text).xpath(
            "//a[@class='job-menu__sub__name']/text()").extract()  # job categories
        # citys_list = Selector(text=response.text).xpath(
        #     "//a[@class='city-nav__item__cities__a']/text()").extract()  # job locations
        print(jobs_list)
        # print(citys_list)
        for job in jobs_list:
            # for city in citys_list:
            #     url = f'http://sou.zhaopin.com/?jl={city}&kw={job}'
            #     yield Request(url=url, callback=self.jobs_parse)
            url = f'http://sou.zhaopin.com/?jl=成都&kw={job}'
            yield Request(url=url, callback=self.jobs_parse)
    def jobs_parse(self, response):
        doms = Selector(text=response.text).xpath(
            "//*[@id='positionList-hook']/div/div[@class='joblist-box__item clearfix']").extract()
        for dom in doms:
            # Extract the fields
            job_name = Selector(text=dom).xpath(
                "//span[@class='iteminfo__line1__jobname__name']/@title").extract_first()
            job_salary = Selector(text=dom).xpath(
                "//p[@class='iteminfo__line2__jobdesc__salary']/text()").extract_first()
            job_place = Selector(text=dom).xpath(
                "//ul[@class='iteminfo__line2__jobdesc__demand']/li[1]/text()").extract_first()
            job_experience = Selector(text=dom).xpath(
                "//ul[@class='iteminfo__line2__jobdesc__demand']/li[2]/text()").extract_first()
            job_education = Selector(text=dom).xpath(
                "//ul[@class='iteminfo__line2__jobdesc__demand']/li[3]/text()").extract_first()
            job_tag = Selector(text=dom).xpath(
                "//div[@class='iteminfo__line3__welfare']/div/text()").extract()
            company_name = Selector(text=dom).xpath(
                "//span[@class='iteminfo__line1__compname__name']/@title").extract_first()
            company_type = Selector(text=dom).xpath(
                "//div[@class='iteminfo__line2__compdesc']/span[1]/text()").extract_first()
            company_scale = Selector(text=dom).xpath(
                "//div[@class='iteminfo__line2__compdesc']/span[2]/text()").extract_first()
            link = Selector(text=dom).xpath(
                "//a[@class='joblist-box__iteminfo iteminfo']/@href").extract_first()
            # Populate the item
            job_info = JobInfo()
            job_info['job_name'] = job_name
            job_info['job_salary'] = job_salary
            job_info['job_place'] = job_place
            job_info['job_experience'] = job_experience
            job_info['job_education'] = job_education
            job_info['job_tag'] = job_tag
            job_info['company_name'] = company_name
            job_info['company_type'] = company_type
            job_info['company_scale'] = company_scale
            job_info['link'] = link
            # Yield the item to the pipeline
            yield job_info
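
The search URL in parse puts non-ASCII text (the city 成都, Chengdu, and the job keyword) directly into the query string. As an optional variation, not part of the original code, the parameters can be percent-encoded explicitly with the standard library. build_search_url and the keyword "Java开发" are hypothetical names chosen for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_BASE = "http://sou.zhaopin.com/"  # same host as in the spider above


def build_search_url(city: str, keyword: str) -> str:
    """Hypothetical helper: percent-encode the jl (city) and kw (keyword) parameters."""
    return f"{SEARCH_BASE}?{urlencode({'jl': city, 'kw': keyword})}"


if __name__ == "__main__":
    url = build_search_url("成都", "Java开发")
    print(url)
    # Decoding the query string gives back the original values
    print(parse_qs(urlsplit(url).query))
```

In the spider this would replace the f-string, for example url = build_search_url('成都', job).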

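9. Persisting the data

Use peewee to persist the data, handled in the item pipeline.

9.1 Installing peewee

Command:

pip install peewee

Create a Model.py file in the spiders package (the pipeline below imports it as test_spider.spiders.Model) with the following code: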

from peewee import CharField, Model, MySQLDatabase, TextField

db = MySQLDatabase('wsx',  # database name
                   host="192.168.0.95",  # host address
                   port=3306,  # port, 3306 by default
                   user="root",  # user name
                   password="meimima")  # password


class DataModel(Model):
    class Meta:
        database = db


class JobsInfo(DataModel):
    job_name = CharField(max_length=255)
    job_salary = CharField(max_length=255)
    job_place = CharField(max_length=255)
    job_experience = CharField(max_length=255)
    job_education = CharField(max_length=255)
    job_tag = TextField(default="")
    company_name = CharField(max_length=255)
    company_type = CharField(max_length=255)
    # default sets the default value; verbose_name is the field description
    company_scale = CharField(max_length=255, default="", verbose_name="")
    link = TextField(default="", verbose_name="")


db.create_tables([JobsInfo])

9.2 Editing the pipeline file

Open pipelines.py and edit it as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from test_spider.spiders.Model import JobsInfo

class TestSpiderPipeline:
    def __init__(self):
        pass

    def close_spider(self, spider):
        pass

    # Handle each submitted item
    def process_item(self, item, spider):
        try:
            job_info = JobsInfo()
            job_info.job_name = item['job_name']
            job_info.job_salary = item['job_salary']
            job_info.job_place = item['job_place']
            job_info.job_experience = item['job_experience']
            job_info.job_education = item['job_education']
            job_info.job_tag = item['job_tag']
            job_info.company_name = item['company_name']
            job_info.company_type = item['company_type']
            job_info.company_scale = item['company_scale']
            job_info.link = item['link']
            job_info.save()
            print(f"{item['job_name']} saved")
        except (IndexError, TypeError, TimeoutError):
            print("save failed")
        # Return the item so that any later pipeline still receives it
        return item
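
One detail worth noting: job_tag comes from .extract(), which returns a list, while the model declares job_tag as a TextField. One possible way to store it as plain text is to join the tags before assigning them. The helper below is a sketch with a hypothetical name, not part of the original pipeline.

```python
from typing import Iterable, Optional


def tags_to_text(tags: Optional[Iterable[str]], sep: str = ",") -> str:
    """Hypothetical helper: join welfare tags into one string for a text column."""
    if not tags:
        return ""
    # Drop empty entries and surrounding whitespace before joining
    return sep.join(t.strip() for t in tags if t and t.strip())
```

In process_item this could be used as job_info.job_tag = tags_to_text(item['job_tag']).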

The overall project structure, as shown in the figure:

9.3 Finally, enable the pipeline in settings.py.
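
As with the middleware, this setting exists in the generated settings.py but is commented out. A sketch of the enabled form follows, assuming the module path derived from the project name test_spider. The value 300 is the one used in the generated template, and lower numbers run earlier when several pipelines are configured.

```python
# settings.py (excerpt)
# Maps the pipeline's import path to its order.
ITEM_PIPELINES = {
    "test_spider.pipelines.TestSpiderPipeline": 300,
}
```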

All done!

Open a terminal in the project directory and run scrapy crawl getjobsinfo to start the spider.

If an error like the one in the figure appears:

it means the MySQL driver is not installed.
Install it with:

pip install pymysql

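If an error such as [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: appears, as shown in the figure:

set ROBOTSTXT_OBEY = False in settings.py, as shown in the figure:

Once that is done, run the crawl again:

and the data appears: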

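10. Debugging

Sometimes no data is scraped, and debugging is needed. Here is how to debug with PyCharm.

10.1 Creating run.py

Create a run.py file in the same directory as settings.py with the following content: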


from scrapy import cmdline


name = 'getjobsinfo'
cmd = 'scrapy crawl {0}'.format(name)
cmdline.execute(cmd.split())

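As shown in the figure:

Set a breakpoint where you want to debug, then right-click run.py and choose Debug, as shown in the figure:

Once it runs, execution stops at the breakpoint, as shown in the figure: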
