1. BeautifulSoup

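Official website: the BeautifulSoup official site
The BeautifulSoup library is also known as the bs4 library.
BeautifulSoup parses HTML and XML and provides access to the information in them.
In the official wording: BeautifulSoup takes whatever markup you give it and parses it into a tree. It does not fetch pages itself.
How it parses: it treats the document you hand it as a pot of soup and simmers that soup.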

2. Installing BeautifulSoup

The requests library is usually installed alongside it:

pip install beautifulsoup4
pip install requests

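3. Ways to run it

Fetching and parsing a web page:

import requests
from bs4 import BeautifulSoup

demo = requests.get("https://python123.io/ws/demo.html").text  # HTML source as a string
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())

Output: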

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Parsing a markup string:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>this is p</p>", "html.parser")
print(soup.prettify())

Output:

<p>
 this is p
</p>

Parsing markup from a file:
Source of test.html, which can be placed in any folder:

<!-- test.html -->
<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="UTF-8">
		<title>Title</title>
	</head>
	<body>
		<p class="title">
			<b>this is b</b>
		</p>
	</body>
</html>

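Parsing the file:

from bs4 import BeautifulSoup

# Replace "path/to" with the folder that holds test.html.
with open("path/to/test.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup.prettify())

Output: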

<!-- test.html -->
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Title
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    this is b
   </b>
  </p>
 </body>
</html>

4. BeautifulSoup basics

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree". As long as the input is tag-based markup, it can be parsed.

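5. Beautiful Soup parsers

Beautiful Soup can work with four parsers. Three of them parse HTML; the lxml XML parser is the one intended for XML:

| Parser | Usage | Requirement |
|---|---|---|
| bs4's HTML parser (Python's built-in html.parser) | BeautifulSoup(markup, "html.parser") | pip3 install beautifulsoup4 |
| lxml's HTML parser | BeautifulSoup(markup, "lxml") | pip3 install lxml |
| lxml's XML parser | BeautifulSoup(markup, "xml") | pip3 install lxml |
| html5lib parser | BeautifulSoup(markup, "html5lib") | pip3 install html5lib |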
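
As a rough illustration of how the parser name is passed, the sketch below tries several parsers in order and keeps the first one that is installed. The helper name make_soup and the preference order are assumptions made for this example, not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse `markup` with the first installed parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello</p>")
print(used)           # name of the parser that was actually used
print(soup.p.string)  # the text inside the first p tag
```

Because html.parser ships with Python, the loop should always find at least one working parser.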

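6. Beautiful Soup basic elements

Beautiful Soup has five basic elements:

| Basic element | Description |
|---|---|
| Tag | A tag, the most basic unit of information, opened with <> and closed with </>; format: <soup>.<tag> |
| Name | The tag's name; the name of <p>…</p> is 'p'; format: <Tag>.name |
| Attributes | The tag's attributes, organized as a dictionary; format: <Tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the string between <>…</>; format: <Tag>.string |
| Comment | The comment part of a string inside a tag; a special type, Comment |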

Getting the basic elements with Beautiful Soup

Getting tag content:

import requests
from bs4 import BeautifulSoup

demo = requests.get("https://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo, "html.parser")

print(f"soup.title: {soup.title}")  # the title tag (Tag)
print(f"soup.a:{soup.a}")  # the a tag (Tag)
print(f"soup.a.name:{soup.a.name}")  # the a tag's name (Name)
print(f"soup.a.attrs:{soup.a.attrs}")  # the a tag's attributes (Attributes)
print(f"soup.a.attrs['class']:{soup.a.attrs['class']}")  # value of the a tag's class attribute
print(f"soup.a.string:{soup.a.string}")  # the string inside the a tag
print(f"type of soup.a.string:{type(soup.a.string)}")  # type of that string

soup1 = BeautifulSoup("<b><!--this is comment--></b><p>this is not comment</p>", "html.parser")
print(f"soup1.b.string:{soup1.b.string}")
print(f"type of soup1.b.string:{type(soup1.b.string)}")

Output:

soup.title: <title>This is a python demo page</title>
soup.a:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
soup.a.name:a
soup.a.attrs:{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
soup.a.attrs['class']:['py1']
soup.a.string:Basic Python
type of soup.a.string:<class 'bs4.element.NavigableString'>
soup1.b.string:this is comment
type of soup1.b.string:<class 'bs4.element.Comment'>

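Note: when the document contains several tags with the same name, soup.<tag> returns only the first one. In the example, soup.a returns only the first a tag.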
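
To reach every matching tag rather than just the first, bs4 offers find_all(). The sketch below is only an illustration: the html string is a trimmed, hand-written stand-in for the demo page so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: a shortened copy of the demo page's two links.
html = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>'
    '.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # only the first a tag
all_links = soup.find_all("a")  # every a tag, in document order

print(first["id"])
for a in all_links:
    print(a["id"], a.string, a["href"])
```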

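7. Traversal in Beautiful Soup

Beautiful Soup treats all the tags in an HTML page as one tag tree. Traversal of that tree falls into three kinds: downward, upward, and sibling (parallel) traversal.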

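Downward traversal

| Attribute | Description |
|---|---|
| .contents | List of child nodes; all of <tag>'s children are stored in a list |
| .children | Iterator over child nodes; like .contents, used to loop over the children |
| .descendants | Iterator over descendant nodes; covers every descendant, used for looping |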

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(f"soup.head:{soup.head}")  # the head tag
print(f"soup.head.contents:{soup.head.contents}")  # child nodes of head
print(f"soup.body.contents:{soup.body.contents}")  # child nodes of body
print(f"soup.body.contents[1]:{soup.body.contents[1]}")  # body's first child tag (index 0 is '\n')

Output:

soup.head:<head><title>This is a python demo page</title></head>
soup.head.contents:[<title>This is a python demo page</title>]
soup.body.contents:['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
soup.body.contents[1]:<p class="title"><b>The demo python introduces several python courses.</b></p>

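Note: string nodes are child nodes too. The newline '\n', for example, is a child node of body (a NavigableString, not a tag), and it shows up when contents is printed, for instance in Python's bundled IDLE. When looking for the first real tag, the list index may therefore be [1] rather than [0].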
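
One way to skip those string nodes is to filter on the node type while looping over .children or .descendants. This is a minimal sketch on a small hand-written body (an assumption for the example, not the demo page itself).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample markup with newlines between the child tags.
html = '<body>\n<p class="title"><b>bold</b></p>\n<p class="course">text</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents includes the '\n' string nodes as well as the two p tags.
print(len(body.contents))

# Keep only real tags among the direct children.
child_names = [child.name for child in body.children if isinstance(child, Tag)]
print(child_names)

# .descendants walks the whole subtree, so the nested b tag appears too.
descendant_names = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_names)
```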

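Upward traversal

| Attribute | Description |
|---|---|
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestor tags, used to loop over the ancestors |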

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(f"soup.title.parent:{soup.title.parent}")  # parent of the title tag
print(f"soup.html.parent:{soup.html.parent}")  # parent of the html tag (the document itself)
print(f"soup.parent:{soup.parent}")  # parent of soup itself (None)

Output:

soup.title.parent:<head><title>This is a python demo page</title></head>
soup.html.parent:<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
soup.parent:None
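
The .parents iterator is not exercised above, so here is a short sketch of looping over it. The html string is an assumed, trimmed stand-in for the demo page; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: html > body > p > a
html = (
    '<html><head><title>demo</title></head>'
    '<body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Walk from the a tag up to the root of the tree.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)
```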

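Sibling traversal

| Attribute | Description |
|---|---|
| .next_sibling | The next sibling node in HTML document order |
| .previous_sibling | The previous sibling node in HTML document order |
| .next_siblings | Iterator over all following sibling nodes in HTML document order |
| .previous_siblings | Iterator over all preceding sibling nodes in HTML document order |

Note: sibling traversal only happens under the same parent node. Nodes that do not share a parent are not siblings.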

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(f"soup.a.next_sibling:{soup.a.next_sibling}")
print(f"soup.a.next_sibling.next_sibling:{soup.a.next_sibling.next_sibling}")
print(f"soup.a.previous_sibling:{soup.a.previous_sibling}")
print(f"soup.a.previous_sibling.previous_sibling:{soup.a.previous_sibling.previous_sibling}")

Output:

soup.a.next_sibling: and 
soup.a.next_sibling.next_sibling:<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
soup.a.previous_sibling:Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
soup.a.previous_sibling.previous_sibling:None

Note: a node reached by sibling or downward traversal may be a NavigableString rather than a Tag, so check its type before treating it as a tag.
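
A minimal sketch of that type check, looping over .next_siblings on a small hand-written paragraph (the markup is an assumption for the example):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample markup: two links separated by plain text.
html = '<p>intro <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

following = list(soup.a.next_siblings)

# Strings and tags are mixed, so record each node's type.
kinds = [type(node).__name__ for node in following]
print(kinds)

# Keep only the siblings that are real tags.
later_tags = [node for node in following if isinstance(node, Tag)]
print([tag["id"] for tag in later_tags])
```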

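8. HTML formatting and encoding

Formatting with bs4

bs4 formats HTML with the prettify() method, which puts each tag and each string on its own line and indents them by depth:
Without prettify():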

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
print(demo)

Output:

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

With prettify():

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

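Encoding in bs4

bs4 converts every HTML file or string it reads to Unicode and writes its output as UTF-8 by default. Python 3 strings are already Unicode and the interpreter defaults to UTF-8, so nothing extra is needed there; under Python 2 the encoding has to be converted explicitly.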
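
The conversion can be seen with a small sketch. It assumes the input arrives as GBK-encoded bytes (chosen here only for illustration) and uses the from_encoding argument to tell bs4 how to decode them.

```python
from bs4 import BeautifulSoup

# Assumed input: bytes in a non-UTF-8 encoding (GBK).
raw = "<p>中文</p>".encode("gbk")

soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # decoded to a Unicode string inside the tree
utf8_bytes = soup.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes.decode("utf-8"))
```

If from_encoding is left out, bs4 tries to detect the encoding on its own, which can guess wrong on short inputs.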
