Scraping Rice Variety Data with a Python Web Crawler
Preface
To prepare for the data analysis, a large amount of rice-variety information has to be collected. Python is used to scrape the data into a CSV file that Excel can open, and the records are then matched in Stata.
I. Scraping the data with Python
II. Steps
1. UA spoofing
To keep the site's anti-scraping checks from rejecting the request, the crawler first masquerades as a normal desktop browser by setting a browser User-Agent.
The code is as follows:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
}
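The headers dict only takes effect if it is actually passed to requests.get; otherwise requests sends its default "python-requests" User-Agent and the spoofing does nothing. A minimal sketch to verify the request goes through (page 1 of the Shanghai list is used here as an example):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
}

# Pass headers explicitly so the spoofed UA is sent with the request.
res = requests.get("https://www.ricedata.cn/variety/identified/uhhl_1.htm",
                   headers=headers)
print(res.status_code)  # expect 200 if the request was accepted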
2. Fetching the page and avoiding garbled text
The code is as follows:
url = "https://www.ricedata.cn/variety/identified/uhhl_%s.htm" % page
res = requests.get(url, headers=headers)   # send the spoofed UA with the request
res.encoding = res.apparent_encoding       # guess the real encoding to avoid mojibake
text = res.text
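res.apparent_encoding asks the charset detector bundled with requests to guess the encoding from the raw response bytes. This matters because many older Chinese sites serve pages in a Chinese encoding such as GB2312/GBK (an assumption about this site; the exact charset depends on the page), which decodes as mojibake under the default. A quick sketch to inspect what is detected:

import requests

res = requests.get("https://www.ricedata.cn/variety/identified/uhhl_1.htm")
print(res.encoding)           # encoding taken from the HTTP headers, often a wrong default
print(res.apparent_encoding)  # encoding guessed from the page bytes
res.encoding = res.apparent_encoding  # from here on, res.text decodes correctly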
3. Parsing the page and creating the CSV file
The code is as follows:
main_page = BeautifulSoup(text, "html.parser")
f = open("ricedatashanghai.csv", mode="a")
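Opening the file with mode="a" appends on every run and the handle is never closed explicitly. Not the author's code, but a sketch of a more robust variant: a context manager closes the file automatically, utf-8-sig lets Excel detect the encoding, newline="" is required by the csv module, and csv.writer quotes any commas inside cells:

import csv

with open("ricedatashanghai.csv", mode="a", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["variety name", "approval number"])  # hypothetical header row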
4. Locating the target content
The code is as follows:
table = main_page.find("table", attrs={"cellpadding": "2"})   # the variety table
trs = table.find_all("tr")
for tr in trs:
    lst = tr.find_all("td")
    if len(lst) != 0:            # skip rows without <td> cells (header rows)
        for td in lst:
            f.write(td.text.strip())
            f.write(",")
        f.write("\n")
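The same traversal can collect each row as a list first and write it out in one go, which avoids the trailing comma the f.write loop leaves at the end of every line. A sketch that continues from the text variable of step 2:

from bs4 import BeautifulSoup

# `text` holds the decoded HTML of one listing page (from step 2)
main_page = BeautifulSoup(text, "html.parser")
table = main_page.find("table", attrs={"cellpadding": "2"})

rows = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    if tds:  # rows made of <th> cells yield an empty list and are skipped
        rows.append([td.text.strip() for td in tds])
# each entry in rows is now one variety record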
5. Defining a function to loop over multiple pages
The listing URLs differ only in the trailing page number (uhhl_1.htm, uhhl_2.htm, ...), so once this pattern is found, the page index can be substituted into the URL template and the whole download wrapped in a function.
The code is as follows:
def down(page):
    # steps 2-4 above, with the page index substituted into the URL
    url = "https://www.ricedata.cn/variety/identified/uhhl_%s.htm" % page
    res = requests.get(url, headers=headers)
    res.encoding = res.apparent_encoding
    text = res.text
    main_page = BeautifulSoup(text, "html.parser")
    f = open("ricedatashanghai.csv", mode="a")
    table = main_page.find("table", attrs={"cellpadding": "2"})
    trs = table.find_all("tr")
    for tr in trs:
        lst = tr.find_all("td")
        if len(lst) != 0:
            for td in lst:
                f.write(td.text.strip())
                f.write(",")
            f.write("\n")
    f.close()

for page in range(1, 6):   # uhhl_1.htm ... uhhl_5.htm
    down(page)
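When looping over many pages it is worth pausing between requests so the site is not hammered. A sketch of the same loop with a politeness delay (the 1-second pause is an assumption, not from the original):

import time

for page in range(1, 6):   # uhhl_1.htm ... uhhl_5.htm
    down(page)
    time.sleep(1)  # assumed 1-second pause between requests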
Putting it all together, the complete script for the Jiangsu pages (jdsu_1.htm to jdsu_18.htm, saved to ricedatajiangsu.csv):

import requests
from bs4 import BeautifulSoup

# UA spoofing: masquerade as a normal desktop browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
}

def down(page):
    # the listing pages differ only in the trailing page number
    url = "https://www.ricedata.cn/variety/identified/jdsu_%s.htm" % page
    res = requests.get(url, headers=headers)   # send the spoofed UA with the request
    res.encoding = res.apparent_encoding       # guess the real encoding to avoid mojibake
    text = res.text
    main_page = BeautifulSoup(text, "html.parser")
    f = open("ricedatajiangsu.csv", mode="a")
    # the variety table is the one with cellpadding="2"
    table = main_page.find("table", attrs={"cellpadding": "2"})
    trs = table.find_all("tr")
    for tr in trs:
        lst = tr.find_all("td")
        if len(lst) != 0:            # skip rows without <td> cells (header rows)
            for td in lst:
                f.write(td.text.strip())
                f.write(",")
            f.write("\n")
    f.close()

for page in range(1, 19):   # jdsu_1.htm ... jdsu_18.htm
    down(page)
The same script scrapes the Shanghai list; only the URL stem, the output file, and the page range change:

url = "https://www.ricedata.cn/variety/identified/uhhl_%s.htm" % page
f = open("ricedatashanghai.csv", mode="a")
for page in range(1, 6):   # uhhl_1.htm ... uhhl_5.htm
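Since the preface mentions matching the records in Stata, a quick sanity check of the scraped CSV before export can be done with pandas. A sketch, assuming every row has the same number of cells (the column layout depends on the page, so no column names are given here):

import pandas as pd

# header=None: the scraper writes raw cells without a header row;
# the trailing comma after each row yields one empty final column.
df = pd.read_csv("ricedatashanghai.csv", header=None)
print(df.shape)
print(df.head())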
