15-Day Systematic Python Tutorial: Day 11

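By the end of Day 11 you will have covered:
1️⃣ The core models and limits of concurrent programming in Python
2️⃣ The performance advantage of coroutines in I/O-heavy workloads
3️⃣ Best practices for multi-process parallel computing
4️⃣ Techniques for debugging and tuning complex concurrent systems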

heimeiyingwang · Published 2025-04-06 10:00:00

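📅 Day 11 Detailed Study Plan: Concurrent and Parallel Programming in Python

Learning Objectives
✅ Understand Python's concurrency model (compared with Java's threads and thread pools)
✅ Master coroutine programming with asyncio (compared with Java's virtual threads)
✅ Use multiple processes to speed up CPU-bound tasks
✅ Build a high-concurrency web crawler as a hands-on project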

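1. Core Concurrency Models Compared (Java vs Python)

Feature	Java	Python	Key difference
Thread implementation	OS threads (`java.lang.Thread`)	OS threads (constrained by the GIL)	Python threads do not speed up CPU-bound work
Coroutine support	Virtual threads (Project Loom)	`asyncio` coroutines (single-threaded, asynchronous)	Python coroutines are cooperative and need explicit `async`/`await`; virtual threads run ordinary blocking code
Process parallelism	`ProcessBuilder` launches new processes	`multiprocessing` module	Python offers simpler built-in inter-process communication
Thread pool	`ExecutorService`	`concurrent.futures.ThreadPoolExecutor`	Similar interface design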
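
The last row of the table says the two thread-pool interfaces are similar. The following is a minimal sketch of that parallel, not a definitive pattern. `fake_io`, `run_in_pool`, and the 0.05 s delay are hypothetical names and values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_io(n):
    # Hypothetical blocking call; time.sleep releases the GIL while waiting
    time.sleep(0.05)
    return n * 2

def run_in_pool(values, workers=4):
    # submit() returns a Future, much like ExecutorService.submit() in Java
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fake_io, v): v for v in values}
        # Collect results as they finish, keyed by the original input
        return {futures[f]: f.result() for f in as_completed(futures)}

if __name__ == "__main__":
    print(run_in_pool([1, 2, 3]))
```

The `with` block plays the role of `shutdown()` plus `awaitTermination()` in Java: it waits for all submitted work before the pool is released.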

2. Multithreading and the GIL (1 hour)

2.1 Using Threads in Python (compared with Java)

import threading

# Thread target function (similar to a Java Runnable)
def task(n):
    print(f"Thread running: {n}")

# Start the threads (compare with Thread.start() in Java)
threads = []
for i in range(3):
    t = threading.Thread(target=task, args=(i,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # wait for the thread to finish

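GIL limitation example:

# Threads give no speedup for CPU-bound work under the GIL
import threading
import time

def count_down(n):
    while n > 0:
        n -= 1

# Single thread
start = time.perf_counter()
count_down(10**7)
print(f"single thread: {time.perf_counter() - start:.2f} s")  # about 0.3 s in the author's run

# Two threads, half the work each
t1 = threading.Thread(target=count_down, args=(5 * 10**6,))
t2 = threading.Thread(target=count_down, args=(5 * 10**6,))
start = time.perf_counter()
t1.start(); t2.start(); t1.join(); t2.join()
print(f"two threads: {time.perf_counter() - start:.2f} s")  # about 0.6 s in the author's run (slower)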
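
The countdown example is CPU-bound. For contrast, a thread that blocks on I/O releases the GIL, so threads can still overlap waiting time. The sketch below assumes `time.sleep` as a stand-in for a blocking I/O call; `wait_io` and `timed_threads` are hypothetical helper names.

```python
import threading
import time

def wait_io(seconds):
    # time.sleep stands in for a blocking I/O call; it releases the GIL
    time.sleep(seconds)

def timed_threads(n_threads, seconds):
    # Start n_threads waiters and measure the wall-clock time until all finish
    threads = [threading.Thread(target=wait_io, args=(seconds,)) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = timed_threads(4, 0.2)
    print(f"4 threads x 0.2 s of waiting took {elapsed:.2f} s")
```

If the waits overlap, the total should be close to a single wait rather than four times that, which is why threads remain a reasonable choice for I/O-bound work.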

3. Coroutines and Asynchronous Programming (1 hour)

3.1 asyncio Basics (compared with Java virtual threads)

import asyncio

async def fetch_data(url):
    print(f"Requesting: {url}")
    await asyncio.sleep(1)  # simulate waiting on I/O
    return f"response data from {url}"

async def main():
    # Run concurrently (similar to Java's CompletableFuture)
    task1 = asyncio.create_task(fetch_data("https://api/1"))
    task2 = asyncio.create_task(fetch_data("https://api/2"))
    results = await asyncio.gather(task1, task2)
    print(results)

asyncio.run(main())  # takes about 1 second in total, not 2
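
The "about 1 second, not 2" claim can be checked by timing the same requests awaited one after another and awaited together. This is a rough sketch; `fake_request`, `sequential`, `all_at_once`, and `timed` are hypothetical names, and `asyncio.sleep` again stands in for a network round trip.

```python
import asyncio
import time

async def fake_request(delay):
    # asyncio.sleep stands in for a network round trip
    await asyncio.sleep(delay)
    return delay

async def sequential(delays):
    # Each await finishes before the next request starts
    return [await fake_request(d) for d in delays]

async def all_at_once(delays):
    # gather schedules all coroutines together and preserves input order
    return await asyncio.gather(*(fake_request(d) for d in delays))

def timed(coro_fn, delays):
    # Run one strategy in a fresh event loop and measure wall-clock time
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delays))
    return result, time.perf_counter() - start

if __name__ == "__main__":
    for fn in (sequential, all_at_once):
        result, elapsed = timed(fn, [0.2, 0.2, 0.2])
        print(f"{fn.__name__}: {elapsed:.2f} s -> {result}")
```

The sequential version should take roughly the sum of the delays, while the gathered version should take roughly the longest single delay.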

3.2 Asynchronous HTTP Client (compared with Java's AsyncHttpClient)

import asyncio
import aiohttp  # third-party package: pip install aiohttp

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def crawl():
    urls = ["https://example.com", "https://example.org"]
    tasks = [fetch_page(url) for url in urls]
    pages = await asyncio.gather(*tasks)
    print(f"Fetched {len(pages)} pages")

asyncio.run(crawl())
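
In a real crawl some requests fail, and by default `asyncio.gather` raises the first exception to the caller. One possible way to keep the successful pages is `return_exceptions=True`. The sketch below uses no network: `fake_fetch` is a hypothetical stand-in in which any URL containing "bad" simulates a failure.

```python
import asyncio

async def fake_fetch(url):
    # Hypothetical fetch: URLs containing "bad" simulate a failed request
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"<html>{url}</html>"

async def crawl_tolerant(urls):
    # return_exceptions=True places exceptions in the result list
    # instead of raising the first one out of gather
    results = await asyncio.gather(
        *(fake_fetch(u) for u in urls), return_exceptions=True
    )
    pages = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    errors = {u: r for u, r in zip(urls, results) if isinstance(r, Exception)}
    return pages, errors

if __name__ == "__main__":
    pages, errors = asyncio.run(
        crawl_tolerant(["https://example.com", "https://bad.example"])
    )
    print(f"{len(pages)} ok, {len(errors)} failed")
```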

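4. Multi-Process Parallel Computing (1 hour)

4.1 Process Pool (compared with Java's ForkJoinPool)

from multiprocessing import Pool

def cpu_intensive(n):
    return sum(i*i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:  # 4 worker processes
        results = pool.map(cpu_intensive, [10**7]*4)
    print(sum(results))  # up to about 4x faster than one process on a machine with 4+ cores (each process has its own GIL)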
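
The same pattern can also be written with `concurrent.futures.ProcessPoolExecutor`, which mirrors the thread-pool interface. This is a small sketch under assumed names (`sum_squares`, `parallel_sum_squares`) and a deliberately small workload, so it checks correctness rather than measuring speedup.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_squares(n):
    # Pure CPU work; defined at module top level so worker processes can import it
    return sum(i * i for i in range(n))

def parallel_sum_squares(sizes, workers=2):
    # Each size is sent to a worker process; map() keeps results in input order
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sum_squares, sizes))

if __name__ == "__main__":
    sizes = [100_000] * 4
    # The parallel results should match a plain sequential computation
    assert parallel_sum_squares(sizes) == [sum_squares(n) for n in sizes]
    print("parallel and sequential results match")
```

The `if __name__ == "__main__":` guard matters here: on platforms that start workers by re-importing the main module, unguarded pool creation would be executed again in every child.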

4.2 Inter-Process Communication (IPC)

from multiprocessing import Process, Queue

def worker(q):
    q.put("data from child process")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # receive the data
    p.join()

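5. Hands-On Project: High-Concurrency News Crawler (1.5 hours)

5.1 Requirements

Fetch the home pages of several news sites concurrently
Extract the title and key content
Count the most frequent keywords
Support both synchronous and asynchronous modes

5.2 Core Implementation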

Async crawler core:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_news(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
            soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
            # Either lookup returns None on pages that lack the element
            title = soup.title
            body = soup.find("div", class_="content")
            return {
                "url": url,
                "title": title.text.strip() if title else "",
                "content": body.text[:100] if body else "",
            }

async def main(urls):
    tasks = [fetch_news(url) for url in urls]
    return await asyncio.gather(*tasks)

Multi-process keyword counting:

import re
from multiprocessing import Pool
from collections import Counter

def count_keywords(text):
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

if __name__ == "__main__":
    news_data = [...]  # crawl results
    with Pool() as pool:
        counters = pool.map(count_keywords, [n["content"] for n in news_data])
    total = sum(counters, Counter())
    print(total.most_common(10))
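
The final merge step can be tried without any processes. The sketch below runs the same counting function on made-up sample strings and merges the per-document counters with `Counter.update`, which gives the same totals as `sum(counters, Counter())`. `merge_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

def count_keywords(text):
    # Lowercase the text and count runs of word characters
    return Counter(re.findall(r"\w+", text.lower()))

def merge_counts(counters):
    # Counter.update adds counts in place, avoiding the intermediate
    # Counter objects that sum(counters, Counter()) creates
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    texts = ["Python async Python", "async crawler"]  # made-up sample data
    total = merge_counts(count_keywords(t) for t in texts)
    print(total.most_common(2))
```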

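6. Notes for Java Developers

Scope of the GIL
- It is a property of the CPython interpreter and serializes Python bytecode execution across native threads
- C extensions that release the GIL (such as NumPy during heavy computation) or multiprocessing can sidestep it
Asynchronous programming paradigm
- Python's async/await is explicit syntax: every suspension point is marked in the code, whereas Java's virtual threads are more transparent
- The Python event loop must be started explicitly (asyncio.run())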
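
The point about explicit event-loop management can be seen in a few lines: calling an `async def` function runs nothing by itself. This is an illustrative sketch; `greet` and `demo` are hypothetical names.

```python
import asyncio
import inspect

async def greet(name):
    await asyncio.sleep(0)  # a suspension point; control returns to the loop
    return f"hello {name}"

def demo():
    # Calling an async function only builds a coroutine object
    coro = greet("java dev")
    is_coro = inspect.iscoroutine(coro)
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop
    result = asyncio.run(coro)
    return is_coro, result

if __name__ == "__main__":
    print(demo())
```

This differs from a Java virtual thread, where the method body starts executing as ordinary code once the thread is started and no separate loop object is involved.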

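Process serialization limits

Objects passed between Python processes must be picklable.

# Instances of classes defined at module top level pickle by default;
# __reduce__ is only needed to customize how an instance is rebuilt
class Data:
    def __init__(self, value):
        self.value = value
    def __reduce__(self):
        return (self.__class__, (self.value,))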
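
Whether an object can cross a process boundary can be checked directly with the `pickle` module, since that is roughly what `multiprocessing` does to arguments and results. The sketch below assumes two hypothetical helpers, `roundtrip` and `is_picklable`, and is meant to be run as a script so that `Data` lives in the main module.

```python
import pickle

class Data:
    def __init__(self, value):
        self.value = value

def roundtrip(obj):
    # dumps/loads approximates what multiprocessing does to arguments and results
    return pickle.loads(pickle.dumps(obj))

def is_picklable(obj):
    # Report whether pickle can serialize the object at all
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

if __name__ == "__main__":
    print(roundtrip(Data(42)).value)   # a plain top-level class, no __reduce__ defined
    print(is_picklable(lambda x: x))   # lambdas cannot be looked up by name
```

Lambdas and functions defined inside other functions are the usual source of pickling errors when they are passed to `Pool.map`; moving them to module top level is the common fix.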

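7. Extension Exercises

Implement a coroutine pool that limits concurrency

import asyncio
import aiohttp

async def fetch(session, url, sem):
    async with sem:  # wait here while `concurrency` requests are already in flight
        async with session.get(url) as resp:
            return await resp.text()

async def limited_crawl(urls, concurrency=5):
    # Concurrency is capped with the standard-library asyncio.Semaphore
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, sem) for url in urls]
        return await asyncio.gather(*tasks)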
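
To see a concurrency limit actually being enforced, the sketch below uses only the standard library: an `asyncio.Semaphore` guards each job, and a counter records the peak number of jobs running at once. `limited_run` and `job` are hypothetical names, and `asyncio.sleep` stands in for a network request.

```python
import asyncio

async def limited_run(n_jobs, limit):
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def job(i):
        nonlocal active, peak
        async with sem:  # at most `limit` jobs pass this point at once
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for a network request
            active -= 1
            return i

    results = await asyncio.gather(*(job(i) for i in range(n_jobs)))
    return results, peak

if __name__ == "__main__":
    results, peak = asyncio.run(limited_run(20, 5))
    print(f"{len(results)} jobs done, peak concurrency {peak}")
```

All 20 coroutines are created up front, but the semaphore keeps the number inside the guarded section at or below the limit.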

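Combining threads and coroutines

# Run a blocking I/O call from a coroutine
import asyncio

async def run_blocking(func, *args):
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
    return await loop.run_in_executor(None, func, *args)  # None = default thread pool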
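
A usage sketch for this helper, assuming a hypothetical blocking function `blocking_read` with `time.sleep` standing in for a legacy blocking call. It uses `asyncio.get_running_loop()`, the call recommended inside a coroutine, and checks that three blocking calls overlap instead of running back to back.

```python
import asyncio
import time

def blocking_read(tag):
    # Hypothetical blocking call (for example a legacy driver); sleep stands in for it
    time.sleep(0.1)
    return f"{tag} done"

async def run_blocking(func, *args):
    # None selects the event loop's default ThreadPoolExecutor
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)

async def main():
    start = time.perf_counter()
    # Each blocking call runs in its own worker thread while the loop stays free
    results = await asyncio.gather(*(run_blocking(blocking_read, t) for t in "abc"))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    print(asyncio.run(main()))
```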

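Distributed task queue

# Using Celery, a distributed task queue (loosely comparable to Quartz in Java for background jobs)
from celery import Celery

app = Celery("tasks", broker="redis://localhost")

@app.task
def process_data(data):
    return data.upper()

process_data.delay("hello")  # queued for a worker; needs a running Redis broker and Celery worker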

