Precision Beyond Imagination: Exploring the Recognition Magic of Qwen2.5-VL

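Hi everyone, this is 烤鸭. Today's post shows how to use an open-source vision-language model for image/video recognition, using a 2B-parameter Qwen-VL model.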

烤鸭的世界我们不懂 · Published 2025-03-17 07:45:00

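Introduction

AI is moving incredibly fast lately: minor releases every few weeks and major ones every few months, faster than agile development, and most of it is open source. Who can keep up? Today I'm sharing a usage demo. Code like this is easy to find online, but between network restrictions, GPU limits, package installation, and compatibility issues, just getting it to run takes real effort.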

Environment

Windows 10

Python 3.13

GPU: 4060 Ti, 8 GB VRAM

Compute is limited, so local testing used AdaptLLM/biomed-Qwen2-VL-2B-Instruct

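Installing torch and CUDA

Install CUDA first. Just download the installer and accept the defaults.

CUDA Toolkit 12.8 Update 1 Downloads | NVIDIA Developer

Then go to the PyTorch website and pick the build that matches your system.

https://pytorch.org/get-started/locally/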

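Installing what the model needs

Hugging Face runs into all kinds of network restrictions, and hf-mirror does not carry every model, so it is easier to use the ModelScope community directly.

https://www.modelscope.cn/models/

# 1. Install ModelScope
pip install modelscope -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# 2. Install PyTorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 3. Install the other dependencies
#    (qwen-vl-utils provides the qwen_vl_utils module imported in the code below;
#     accelerate is required by device_map="auto")
pip install transformers opencv-python pillow qwen-vl-utils accelerate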

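I had some other packages installed as well. The versions can serve as a reference, though you probably do not need that many:
[Figure: screenshot of the installed package versions]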
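For comparing an environment against a known-good one, a small script along these lines could print the installed versions. This is only a sketch: the package list is an assumption based on the imports used in this post, not the author's exact environment, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as published on PyPI. This list is an assumption
# derived from the imports in this post, not the author's full environment.
PACKAGES = [
    "torch", "torchvision", "transformers", "modelscope",
    "qwen-vl-utils", "accelerate", "opencv-python", "pillow",
]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15} {version or 'NOT INSTALLED'}")
```

A `None` entry points at the same situation that surfaces later as `importlib.metadata.PackageNotFoundError`.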

Downloading the model files

modelscope download --model AdaptLLM/biomed-Qwen2-VL-2B-Instruct
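The same download can probably be done from Python with ModelScope's `snapshot_download`, which returns the local directory of the downloaded files. A minimal sketch, where `download_model` is a hypothetical wrapper name and the call is left commented out because it needs network access:

```python
def download_model(model_id, cache_dir=None):
    """Download a ModelScope model and return its local directory path."""
    # Imported lazily so this file still loads where modelscope is missing.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)


# Example (requires network access and the modelscope package):
# local_dir = download_model("AdaptLLM/biomed-Qwen2-VL-2B-Instruct")
# print(local_dir)
```

The returned path can then be passed to `from_pretrained(...)` in place of a model ID, which avoids a second download attempt at load time.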

Code

from modelscope import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

def video_rec(path: str, desc: str):
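    # Video variant: swap this in for the image message below.
    # messages = [
    #     {
    #         "role": "user",
    #         "content": [
    #             {
    #                 "type": "video",
    #                 "video": path,
    #             },
    #             {"type": "text", "text": desc},
    #         ],
    #     }
    # ]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": path,
                },
                # The prompt must be a text entry; {"type": desc} is dropped by the chat template.
                {"type": "text", "text": desc},
            ],
        }
    ]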

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )

    inputs = inputs.to("cuda")

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)

if __name__ == '__main__':
    # Video
    # path = "C:\\Users\\Administrator\\Downloads\\video_test.mp4"
    # Image
    path = "C:\\Users\\Administrator\\Downloads\\image_test.png"
    desc = "Describe this image."
    video_rec(path, desc)
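The commented-out `min_pixels` / `max_pixels` lines in the code above imply that one visual token covers roughly 28×28 pixels. Under that assumption, a rough estimate of how many visual tokens an image costs can be sketched as follows. The real preprocessing also rounds each side to a multiple of 28, which this sketch ignores, and `estimate_visual_tokens` is a hypothetical helper rather than a library function.

```python
PIXELS_PER_TOKEN = 28 * 28  # taken from the min_pixels = 256*28*28 comment above


def estimate_visual_tokens(width, height,
                           min_pixels=256 * PIXELS_PER_TOKEN,
                           max_pixels=1280 * PIXELS_PER_TOKEN):
    """Approximate visual token count after clamping the pixel budget."""
    pixels = width * height
    # Images outside the budget are scaled up or down to fit inside it.
    pixels = min(max(pixels, min_pixels), max_pixels)
    return pixels // PIXELS_PER_TOKEN


if __name__ == "__main__":
    for size in [(100, 100), (640, 480), (1920, 1080)]:
        print(size, estimate_visual_tokens(*size))
```

On a card with limited memory, lowering `max_pixels` when creating the processor is the knob this estimate is meant to inform.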

Compute is limited: running on a video reported out-of-memory, so I only tried image recognition:

Sample image:
[Figure: sample input image]

Output:

['The image shows the front view of a white electric vehicle, specifically a NIO ES6. The car is displayed in what appears to be a showroom or exhibition space. The vehicle has a sleek design with a prominent front grille and LED headlights. The NIO logo is visible on the grille, and the model name "es6" is displayed on the front bumper. In the background, there are other vehicles and some furniture, including chairs and a table with bottles on it. The overall setting suggests that this is a promotional or sales environment for the NIO ES6.']

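As you can see, the recognition is fairly accurate: it identified the car model, described the model, listed the objects in the picture, and made a guess about the surrounding environment.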
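As for the out-of-memory error on video mentioned above, one thing that might help is asking the preprocessing to sample fewer frames at a lower resolution. The sketch below assumes that `qwen_vl_utils` reads per-video `fps` and `max_pixels` keys from the message entry, as the Qwen2.5-VL README examples suggest. The default values and the `build_video_messages` name are illustrative, and whether this fits in 8 GB is untested.

```python
def build_video_messages(path, prompt, fps=1.0, max_pixels=360 * 420):
    """Build a chat message for a video with reduced sampling settings.

    fps and max_pixels are assumed to be honored by process_vision_info;
    lower values mean fewer and smaller frames, hence fewer visual tokens.
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": path,
                    "fps": fps,
                    "max_pixels": max_pixels,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    print(build_video_messages("video_test.mp4", "Describe the video."))
```

The returned list has the same shape as `messages` in `video_rec`, so it could be swapped in there without changing the rest of the function.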

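Summary

The range of AI models today is dazzling, with open-source and paid options alike, and the paid ones are not even expensive. Whatever you can think of is covered with almost no blind spots: text, audio, video, images, and more.

How should programmers, or anyone else, take advantage of this wave? To me, the point is that many things we wanted to do but could not are now simple. Some people say you can build a mini program in 3 hours with Claude + Cursor, and that is entirely plausible.

In the internet era the saying was that everyone is a product manager: you come up with the idea and a team delivers it. In the AI era I think everyone is a practitioner: if you have an idea, AI can help you implement it quickly, and simple systems no longer need a team at all.

The currently popular use of AI digital humans for livestream selling, in China or abroad, is a good point of leverage. AI takes over a lot of human work and improves efficiency.

General-purpose AI still has its limits, though. Just as with people, what the future needs is mostly specialists.

Some possible directions for the future:

Business-domain AI (understands a company's business and how its industry is developing)
AI for professional scenarios (for example, Cursor replacing development work)
AI for niche corners (for example, image recognition: it used to require grayscale conversion, denoising, and normalization, and now AI can give you the result directly)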

Other errors encountered

AssertionError: Torch not compiled with CUDA enabled

# 1. Install PyTorch (the CUDA build)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 2. Install the other dependencies
pip install transformers opencv-python pillow
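To confirm that the reinstall actually produced a CUDA-enabled build, a quick check like the following may help. It is a minimal sketch: `cuda_report` is a hypothetical helper, and it returns a neutral result when torch is not installed at all.

```python
def cuda_report():
    """Summarize whether the installed PyTorch build can use the GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_build": None,
                "cuda_available": False, "device": None}

    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        # None means a CPU-only wheel, the usual cause of this AssertionError.
        "cuda_build": torch.version.cuda,
        "cuda_available": available,
        "device": torch.cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_build` is `None`, pip most likely installed the CPU-only wheel; uninstalling torch and reinstalling with the `--index-url` shown above is the usual remedy.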

Loading hangs at Downloading shards 0%

importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes

pip install bitsandbytes -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

pip install flash-attn -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
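If installing flash-attn turns out to be more trouble than it is worth, one option is to request FlashAttention 2 only when the package is actually importable and otherwise fall back to PyTorch's built-in SDPA attention. A sketch, with `pick_attn_implementation` as a hypothetical helper name:

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" only if flash_attn is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # "sdpa" uses torch.nn.functional.scaled_dot_product_attention
    # and needs no extra package.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

The result would be passed as `attn_implementation=pick_attn_implementation()` to `from_pretrained(...)`, so the same script runs with or without flash-attn installed.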

flash-attn installation fails

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'C:\\Users\\Administrator\\AppData\\Local\\Temp\\pip-install-30bhguhc\\flash-attn_c97f9dbe48984242b2e7cf5ef51a4e62\\csrc/composable_kernel/client_example/24_grouped_conv_activation/grouped_convnd_fwd_scaleadd_scaleadd_relu/grouped_conv_fwd_scaleadd_scaleadd_relu_bf16.cpp'
HINT: This error might have occurred since this system does not have Windows Long Path support enabled. You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths

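Enabling Windows long path support

1. Via the Group Policy Editor (Windows Pro, Enterprise, and Education editions):
Press Win + R to open the Run dialog, type gpedit.msc, and press Enter.
In the Local Group Policy Editor, navigate to Computer Configuration -> Administrative Templates -> System -> Filesystem.
Double-click "Enable Win32 long paths", select "Enabled", and click OK.
2. Via the Registry Editor (all Windows editions):
Press Win + R to open the Run dialog, type regedit, and press Enter.
Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem.
Find the DWORD value named LongPathsEnabled. If it does not exist, create it manually.
Set its value to 1, then restart the computer for the change to take effect.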
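Whether the setting took effect can be checked by reading the same registry value from Python using the standard-library `winreg` module. This is a sketch only: `long_paths_enabled` is a hypothetical helper, and it returns `None` on non-Windows systems or when the value is missing.

```python
import sys


def long_paths_enabled():
    """Return True/False for LongPathsEnabled, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows

    import winreg

    try:
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            r"SYSTEM\CurrentControlSet\Control\FileSystem",
        ) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except OSError:
        return None  # key or value not present
    return value == 1


if __name__ == "__main__":
    print("LongPathsEnabled:", long_paths_enabled())
```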

Step-by-step flash-attention installation guide (CSDN blog)

References

https://qwenlm.github.io/zh/blog/qwen2.5-vl/

Qwen/Qwen2.5-VL-7B-Instruct · HF Mirror

https://blog.csdn.net/qq_39448884/article/details/123908752

https://pytorch.org/

https://blog.csdn.net/XieRuily/article/details/123670141

https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local
